Title: | Estimation in Nonprobability Sampling |
---|---|
Description: | Different inference procedures have been proposed in the literature to correct for the selection bias that non-random selection mechanisms may introduce. One class of methods corrects for selection bias by fitting a statistical model to predict the values of the units not in the sample (super-population modelling). Other studies use calibration or statistical matching (statistically matching the nonprobability and probability samples). To date, the most relevant methods are based on weighting by Propensity Score Adjustment (PSA). PSA was originally developed to construct weights by estimating response probabilities and using them in Horvitz-Thompson type estimators; it is usually applied by combining a non-probability sample with a reference probability sample to build a propensity model for the non-probability sample. Calibration can be applied afterwards to incorporate auxiliary-variable information. Propensity scores in PSA are usually estimated with logistic regression models, although machine learning classification algorithms can be used as alternatives. The package 'NonProbEst' implements several of these methods and thus provides a wide range of options for working with data coming from a non-probability sample. (A minimal workflow sketch follows this header block.) |
Authors: | Luis Castro Martín <[email protected]>, Ramón Ferri García <[email protected]> and María del Mar Rueda <[email protected]> |
Maintainer: | Luis Castro Martín <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.2.4 |
Built: | 2024-11-21 06:22:54 UTC |
Source: | https://github.com/cran/NonProbEst |
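As a quick orientation before the function reference, the following minimal workflow sketch (using the example data shipped with the package and the population figures from the examples below) illustrates the PSA-plus-calibration approach described above; the choice of covariates, weighting function and calibration variable is purely illustrative.

library(NonProbEst)

# Covariates observed in both the convenience (sampleNP) and reference (sampleP) samples
covariates = c("education_primaria", "education_secundaria", "age", "sex")

# 1. Estimate participation propensities with logistic regression (caret's "glm")
pi = propensities(sampleNP, sampleP, covariates, algorithm = "glm", smooth = FALSE)

# 2. Turn the propensities into weights, here with the Schonlau-Couper formula
w = sc_weights(pi$convenience)

# 3. Optionally calibrate the weights to a known population total (here: language)
w_cal = calib_weights(sampleNP$language, totals = 45429, initial_weights = w,
                      N = 50000, method = "raking")

# 4. Estimate a mean and a fast jackknife variance for a study variable
mean_estimation(sampleNP, w_cal, "vote_pens")
fast_jackknife_variance(sampleNP, w_cal, "vote_pens")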
Calculates the calibration weights from a disjunct matrix of covariates, a vector of population totals and a vector of initial weights.
calib_weights(Xs, totals, initial_weights, N, ...)
Xs |
Matrix of calibration variables. |
totals |
A vector containing population totals for each column (class) of the calibration variables matrix. |
initial_weights |
A vector containing the initial weights for each individual. |
N |
Integer indicating the population size. |
... |
Further arguments to be passed to the 'calib' function from the 'sampling' package. |
The function uses the 'calib' function from the 'sampling' package to estimate the g-weights, which are multiplied by the initial weights to obtain the final calibration weights. The initial weights can be calculated beforehand from the propensities with any of the implemented methods (see lee_weights, sc_weights, valliant_weights and vd_weights). The population size is used to scale the initial weights so that they are easier to calibrate.
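For orientation, the calibration step can also be reproduced directly with 'calib' from the 'sampling' package, as sketched below; note that the exact way calib_weights scales the initial weights by the population size is an assumption here, so the packaged function should be preferred.

library(sampling)

covariates = c("education_primaria", "education_secundaria", "age", "sex")
pi = propensities(sampleNP, sampleP, covariates, algorithm = "glm", smooth = FALSE)
initial_weights = sc_weights(pi$convenience)
N = 50000

# Assumed scaling of the initial weights to the population size
d = initial_weights * N / sum(initial_weights)

# g-weights from 'calib', multiplied by the (scaled) initial weights
Xs = cbind(language = sampleNP$language)
g = calib(Xs, d = d, total = 45429, method = "raking")
final_weights = g * d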
A vector with the corresponding weights.
n = nrow(sampleNP)
N = 50000
language_total = 45429
covariates = c("education_primaria", "education_secundaria", "age", "sex")
pi = propensities(sampleNP, sampleP, covariates, algorithm = "glm", smooth = FALSE)
wi = sc_weights(pi$convenience)
calib_weights(sampleNP$language, language_total, wi, N, method = "raking")
Calculates the confidence interval for the estimator considered.
confidence_interval(estimation, std_dev, confidence = 0.95)
estimation |
A numeric value specifying the point estimation. |
std_dev |
A numeric value specifying the standard deviation of the point estimation. |
confidence |
A numeric value between 0 and 1 specifying the confidence level, taken as 1 - alpha (1 - Type I error). By default, its value is 0.95. |
A vector containing the lower and upper bounds.
covariates = c("education_primaria","education_secundaria", "age", "sex") pi = propensities(sampleNP, sampleP, covariates, algorithm = "glm", smooth = FALSE) psa_weights = sc_weights(pi$convenience) N = 50000 Y_est = total_estimation(sampleNP, psa_weights, estimated_vars = "vote_pens", N = N) VY_est = fast_jackknife_variance(sampleNP, psa_weights, estimated_vars = "vote_pens") * N^2 confidence_interval(Y_est, sqrt(VY_est), confidence = 0.90)
covariates = c("education_primaria","education_secundaria", "age", "sex") pi = propensities(sampleNP, sampleP, covariates, algorithm = "glm", smooth = FALSE) psa_weights = sc_weights(pi$convenience) N = 50000 Y_est = total_estimation(sampleNP, psa_weights, estimated_vars = "vote_pens", N = N) VY_est = fast_jackknife_variance(sampleNP, psa_weights, estimated_vars = "vote_pens") * N^2 confidence_interval(Y_est, sqrt(VY_est), confidence = 0.90)
Calculates the variance of a given estimator by Leave-One-Out Jackknife (Quenouille, 1956) with the original adjusted weights.
fast_jackknife_variance(sample, weights, estimated_vars, N = NULL)
sample |
A data frame containing the sample. |
weights |
A vector containing the pre-calculated weights. |
estimated_vars |
A string vector specifying the variables whose estimators' variances are to be estimated. |
N |
Integer indicating the population size. Optional. |
The variance estimation is performed by removing one individual (and its corresponding weight) at each iteration and estimating the mean of the resulting subsample; these leave-one-out means are then plugged into the usual Jackknife formula. This procedure requires less computation time, but it may not account for the variability introduced by the weighting method.
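To make the procedure concrete, the following sketch reproduces the leave-one-out computation for a weighted mean with fixed weights; it is only an illustration of the idea (the exact centering used inside fast_jackknife_variance is an assumption), not the package's internal code.

# Leave-one-out jackknife for a weighted mean with fixed (pre-calculated) weights
loo_jackknife_variance = function(y, w) {
  n = length(y)
  theta_i = sapply(seq_len(n), function(i) {
    sum(w[-i] * y[-i]) / sum(w[-i])  # weighted mean without unit i
  })
  (n - 1) / n * sum((theta_i - mean(theta_i))^2)
}

covariates = c("education_primaria", "education_secundaria")
pi = propensities(sampleNP, sampleP, covariates)
loo_jackknife_variance(sampleNP$vote_pens, sc_weights(pi$convenience))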
A vector containing the resulting variance for each variable.
Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika, 43(3/4), 353-360.
covariates = c("education_primaria", "education_secundaria") data_propensities = propensities(sampleNP, sampleP, covariates) psa_weights = sc_weights(data_propensities$convenience) fast_jackknife_variance(sampleNP, psa_weights, c("vote_pens"), 50000)
covariates = c("education_primaria", "education_secundaria") data_propensities = propensities(sampleNP, sampleP, covariates) psa_weights = sc_weights(data_propensities$convenience) fast_jackknife_variance(sampleNP, psa_weights, c("vote_pens"), 50000)
Calculates the variance of a given estimator by Leave-One-Out Jackknife (Quenouille, 1956) with reweighting in each iteration.
generic_jackknife_variance(sample, estimator, N = NULL)
sample |
Data frame containing the non-probabilistic sample. |
estimator |
Function that, given a sample as a parameter, returns an estimation. |
N |
Integer indicating the population size. Optional. |
The estimation of the variance requires recalculating the estimate in each iteration, which may involve repeating the weighting adjustments and therefore increases computation time. In exchange, the estimated variance is expected to capture both the variability of the weighting adjustments and the variability of the estimator.
The resulting variance.
Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika, 43(3/4), 353-360.
covariates = c("education_primaria", "education_secundaria", "age", "sex", "language") if (is.numeric(sampleNP$vote_gen)) sampleNP$vote_gen = factor(sampleNP$vote_gen, c(0, 1), c('F', 'T')) vote_gen_estimator = function(sample) { model_based(sample, population, covariates, "vote_gen", positive_label = 'T', algorithm = 'glmnet') } generic_jackknife_variance(sampleNP, vote_gen_estimator)
covariates = c("education_primaria", "education_secundaria", "age", "sex", "language") if (is.numeric(sampleNP$vote_gen)) sampleNP$vote_gen = factor(sampleNP$vote_gen, c(0, 1), c('F', 'T')) vote_gen_estimator = function(sample) { model_based(sample, population, covariates, "vote_gen", positive_label = 'T', algorithm = 'glmnet') } generic_jackknife_variance(sampleNP, vote_gen_estimator)
Calculates the variance of PSA by Leave-One-Out Jackknife (Quenouille, 1956) with reweighting in each iteration.
jackknife_variance(
  estimated_vars,
  convenience_sample,
  reference_sample,
  covariates,
  N = NULL,
  algorithm = "glm",
  smooth = FALSE,
  proc = NULL,
  trControl = trainControl(classProbs = TRUE),
  weighting.func = "sc",
  g = 5,
  calib = FALSE,
  calib_vars = NULL,
  totals = NULL,
  args.calib = NULL,
  ...
)
estimated_vars |
A string vector specifying the variables whose estimators' variances are to be estimated. |
convenience_sample |
Data frame containing the non-probabilistic sample. |
reference_sample |
Data frame containing the probabilistic sample. |
covariates |
String vector specifying the common variables to use for training. |
N |
Integer indicating the population size. Optional. |
algorithm |
A string specifying which classification or regression model to use (same as caret's method). By default, its value is "glm" (logistic regression). |
smooth |
A logical value; if TRUE, propensity estimates pi_i are smoothed by applying the formula (1000*pi_i + 0.5)/1001. |
proc |
A string or vector of strings specifying whether any of the data preprocessing techniques available in the train function from the 'caret' package should be applied to the data prior to the propensity estimation. By default, its value is NULL and no preprocessing is applied. |
trControl |
A trainControl specifying the computational nuances of the train function. |
weighting.func |
A string specifying which function should be used to compute weights from the propensity scores; the available options correspond to the weighting functions implemented in the package (see lee_weights, sc_weights, valliant_weights and vd_weights). By default, its value is "sc". |
g |
If the selected weighting function uses propensity stratification (see lee_weights and vd_weights), the number of strata to use. By default, its value is 5. |
calib |
A logical value; if TRUE, PSA weights are used as initial weights for calibration. By default, its value is FALSE. |
calib_vars |
A string or vector of strings specifying the variables to be used for calibration. By default, its value is NULL. |
totals |
A vector containing population totals for each column (class) of the calibration variables matrix. Ignored if calib is FALSE. |
args.calib |
A list containing further arguments to be passed to the calib_weights function. |
... |
Further parameters to be passed to the train function. |
The estimation of the variance requires recalculating the estimate in each iteration, which may involve repeating the weighting adjustments and therefore increases computation time. In exchange, the estimated variance is expected to capture both the variability of the weighting adjustments and the variability of the estimator.
The resulting variance.
Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika, 43(3/4), 353-360.
# A simple example without calibration and default parameters
covariates = c("education_primaria", "education_secundaria")
jackknife_variance("vote_pens", sampleNP, sampleP, covariates)

# An example with linear calibration and default parameters
covariates = c("education_primaria", "education_secundaria")
calib_vars = c("age", "sex")
totals = c(2544377, 24284)
jackknife_variance("vote_pens", sampleNP, sampleP, covariates, calib = TRUE,
  calib_vars = calib_vars, totals = totals, args.calib = list(method = "linear"))
Computes weights from propensity estimates using the propensity stratification design weights averaging formula introduced in Lee (2006) and Lee and Valliant (2009).
lee_weights(convenience_propensities, reference_propensities, g = 5)
convenience_propensities |
A vector with the propensities associated with the convenience sample. |
reference_propensities |
A vector with the propensities associated with the reference sample. |
g |
The number of strata to use; by default, its value is 5. |
The function takes the vector of propensities and calculates the weights to be applied in the Horvitz-Thompson estimator using the formula that can be found in Lee (2006) and Lee and Valliant (2009). The vector of propensities is divided into g strata (ideally five, according to Cochran, 1968), aiming to group individuals with similar propensities in each stratum. After the stratification, the weight for an individual i is calculated as
w_i = (n_r^g(i) / n_r) / (n_v^g(i) / n_v),
where g(i) represents the stratum to which i belongs, n_r^g(i) and n_v^g(i) are the number of individuals in stratum g(i) from the reference and the convenience sample respectively, and n_r and n_v are the sample sizes of the reference and the convenience sample respectively.
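The stratified formula can be sketched as follows, assuming quantile-based strata over the pooled propensities (the actual stratification rule used by lee_weights is an assumption here); the packaged function should be preferred.

# Illustrative propensity-stratification weights in the style of Lee (2006)
lee_weights_sketch = function(pc, pr, g = 5) {
  breaks = quantile(c(pc, pr), probs = seq(0, 1, length.out = g + 1))  # assumed strata
  strat_c = cut(pc, breaks, include.lowest = TRUE, labels = FALSE)
  strat_r = cut(pr, breaks, include.lowest = TRUE, labels = FALSE)
  n_r = length(pr); n_v = length(pc)
  sapply(strat_c, function(s) (sum(strat_r == s) / n_r) / (sum(strat_c == s) / n_v))
}

covariates = c("education_primaria", "education_secundaria")
pi = propensities(sampleNP, sampleP, covariates)
lee_weights_sketch(pi$convenience, pi$reference)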
A vector with the corresponding weights.
Lee, S. (2006). Propensity score adjustment as a weighting scheme for volunteer panel web surveys. Journal of official statistics, 22(2), 329.
Lee, S., & Valliant, R. (2009). Estimation for volunteer panel web surveys using propensity score adjustment and calibration adjustment. Sociological Methods & Research, 37(3), 319-343.
Cochran, W. G. (1968). The Effectiveness of Adjustment by Subclassification in Removing Bias in Observational Studies. Biometrics, 24(2), 295-313
covariates = c("education_primaria", "education_secundaria") data_propensities = propensities(sampleNP, sampleP, covariates) lee_weights(data_propensities$convenience, data_propensities$reference)
covariates = c("education_primaria", "education_secundaria") data_propensities = propensities(sampleNP, sampleP, covariates) lee_weights(data_propensities$convenience, data_propensities$reference)
It uses the matching method introduced by Rivers (2007). The idea is to model the relationship between y_k and x_k using the convenience sample, in order to predict y_k for the reference sample. The total can then be estimated with the 'total_estimation' function.
matching(
  convenience_sample,
  reference_sample,
  covariates,
  estimated_var,
  positive_label = NULL,
  algorithm = "glm",
  proc = NULL,
  ...
)
convenience_sample |
Data frame containing the non-probabilistic sample. |
reference_sample |
Data frame containing the probabilistic sample. |
covariates |
String vector specifying the common variables to use for training. |
estimated_var |
String specifying the variable to estimate. |
positive_label |
String specifying the label to be considered positive if the estimated variable is categorical. Leave it as the default NULL otherwise. |
algorithm |
A string specifying which classification or regression model to use (same as caret's method). |
proc |
A string or vector of strings specifying whether any of the data preprocessing techniques available in the train function from the 'caret' package should be applied to the data prior to model training. By default, its value is NULL and no preprocessing is applied. |
... |
Further parameters to be passed to the train function. |
Training of the models is done via the 'caret' package. The algorithm specified in the algorithm argument must match one of the names in the list of algorithms supported by 'caret'. If the estimated variable is categorical, probabilities are returned.
A vector containing the estimated responses for the reference sample.
Rivers, D. (2007). Sampling for Web Surveys. Presented in Joint Statistical Meetings, Salt Lake City, UT.
# Simple example with default parameters
N = 50000
covariates = c("education_primaria", "education_secundaria")
if (is.numeric(sampleNP$vote_gen))
  sampleNP$vote_gen = factor(sampleNP$vote_gen, c(0, 1), c('F', 'T'))
estimated_votes = data.frame(
  vote_gen = matching(sampleNP, sampleP, covariates, "vote_gen", 'T')
)
total_estimation(estimated_votes, N / nrow(estimated_votes), c("vote_gen"), N)
Estimates the means for the specified variables measured in a sample given some pre-calculated weights.
mean_estimation(sample, weights, estimated_vars, N = NULL)
sample |
A data frame containing the sample with the variables for which the means are to be calculated. |
weights |
A vector of pre-calculated weights. |
estimated_vars |
String vector specifying the variables in the sample to be estimated. |
N |
An integer specifying the population size (optional). |
A vector with the corresponding estimations.
covariates = c("education_primaria", "education_secundaria") data_propensities = propensities(sampleNP, sampleP, covariates) psa_weights = sc_weights(data_propensities$convenience) mean_estimation(sampleNP, psa_weights, c("vote_pens"))
covariates = c("education_primaria", "education_secundaria") data_propensities = propensities(sampleNP, sampleP, covariates) psa_weights = sc_weights(data_propensities$convenience) mean_estimation(sampleNP, psa_weights, c("vote_pens"))
It uses the model assisted estimator introduced by Särndal et al. (1992).
model_assisted(
  sample_data,
  weights,
  full_data,
  covariates,
  estimated_var,
  estimate_mean = FALSE,
  positive_label = NULL,
  algorithm = "glm",
  proc = NULL,
  ...
)
sample_data |
Data frame containing the sample. |
weights |
Vector containing the sample weights. |
full_data |
Data frame containing all the individuals contained in the population. |
covariates |
String vector specifying the common variables to use for training. |
estimated_var |
String specifying the variable to estimate. |
estimate_mean |
A logical value; if TRUE, the mean estimation is returned; otherwise (the default), the total estimation is returned. |
positive_label |
String specifying the label to be considered positive if the estimated variable is categorical. Leave it as the default NULL otherwise. |
algorithm |
A string specifying which classification or regression model to use (same as caret's method). |
proc |
A string or vector of strings specifying whether any of the data preprocessing techniques available in the train function from the 'caret' package should be applied to the data prior to model training. By default, its value is NULL and no preprocessing is applied. |
... |
Further parameters to be passed to the train function. |
Training of the models is done via the 'caret' package. The algorithm specified in the algorithm argument must match one of the names in the list of algorithms supported by 'caret'.
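For reference, the usual model-assisted (difference) form of the total estimator combines population-wide predictions with weighted sample residuals; whether model_assisted follows exactly this form is an assumption based on the Särndal et al. (1992) reference, so the sketch below is only meant to fix ideas.

# Generic model-assisted (difference) estimator of a total:
# predictions for the whole population plus weighted sample residuals
model_assisted_sketch = function(y_s, yhat_s, yhat_U, w_s) {
  sum(yhat_U) + sum(w_s * (y_s - yhat_s))
}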
The population total estimation (or mean if specified by the 'estimate_mean' parameter).
Särndal, C. E., Swensson, B., & Wretman, J. (1992). Model assisted survey sampling. Springer, New York.
# Simple example
covariates = c("education_primaria", "education_secundaria", "age", "sex", "language")
if (is.numeric(sampleNP$vote_gen))
  sampleNP$vote_gen = factor(sampleNP$vote_gen, c(0, 1), c('F', 'T'))
model_assisted(sampleNP, nrow(population) / nrow(sampleNP), population,
  covariates, "vote_gen", positive_label = 'T', algorithm = 'glmnet')
It uses the model-based estimator: the population total is estimated by adding the sample responses to the predicted responses of the individuals not contained in the sample. See, for example, Valliant et al. (2000).
model_based(
  sample_data,
  full_data,
  covariates,
  estimated_var,
  estimate_mean = FALSE,
  positive_label = NULL,
  algorithm = "glm",
  proc = NULL,
  ...
)
sample_data |
Data frame containing the sample. |
full_data |
Data frame containing all the individuals contained in the population. |
covariates |
String vector specifying the common variables to use for training. |
estimated_var |
String specifying the variable to estimate. |
estimate_mean |
A logical value; if TRUE, the mean estimation is returned; otherwise (the default), the total estimation is returned. |
positive_label |
String specifying the label to be considered positive if the estimated variable is categorical. Leave it as the default NULL otherwise. |
algorithm |
A string specifying which classification or regression model to use (same as caret's method). |
proc |
A string or vector of strings specifying whether any of the data preprocessing techniques available in the train function from the 'caret' package should be applied to the data prior to model training. By default, its value is NULL and no preprocessing is applied. |
... |
Further parameters to be passed to the train function. |
Training of the models is done via the 'caret' package. The algorithm specified in the algorithm argument must match one of the names in the list of algorithms supported by 'caret'.
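The prediction estimator described above reduces to a one-line computation once the model predictions are available; the sketch below simply restates it.

# Model-based (prediction) estimator of a total: observed responses of the sampled
# units plus model predictions for the units outside the sample
model_based_sketch = function(y_sample, yhat_nonsample) {
  sum(y_sample) + sum(yhat_nonsample)
}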
The population total estimation (or mean if specified by the 'estimate_mean' parameter).
Valliant, R., Dorfman, A. H., & Royall, R. M. (2000) Finite population sampling and inference: a prediction approach. Wiley, New York.
# Simple example
covariates = c("education_primaria", "education_secundaria", "age", "sex", "language")
if (is.numeric(sampleNP$vote_gen))
  sampleNP$vote_gen = factor(sampleNP$vote_gen, c(0, 1), c('F', 'T'))
model_based(sampleNP, population, covariates, "vote_gen",
  positive_label = 'T', algorithm = 'glmnet')
It uses the model-calibrated estimator introduced by Wu and Sitter (2001).
model_calibrated(
  sample_data,
  weights,
  full_data,
  covariates,
  estimated_var,
  estimate_mean = FALSE,
  positive_label = NULL,
  algorithm = "glm",
  proc = NULL,
  ...
)
sample_data |
Data frame containing the sample. |
weights |
Vector containing the sample weights. |
full_data |
Data frame containing all the individuals contained in the population. |
covariates |
String vector specifying the common variables to use for training. |
estimated_var |
String specifying the variable to estimate. |
estimate_mean |
A logical value; if TRUE, the mean estimation is returned; otherwise (the default), the total estimation is returned. |
positive_label |
String specifying the label to be considered positive if the estimated variable is categorical. Leave it as the default NULL otherwise. |
algorithm |
A string specifying which classification or regression model to use (same as caret's method). |
proc |
A string or vector of strings specifying whether any of the data preprocessing techniques available in the train function from the 'caret' package should be applied to the data prior to model training. By default, its value is NULL and no preprocessing is applied. |
... |
Further parameters to be passed to the train function. |
Training of the models is done via the 'caret' package. The algorithm specified in the algorithm argument must match one of the names in the list of algorithms supported by 'caret'.
The population total estimation (or mean if specified by the 'estimate_mean' parameter).
Wu, C., & Sitter, R. R. (2001). A model-calibration approach to using complete auxiliary information from survey data. Journal of the American Statistical Association, 96(453), 185-193.
# Simple example
covariates = c("education_primaria", "education_secundaria", "age", "sex", "language")
if (is.numeric(sampleNP$vote_gen))
  sampleNP$vote_gen = factor(sampleNP$vote_gen, c(0, 1), c('F', 'T'))
model_calibrated(sampleNP, nrow(population) / nrow(sampleNP), population,
  covariates, "vote_gen", positive_label = 'T', algorithm = 'glmnet')
A dataset of a simulated fictitious population of 50,000 individuals. Further details on the generation of the dataset can be found in Ferri-García and Rueda (2018). The variables present in the dataset are the following:
education_primaria. A binary variable indicating if the highest academic level achieved by the individual is Primary Education.
education_secundaria. A binary variable indicating if the highest academic level achieved by the individual is Secondary Education.
education_terciaria. A binary variable indicating if the highest academic level achieved by the individual is Tertiary Education.
age. A numeric variable, with values ranging from 18 to 100, indicating the age of the individual.
sex. A binary variable indicating if the individual is a man.
language. A binary variable indicating if the individual is a native.
population
An object of class data.frame
with 50000 rows and 6 columns.
Ferri-García, R., & Rueda, M. (2018). Efficiency of propensity score adjustment and calibration on the estimation from non-probabilistic online surveys. SORT-Statistics and Operations Research Transactions, 1(2), 159-162.
Estimates the proportion of a given class or classes for the specified variables measured in a sample given some pre-calculated weights.
prop_estimation(sample, weights, estimated_vars, class, N = NULL)
sample |
A data frame containing the sample with the variables for which the means are to be calculated. |
weights |
A vector of pre-calculated weights. |
estimated_vars |
String vector specifying the variables in the sample to be estimated. |
class |
String vector specifying which class (value) proportion is to be estimated in each variable. The i-th element of this vector corresponds to the class whose proportion is to be estimated for the i-th variable of the vector specified in estimated_vars. |
N |
An integer specifying the population size (optional). |
A vector with the corresponding estimations.
covariates = c("education_primaria", "education_secundaria") data_propensities = propensities(sampleNP, sampleP, covariates) psa_weights = sc_weights(data_propensities$convenience) #The function will estimate the proportion of individuals #with the 0 value in vote_pens and the 1 value in vote_pir prop_estimation(sampleNP, psa_weights, c("vote_pens", "vote_pir"), c(0, 1))
covariates = c("education_primaria", "education_secundaria") data_propensities = propensities(sampleNP, sampleP, covariates) psa_weights = sc_weights(data_propensities$convenience) #The function will estimate the proportion of individuals #with the 0 value in vote_pens and the 1 value in vote_pir prop_estimation(sampleNP, psa_weights, c("vote_pens", "vote_pir"), c(0, 1))
Given a convenience sample and a reference sample, computes estimates of the propensity to participate in the convenience sample, based on classification models to be selected by the user.
propensities(
  convenience_sample,
  reference_sample,
  covariates,
  algorithm = "glm",
  smooth = FALSE,
  proc = NULL,
  trControl = trainControl(classProbs = TRUE),
  ...
)
convenience_sample |
Data frame containing the non-probabilistic sample. |
reference_sample |
Data frame containing the probabilistic sample. |
covariates |
String vector specifying the common variables to use for training. |
algorithm |
A string specifying which classification or regression model to use (same as caret's method). |
smooth |
A logical value; if TRUE, propensity estimates pi_i are smoothed by applying the formula (1000*pi_i + 0.5)/1001. |
proc |
A string or vector of strings specifying whether any of the data preprocessing techniques available in the train function from the 'caret' package should be applied to the data prior to the propensity estimation. By default, its value is NULL and no preprocessing is applied. |
trControl |
A trainControl specifying the computational nuances of the train function. |
... |
Further parameters to be passed to the train function. |
Training of the propensity estimation models is done via the 'caret' package. The algorithm specified in the algorithm argument must match one of the names in the list of algorithms supported by 'caret'. Case weights are used to balance classes (for models that accept them).
The smoothing formula for propensities avoids mathematical irregularities in the calculation of the sample weights when an estimated propensity is 0 or 1. Further details can be found in Buskirk and Kolenikov (2015).
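Conceptually, the propensity estimation amounts to stacking both samples, labelling membership in the convenience sample and fitting a classifier on the shared covariates; the simplified sketch below uses plain glm instead of 'caret' and omits the class-balancing case weights, so it only illustrates the idea rather than the package internals.

# Simplified propensity estimation: combine the samples, label convenience
# membership, fit a logistic regression and keep the predicted probabilities
covariates = c("education_primaria", "education_secundaria")
combined = rbind(
  cbind(sampleNP[, covariates], in_convenience = 1),
  cbind(sampleP[, covariates], in_convenience = 0)
)
fit = glm(in_convenience ~ ., data = combined, family = binomial)
pi_hat = predict(fit, type = "response")
pi_convenience = pi_hat[combined$in_convenience == 1]
pi_reference = pi_hat[combined$in_convenience == 0]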
A list containing 'convenience' propensities and 'reference' propensities.
Buskirk, T. D., & Kolenikov, S. (2015). Finding respondents in the forest: A comparison of logistic regression and random forest models for response propensity weighting and stratification. Survey Methods: Insights from the Field, 17.
# Simple example with default parameters
covariates = c("education_primaria", "education_secundaria")
propensities(sampleNP, sampleP, covariates)
A dataset of 1000 individuals extracted from the subpopulation of individuals with internet access in a simulated fictitious population of 50,000 individuals. This sample attempts to reproduce a case of nonprobability sampling with selection bias, as there are important differences between the potentially covered population, the covered population and the full target population. Further details on the generation of the dataset can be found in Ferri-García and Rueda (2018). The variables present in the dataset are the following:
vote_gen. A binary variable indicating if the individual vote preferences are for Party 1. This variable is related to gender.
vote_pens. A binary variable indicating if the individual vote preferences are for Party 2. This variable is related to age.
vote_pir. A binary variable indicating if the individual vote preferences are for Party 3. This variable is related to age and internet access.
education_primaria. A binary variable indicating if the highest academic level achieved by the individual is Primary Education.
education_secundaria. A binary variable indicating if the highest academic level achieved by the individual is Secondary Education.
education_terciaria. A binary variable indicating if the highest academic level achieved by the individual is Tertiary Education.
age. A numeric variable, with values ranging from 18 to 100, indicating the age of the individual.
sex. A binary variable indicating if the individual is a man.
language. A binary variable indicating if the individual is a native.
sampleNP
An object of class data.frame
with 1000 rows and 9 columns.
Ferri-García, R., & Rueda, M. (2018). Efficiency of propensity score adjustment and calibration on the estimation from non-probabilistic online surveys. SORT-Statistics and Operations Research Transactions, 1(2), 159-162.
A dataset of 500 individuals extracted with simple random sampling from a simulated fictitious population of 50,000 individuals. Further details on the generation of the dataset can be found in Ferri-García and Rueda (2018). The variables present in the dataset are the following:
education_primaria. A binary variable indicating if the highest academic level achieved by the individual is Primary Education.
education_secundaria. A binary variable indicating if the highest academic level achieved by the individual is Secondary Education.
education_terciaria. A binary variable indicating if the highest academic level achieved by the individual is Tertiary Education.
age. A numeric variable, with values ranging from 18 to 100, indicating the age of the individual.
sex. A binary variable indicating if the individual is a man.
sampleP
An object of class data.frame
with 500 rows and 5 columns.
Ferri-García, R., & Rueda, M. (2018). Efficiency of propensity score adjustment and calibration on the estimation from non-probabilistic online surveys. SORT-Statistics and Operations Research Transactions, 1(2), 159-162.
Computes weights from propensity estimates using the (1 - pi_i)/pi_i formula introduced in Schonlau and Couper (2017).
sc_weights(propensities)
propensities |
A vector with the propensities associated with the elements of the convenience sample. |
The function takes the vector of propensities and calculates the weights to be applied in the Hajek estimator using the formula that can be found in Schonlau and Couper (2017). For an individual i, the weight is calculated as w_i = (1 - pi_i) / pi_i.
A vector with the corresponding weights.
Schonlau, M., & Couper, M. P. (2017). Options for conducting web surveys. Statistical Science, 32(2), 279-292.
covariates = c("education_primaria", "education_secundaria") data_propensities = propensities(sampleNP, sampleP, covariates) sc_weights(data_propensities$convenience)
covariates = c("education_primaria", "education_secundaria") data_propensities = propensities(sampleNP, sampleP, covariates) sc_weights(data_propensities$convenience)
Estimates the population totals for the specified variables measured in a sample given some pre-calculated weights.
total_estimation(sample, weights, estimated_vars, N)
sample |
A data frame containing the sample with the variables for which the estimated population totals are to be calculated. |
weights |
A vector of pre-calculated weights. |
estimated_vars |
String vector specifying the variables in the sample to be estimated. |
N |
An integer specifying the population size. |
A vector with the corresponding estimations.
covariates = c("education_primaria", "education_secundaria") data_propensities = propensities(sampleNP, sampleP, covariates) psa_weights = sc_weights(data_propensities$convenience) total_estimation(sampleNP, psa_weights, c("vote_pens"), 50000)
covariates = c("education_primaria", "education_secundaria") data_propensities = propensities(sampleNP, sampleP, covariates) psa_weights = sc_weights(data_propensities$convenience) total_estimation(sampleNP, psa_weights, c("vote_pens"), 50000)
Computes weights from propensity estimates using the 1/pi_i formula introduced in Valliant (2019).
valliant_weights(propensities)
propensities |
A vector with the propensities associated with the elements of the convenience sample. |
The function takes the vector of propensities and calculates the weights to be applied in the Hajek estimator using the formula that can be found in Valliant (2019). For an individual i, the weight is calculated as w_i = 1 / pi_i.
A vector with the corresponding weights.
Valliant, R. (2019). Comparing Alternatives for Estimation from Nonprobability Samples. Journal of Survey Statistics and Methodology, smz003, https://doi.org/10.1093/jssam/smz003
covariates = c("education_primaria", "education_secundaria") data_propensities = propensities(sampleNP, sampleP, covariates) valliant_weights(data_propensities$convenience)
covariates = c("education_primaria", "education_secundaria") data_propensities = propensities(sampleNP, sampleP, covariates) valliant_weights(data_propensities$convenience)
Computes weights from propensity estimates using the propensity stratification 1/p_i averaging formula introduced in Valliant and Dever (2011).
vd_weights(convenience_propensities, reference_propensities, g = 5)
convenience_propensities |
A vector with the propensities associated with the convenience sample. |
reference_propensities |
A vector with the propensities associated with the reference sample. |
g |
The number of strata to use; by default, its value is 5. |
The function takes the vector of propensities and calculates the weights to be applied in the Horvitz-Thompson estimator using the formula that can be found in Valliant and Dever (2011). The vector of propensities is divided into g strata (ideally five, according to Cochran, 1968), aiming to group individuals with similar propensities in each stratum. After the stratification, the weight for an individual i is calculated by averaging the inverse propensities over its stratum,
w_i = (1 / n^g(i)) * sum over the units k in stratum g(i) of 1 / pi_k,
where g(i) represents the stratum to which i belongs and n^g(i) is the number of individuals in stratum g(i).
A vector with the corresponding weights.
Valliant, R., & Dever, J. A. (2011). Estimating propensity adjustments for volunteer web surveys. Sociological Methods & Research, 40(1), 105-137.
Cochran, W. G. (1968). The Effectiveness of Adjustment by Subclassification in Removing Bias in Observational Studies. Biometrics, 24(2), 295-313
covariates = c("education_primaria", "education_secundaria") data_propensities = propensities(sampleNP, sampleP, covariates) vd_weights(data_propensities$convenience, data_propensities$reference)
covariates = c("education_primaria", "education_secundaria") data_propensities = propensities(sampleNP, sampleP, covariates) vd_weights(data_propensities$convenience, data_propensities$reference)