The randomized response technique (RRT) is used to estimate or test prevalence rates of sensitive attributes. Due to their sensitive nature, such attributes typically elicit socially desirable responding in direct questioning techniques, leading to underestimation of—and wrong inferences about—population prevalences. The RRT protects individual anonymity by adding random noise to single answers, thus overcoming response biases (Warner, 1965).
Due to the additional noise, RRT analyses typically require very large sample sizes (Ulrich, Schröter, Striegel, & Simon, 2012). This can be remedied by means of sequential analysis. The function curtailed_rrt() computes the required maximum sample size, decision thresholds, and resulting exact error rates for a curtailed sampling design. The function asn() computes the expected sample size for a specific value of the sensitive prevalence (or vice versa) in a given curtailed sampling plan. The function curtailed_rrt_result() computes the decision, the subsequent prevalence estimate, its variance, and a confidence interval based on empirical data. To use curtailed_rrt(), asn(), and curtailed_rrt_result(), please install the R software environment. For details on curtailed sampling plans for randomized response models (RRMs), refer to Reiber, Schnuerch, and Ulrich (2020).
To make the functions available, you need to download, unzip, and run the script files curtailed_rrt.R and curtailed_rrt_result.R. The scripts contain a number of helper functions (some of which will be loaded invisibly) and the main functions curtailed_rrt(), asn(), and curtailed_rrt_result(). For reasons of computational efficiency, some of the helper functions have been implemented directly in C++. Therefore, in order to use the functions, three script files are required: curtailed_rrt.R, curtailed_rrt_result.R, and curtailed_rrt_helper.cpp.
Make sure that all files are placed in your current working directory (which can be set to a specific location via setwd("specify/the/path")). To execute the scripts and make the functions available, use the following commands in your script:
source("curtailed_rrt.R")
source("curtailed_rrt_result.R")
Sourcing the scripts may take a moment while the C++ code is compiled and the required packages are loaded. The functions require the following additional packages:
Rcpp
ggplot2
gridExtra
dplyr
If the packages have not yet been installed, the script will install them automatically (this will, however, be indicated by a warning). In case of problems with installing the required packages, please install and load them manually using the following code:
install.packages("package_name")
library(package_name)
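
As a convenience, the manual step can be preceded by a check of which required packages are still missing. This is only a sketch: the helper name missing_packages() is hypothetical and not part of the provided scripts, and it relies solely on base R's requireNamespace().

```r
# Hypothetical helper (not part of curtailed_rrt.R): return the names of
# those packages in `pkgs` that cannot be found on this system.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1),
                      quietly = TRUE, USE.NAMES = FALSE)
  pkgs[!installed]
}

required <- c("Rcpp", "ggplot2", "gridExtra", "dplyr")
to_install <- missing_packages(required)
print(to_install)

# Uncomment to install whatever is missing and then load all four packages:
# if (length(to_install) > 0) install.packages(to_install)
# invisible(lapply(required, library, character.only = TRUE))
```

The installation lines are left commented out so that the check itself has no side effects.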
curtailed_rrt() accepts/requires the following arguments:
str(formals(curtailed_rrt))
Dotted pair list of 8
$ pi0 : symbol
$ pi1 : symbol
$ alpha: num 0.05
$ power: NULL
$ nMax : NULL
$ model: symbol
$ p : symbol
$ q : NULL
Arguments | |
---|---|
pi0, pi1 | expected prevalence of the sensitive attribute under the null (pi0) and the alternative hypothesis (pi1). |
alpha | (upper bound) Type I error probability of the test. The default is 0.05. |
power | (lower bound) power (1 – Type II error probability) of the test. If NULL (default), the resulting power for the specified maximum number of observations is computed. |
nMax | maximum number of observations required by the test. If NULL (default), the number of observations will be computed based on the other parameters. |
model | a character string indicating the used RRT model. Must be one of ‘Warner’ (original Warner model), ‘UQM’ (Unrelated Question Model), or ‘CWM’ (Crosswise Model). |
p | randomization probability. |
q | prevalence of the neutral attribute in the Unrelated Question Model. If model is one of {Warner, CWM}, the argument is ignored. |
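
Because each respondent answers a randomized question, the probability of an affirmative response differs from the sensitive prevalence itself. The following sketch uses the textbook response probabilities of the three models. The helper name rrt_lambda() is hypothetical, and whether curtailed_rrt() parameterizes the models in exactly this way internally is an assumption.

```r
# Hypothetical helper: probability of an affirmative response given the
# sensitive prevalence pi, the randomization probability p, and (UQM only)
# the neutral prevalence q.
rrt_lambda <- function(pi, model = c("Warner", "UQM", "CWM"), p, q = NULL) {
  model <- match.arg(model)
  if (model == "UQM") {
    if (is.null(q)) stop("q is required for the Unrelated Question Model")
    p * pi + (1 - p) * q
  } else {
    # Warner and Crosswise Model share the same response probability
    p * pi + (1 - p) * (1 - pi)
  }
}

rrt_lambda(0.05, "UQM", p = 0.75, q = 0.7)  # 0.2125
rrt_lambda(0.10, "CWM", p = 2/3)            # 0.3667 (rounded)
```

Under these assumptions, hypotheses about the prevalence translate directly into hypotheses about a binomial success probability.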
Either power or nMax must be specified in the function call. If power is specified, the function computes the required maximum sample size and threshold values of the sequential procedure for the required error probabilities. Threshold values will be given in terms of number of affirmative responses (c) required to accept \(\mathcal{H}_1\) as well as the number of negative responses (\(N_{max} - c + 1\)) required to accept \(\mathcal{H}_0\).
Note that the probability distribution of the observed data is discrete (i.e., a binomial distribution); hence, the specified error rates (alpha and 1 – power) typically cannot be satisfied exactly. However, rather than relying on a normal approximation, curtailed_rrt() computes the smallest possible \(N_{max}\) and \(c\) such that the resulting curtailed procedure has exact error probabilities closest to, but never larger than, the specified probabilities.
This computation is based on a numerical search algorithm that searches for the optimal parameters in a specified interval. By default, this interval is defined as \([1; 1,000,000]\). If the required maximum sample size lies outside this interval, the function will return NaN for the relevant parameters. The range of the search space can be adjusted in the curtailed_rrt.R script, but as this will hardly be of practical interest, it was not included as an argument in the main function.
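
To illustrate the idea of such a search, the sketch below runs a plain linear search over exact binomial tail probabilities. It is not the algorithm implemented in curtailed_rrt.R: the function name find_plan(), the search limit n_limit, and the use of affirmative-response probabilities (here the UQM values 0.2125 and 0.2875 for prevalences of .05 and .15 with p = .75 and q = .7) are assumptions made for illustration, and the result is not guaranteed to coincide with the output of curtailed_rrt().

```r
# Hypothetical linear search: smallest n (and matching threshold) whose
# exact binomial error rates respect the requested alpha and power.
find_plan <- function(lambda0, lambda1, alpha = 0.05, power = 0.9,
                      n_limit = 10000) {
  for (n in seq_len(n_limit)) {
    # smallest threshold with P(X >= crit | lambda0) <= alpha
    crit <- qbinom(1 - alpha, n, lambda0) + 1
    pow <- pbinom(crit - 1, n, lambda1, lower.tail = FALSE)
    if (pow >= power) {
      return(c(nMax = n, c = crit,
               alpha = pbinom(crit - 1, n, lambda0, lower.tail = FALSE),
               power = pow))
    }
  }
  # mimic the documented behaviour when nothing is found in the interval
  c(nMax = NaN, c = NaN, alpha = NaN, power = NaN)
}

plan <- find_plan(lambda0 = 0.2125, lambda1 = 0.2875)
print(round(plan, 3))
```

A linear scan is easy to read but slow for large sample sizes, which is one plausible reason for using a more efficient search in practice.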
curtailed_rrt() returns a list of class c_rrt with the following components:
Component | |
---|---|
model | the RRT model specified in the function call (will be one of {Warner’s Model; Crosswise Model; Unrelated Question Model}) |
rrt.parameters | a named vector containing the randomization probability p and (if applicable) the neutral prevalence q. |
hypotheses | a named vector containing the expected prevalence pi of the sensitive attribute under the null and the alternative hypothesis. |
nMax | the maximum number of observations required, either provided by the user or computed from the input. |
thresholds | a named vector containing the number of affirmative responses (\(c\)) required to accept the alternative hypothesis, and the number of negative responses (\(N_{max} - c + 1\)) required to accept the null hypothesis. |
parameters | a named vector denoting the exact Type I error probability as well as the power of the curtailed procedure. |
In addition to the main function, the script defines generic print and plot functions for objects of class c_rrt. Thus, printing the output of curtailed_rrt() will return a user-friendly overview of the results. See the following examples for details on the function:
Example 1: Compute nMax and c to reach certain level of power (in UQM)
ex1_uqm <- curtailed_rrt(pi0 = .05, pi1 = .15, alpha = .05, power = .9, nMax = NULL,
model = "UQM", p = .75, q = .7)
ex1_uqm # same as print(ex1_uqm)
Curtailed RRT Design: Unrelated Question Model
Hypotheses on sensitive prevalence:
H0: pi = 0.05
H1: pi = 0.15
Randomization probability p = 0.75
Neutral prevalence q = 0.7
Maximum number of observations: N = 290
Decision thresholds:
H1: No of affirmative responses H0: No of negative responses
74 217
Resulting error rates:
Type I error Power
0.046 0.901
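
The reported error rates can be cross-checked in base R. Curtailment only stops sampling once the decision of the corresponding fixed-sample test with \(N_{max}\) observations is already determined, so the error rates are binomial tail probabilities. The sketch assumes the standard UQM response probability \(p\pi + (1-p)q\); the variable names are illustrative only.

```r
# Cross-check of Example 1, assuming the UQM response probability
# lambda = p * pi + (1 - p) * q.
p <- 0.75
q <- 0.7
lambda0 <- p * 0.05 + (1 - p) * q   # response probability under H0
lambda1 <- p * 0.15 + (1 - p) * q   # response probability under H1
nMax  <- 290
c_yes <- 74                         # affirmative responses needed for H1

# P(at least c_yes affirmative responses in nMax observations)
type1 <- pbinom(c_yes - 1, nMax, lambda0, lower.tail = FALSE)
pow   <- pbinom(c_yes - 1, nMax, lambda1, lower.tail = FALSE)
print(round(c(type1 = type1, power = pow), 3))
```

If the assumption holds, the two values should be close to the Type I error and power printed above.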
Example 2: Compute power for a given nMax (in CWM)
ex2_cwm <- curtailed_rrt(pi0 = .1, pi1 = .3, alpha = .05, power = NULL, nMax = 400,
model = "CWM", p = 2/3)
ex2_cwm
Curtailed RRT Design: Crosswise Model
Hypotheses on sensitive prevalence:
H0: pi = 0.1
H1: pi = 0.3
Randomization probability p = 0.67
Maximum number of observations: N = 400
Decision thresholds:
H1: No of affirmative responses H0: No of negative responses
164 237
Resulting error rates:
Type I error Power
0.041 0.839
The object returned by curtailed_rrt() can also be passed on to the plot() function. This will, by default, return the Operating-Characteristic (OC) and the Average-Sample-Number (ASN) function. When plotting an object returned by curtailed_rrt(), the function accepts two additional arguments:
str(formals(plot.c_rrt))
Dotted pair list of 3
$ x : symbol
$ asn: logi TRUE
$ oc : logi TRUE
Both oc and asn are TRUE by default. By setting one of {oc; asn} to FALSE, the function will only plot the other one. Note that specifying one argument as TRUE without explicitly setting the other to FALSE will always return both plots.
Example 3: Plot OC and ASN function for Example 1
plot(ex1_uqm)
Example 4: Plot either OC or ASN function for Example 2
plot(ex2_cwm, oc = TRUE) # will return both
plot(ex2_cwm, asn = FALSE) # only return OC function
plot(ex2_cwm, oc = FALSE) # only return ASN function
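
For readers who want to see what the OC function represents, the following sketch computes the probability of accepting the null hypothesis as a function of the true prevalence for the plan of Example 2. It assumes the standard Crosswise Model response probability \(p\pi + (1-p)(1-\pi)\); the function name oc_cwm() is hypothetical and the code does not reproduce the plot method itself.

```r
# Hypothetical OC function for Example 2 (CWM, p = 2/3, nMax = 400, c = 164):
# probability that fewer than c_yes affirmative responses occur in nMax
# observations, i.e. that H0 is accepted.
oc_cwm <- function(pi, p = 2/3, nMax = 400, c_yes = 164) {
  lambda <- p * pi + (1 - p) * (1 - pi)
  pbinom(c_yes - 1, nMax, lambda)
}

pi_grid <- seq(0, 0.5, by = 0.1)
print(data.frame(pi = pi_grid, accept_H0 = round(oc_cwm(pi_grid), 3)))

# Quick base-R view of the curve:
# plot(pi_grid, oc_cwm(pi_grid), type = "b", xlab = "pi", ylab = "P(accept H0)")
```

At the hypothesized prevalences, the curve should take values near 1 minus the Type I error and 1 minus the power, respectively.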
asn() accepts/requires the following arguments:
str(formals(asn))
Dotted pair list of 3
$ plan: symbol
$ pi : NULL
$ asn : NULL
Arguments | |
---|---|
plan | a list of class c_rrt returned from curtailed_rrt() . |
pi | the sensitive prevalence for which the expected sample size in the given sampling plan is computed. Must be in [0,1]. The default is NULL. |
asn | the expected sample size (average sample number) for which the corresponding prevalence of the sensitive attribute is computed. The default is NULL. |
The function requires the argument plan, which must be an object of class c_rrt (as returned from curtailed_rrt()). If both pi and asn are NULL (default), the function returns the maximum expected sample size and the corresponding prevalence of the sensitive attribute (\(\pi\)) for the given sampling plan. If pi is specified, the function will compute the expected sample size when pi is the true prevalence. If asn is specified, the function computes the prevalence for which asn denotes the expected sample size. Note that only one of {pi; asn} can be specified per function call (asn() will return an error otherwise).
Example 5: Compute maximum ASN for Example 1
asn(ex1_uqm)
The maximum average sample number is 278.53 for pi = 0.081.
Example 6: Compute ASN for a specific \(\pi\) in Example 1
asn(ex1_uqm, pi = .3)
The maximum average sample number is 278.53 for pi = 0.081.
The average sample number is 185 for pi = 0.3.
Example 7: Compute \(\pi\) associated with a specific ASN in Example 1
asn(ex1_uqm, asn = 200)
The maximum average sample number is 278.53 for pi = 0.081.
The average sample number is 200 for pi = 0.26.
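
The expected sample size of a curtailed plan can also be sketched from first principles: sampling stops either at the \(c\)-th affirmative response or at the \((N_{max} - c + 1)\)-th negative response, and both stopping times follow truncated negative binomial distributions. The function name asn_sketch() is hypothetical, the UQM response probability \(p\pi + (1-p)q\) is assumed, and this is not necessarily how asn() is implemented.

```r
# Hypothetical ASN computation for a curtailed plan, given the probability
# lambda of an affirmative response.
asn_sketch <- function(lambda, c_yes, nMax) {
  r_no <- nMax - c_yes + 1
  n_h1 <- c_yes:nMax   # possible stopping points when H1 is accepted
  n_h0 <- r_no:nMax    # possible stopping points when H0 is accepted
  # dnbinom(x, size, prob): probability of x failures before the size-th success
  p_h1 <- dnbinom(n_h1 - c_yes, size = c_yes, prob = lambda)
  p_h0 <- dnbinom(n_h0 - r_no, size = r_no, prob = 1 - lambda)
  sum(n_h1 * p_h1) + sum(n_h0 * p_h0)
}

lambda_uqm <- function(pi, p = 0.75, q = 0.7) p * pi + (1 - p) * q

print(round(asn_sketch(lambda_uqm(0.30), c_yes = 74, nMax = 290), 2))
print(round(asn_sketch(lambda_uqm(0.26), c_yes = 74, nMax = 290), 2))
```

Under these assumptions, the two values should be close to the expected sample sizes reported in Examples 6 and 7.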
curtailed_rrt_result() accepts/requires the following arguments:
str(formals(curtailed_rrt_result))
Dotted pair list of 4
$ plan : symbol
$ data : NULL
$ N : NULL
$ conf.level: num 0.95
Arguments | |
---|---|
plan | a list of class c_rrt returned from curtailed_rrt() . |
data | a data vector containing 0 and 1 denoting failures and successes. If NULL (default), the computations will be based on N. |
N | a vector of length 2 containing the observed number of successes and failures. If NULL (default), the computations will be based on data. |
conf.level | confidence level of the Clopper-Pearson interval. The default is 0.95. |
Either data or N must be specified in the function call. The decision and the subsequent estimate, including its standard error and confidence interval, can be computed from either specification. For a plottable output object, data must be specified.
If N is specified, exactly one decision bound must have been reached. Calculations based on N are not possible if fewer or more observations than necessary have been collected. If more observations than necessary were collected, provide the data argument instead.
If data is specified, at least one decision bound must have been reached. Calculations based on data are not possible if fewer observations than necessary have been collected. If more observations than necessary have been collected, the results will be based on the data available when the first bound was reached. In this case, curtailed_rrt_result() returns a warning.
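
The stopping rule described above can be sketched as follows. The helper first_bound() is hypothetical and only illustrates how the first bound is located in a 0/1 data vector; it is not the code used by curtailed_rrt_result().

```r
# Hypothetical helper: locate the observation at which the first decision
# bound of a curtailed plan is reached.
first_bound <- function(data, c_yes, nMax) {
  r_no <- nMax - c_yes + 1           # negative responses needed for H0
  yes <- cumsum(data == 1)
  no  <- cumsum(data == 0)
  hit <- which(yes >= c_yes | no >= r_no)
  if (length(hit) == 0) stop("no decision bound has been reached yet")
  n <- hit[1]
  list(yes = yes[n], no = no[n], total = n,
       decision = if (yes[n] >= c_yes) "rej" else "acc",
       surplus = length(data) - n)   # observations collected after stopping
}

# Deterministic toy data: 96 negative, then 74 affirmative, then 130 surplus
dat <- c(rep(0, 96), rep(1, 74), rep(0, 130))
str(first_bound(dat, c_yes = 74, nMax = 290))
```

In this toy vector the bound of 74 affirmative responses is reached at observation 170, so the remaining 130 observations would be ignored.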
curtailed_rrt_result() returns a list of class c_rrt_result with the following components:
Component | |
---|---|
model | the RRT model specified in the function call (will be one of {Warner’s Model; Crosswise Model; Unrelated Question Model}) |
rrt.parameters | a named vector containing the randomization probability p and (if applicable) the neutral prevalence q. |
hypotheses | a named vector containing the expected prevalence pi of the sensitive attribute under the null and the alternative hypothesis. |
nMax | the maximum number of observations required. |
thresholds | a named vector containing the number of affirmative responses (\(c\)) required to accept the alternative hypothesis, and the number of negative responses (\(N_{max} - c + 1\)) required to accept the null hypothesis. |
parameters | a named vector denoting the exact Type I error probability as well as the power of the curtailed procedure. |
data | a data frame containing the cumulative sum of successes, failures and total responses. NULL if data was not specified. |
N | a named vector containing the number of successes, the number of failures and the total number of responses when reaching the first bound. |
decision | a character string, either ‘rej’, if the horizontal bound was reached, indicating that H1 should be accepted, or ‘acc’, if the vertical bound was reached, indicating that H0 should be accepted. |
estimate | subsequent prevalence estimate. |
se | standard error of the prevalence estimate. |
ci | Clopper-Pearson confidence interval of the subsequent prevalence estimate. |
cl | confidence level of Clopper-Pearson confidence interval. |
In addition to the main function, the script curtailed_rrt_result.R defines generic print and plot functions for objects of class c_rrt_result. Thus, printing the output of curtailed_rrt_result() will return a user-friendly overview of the results. See the following examples for details on the function:
Example 8: Compute results based on raw data (using the sampling plan created in Example 1)
dat <- rbinom(300,1,0.4)
ex5_res_dat <- curtailed_rrt_result(plan = ex1_uqm, data = dat)
Warning in curtailed_rrt_result(plan = ex1_uqm, data = dat): More than necessary data collected.
Results are based on the data when the first bound was reached.
ex5_res_dat
Curtailed RRT Design: Unrelated Question Model
Hypotheses on sensitive prevalence:
H0: pi = 0.05
H1: pi = 0.15
Randomization probability p = 0.75
Neutral prevalence q = 0.7
Maximum number of observations: N = 290
Decision thresholds:
H1: No of affirmative responses H0: No of negative responses
74 217
Resulting error rates:
Type I error Power
0.046 0.901
Observed responses when reaching the first bound:
Yes No Total
74 96 170
Decision:
Bound c of 74 affirmative responses was reached. Reject H0!
Subsequent estimate: 0.343
Clopper-Pearson 95 % confidence interval: 0.249 to 0.447
Note the warning indicating that dat contains more data than necessary. The results are based on the data available at the moment when the first bound (here c) was reached.
Example 9: Compute results based on N (using the sampling plan created in Example 1)
exN <- c(74,109)
ex6_res_N <- curtailed_rrt_result(plan = ex1_uqm, N = exN, conf.level = .90)
ex6_res_N
Curtailed RRT Design: Unrelated Question Model
Hypotheses on sensitive prevalence:
H0: pi = 0.05
H1: pi = 0.15
Randomization probability p = 0.75
Neutral prevalence q = 0.7
Maximum number of observations: N = 290
Decision thresholds:
H1: No of affirmative responses H0: No of negative responses
74 217
Resulting error rates:
Type I error Power
0.046 0.901
Observed responses when reaching the first bound:
Yes No Total
74 109 183
Decision:
Bound c of 74 affirmative responses was reached. Reject H0!
Subsequent estimate: 0.301
Clopper-Pearson 90 % confidence interval: 0.224 to 0.387
Note that the confidence level was changed from the default (95%) to 90%.
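
The estimates printed in Examples 8 and 9 are numerically consistent with a common estimator for data that stop at a fixed number of affirmative responses, \((x - 1)/(n - 1)\), followed by the UQM back-transformation \((\hat{\lambda} - (1-p)q)/p\). Whether curtailed_rrt_result() computes the estimate in exactly this way is an assumption (see Reiber et al., 2020, for the actual derivation); the function name uqm_estimate() is hypothetical.

```r
# Hypothetical reconstruction of the prevalence estimate when sampling
# stopped at the bound for affirmative responses (UQM).
uqm_estimate <- function(yes, no, p, q) {
  n <- yes + no
  lambda_hat <- (yes - 1) / (n - 1)  # assumed estimator of P("yes")
  (lambda_hat - (1 - p) * q) / p     # back-transform to the prevalence
}

round(uqm_estimate(74, 96,  p = 0.75, q = 0.7), 3)  # 0.343, as in Example 8
round(uqm_estimate(74, 109, p = 0.75, q = 0.7), 3)  # 0.301, as in Example 9
```

The simple proportion 74/183 would instead give about 0.306 in Example 9, which does not match the printed estimate.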
The object returned by curtailed_rrt_result() can also be passed on to the plot() function. However, a plot can only be returned if the object was created using the data argument.
Example 10: Plot the results created in Example 8 (using raw data)
plot(ex5_res_dat)
Example 11: Plot the results created in Example 9 (using N)
plot(ex6_res_N) # will return an error
Error in plot.c_rrt_result(ex6_res_N): c_rrt_result object does not contain data to be plotted.
curtailed_rrt.R, curtailed_rrt_result.R, and curtailed_rrt_helper.cpp may be used for non-commercial purposes free of charge. Although considerable effort was put into developing and testing the functions provided herein, there is no warranty whatsoever. Please refer to Reiber et al. (2020) for further information on the RRT and curtailed sampling. We are grateful for comments, questions, or suggestions. Please address communication concerning the R and C++ scripts provided herein to Fabiola Reiber or Martin Schnuerch.