Curtailed RRT

Reiber, F., Schnuerch, M. & Ulrich, R. (2020). Improving the efficiency of surveys with randomized response models: A sequential approach based on curtailed sampling. Psychological Methods. Advance Online Publication. [doi][OSF]

Download R Script (Curtailed_rrt.R)

Introduction

The randomized response technique (RRT) is used to estimate or test prevalence rates of sensitive attributes. Due to their sensitive nature, such attributes typically elicit socially desirable responding in direct questioning techniques, leading to underestimation of—and wrong inferences about—population prevalences. The RRT protects individual anonymity by adding random noise to single answers, thus overcoming response biases (Warner, 1965).
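
As a minimal illustration of how prevalence can still be estimated from noisy answers, the following sketch simulates Warner's (1965) design in base R. It assumes the standard formulation in which each respondent answers the sensitive statement with probability p and its negation otherwise, so that the probability of a "yes" is λ = pπ + (1 − p)(1 − π). All variable names and simulated values are illustrative and are not part of the scripts described here.

```r
# Sketch of Warner's design (assumed standard formulation, simulated data)
set.seed(1)
n       <- 10000   # number of respondents (illustrative)
pi_true <- 0.20    # true prevalence of the sensitive attribute (illustrative)
p       <- 0.75    # randomization probability

carrier         <- rbinom(n, 1, pi_true)  # 1 = respondent carries the attribute
asked_sensitive <- rbinom(n, 1, p)        # 1 = sensitive statement, 0 = its negation

# A truthful "yes" (1) means agreement with whichever statement was drawn
yes <- ifelse(asked_sensitive == 1, carrier, 1 - carrier)

lambda_hat <- mean(yes)                           # observed proportion of "yes"
pi_hat     <- (lambda_hat - (1 - p)) / (2 * p - 1)  # invert lambda = p*pi + (1-p)*(1-pi)
pi_hat
```

No single answer reveals a respondent's status, yet the estimate recovers the prevalence at the cost of a larger sampling variance than direct questioning would have.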

Due to the additional noise, RRT analyses typically require very large sample sizes (Ulrich, Schröter, Striegel, & Simon, 2012). This can be remedied by means of sequential analysis. The function curtailed_rrt() computes the required maximum sample size, decision thresholds, and resulting exact error rates for a curtailed sampling design. The function asn() computes the expected sample size for a specific value of the sensitive prevalence (or vice versa) in a given curtailed sampling plan. The function curtailed_rrt_result() computes the decision, subsequent prevalence estimate, variance, and confidence interval based on empirical data. To use curtailed_rrt(), asn(), and curtailed_rrt_result(), please install the R software environment. For details on curtailed sampling plans for randomized response models (RRMs), refer to Reiber, Schnuerch, and Ulrich (2020).

General set up

To make the functions available, you need to download, unzip, and run the script files curtailed_rrt.R and curtailed_rrt_result.R. The scripts contain a number of helper functions (some of which will be loaded invisibly) and the main functions curtailed_rrt(), asn(), and curtailed_rrt_result(). For reasons of computational efficiency, some of the helper functions have been implemented directly in C++. Therefore, three script files are required to use the functions:

The R script curtailed_rrt.R
The R script curtailed_rrt_result.R
The C++ script curtailed_rrt_helpers.cpp

Make sure that all files are placed in your current working directory (which can be set to a specific location via setwd("specify/the/path")). To execute the scripts and make the functions available, use the following commands in your script:

source("curtailed_rrt.R")
source("curtailed_rrt_result.R")

Sourcing the scripts may take a moment while the C++ code is compiled and the required packages are loaded. The functions require the following additional packages:

Rcpp
ggplot2
gridExtra
dplyr

If the packages have not yet been installed, the script will install them automatically (this will be indicated by a warning, however). In case of problems with installing the required packages, please install and load them manually using the following code:

install.packages("package_name")
library(package_name)
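
To check which of the four packages are missing before installing anything, a sketch along the following lines may help. The helper missing_packages() is a hypothetical name introduced here for illustration; it is not part of the provided scripts.

```r
# Packages listed above as required by the scripts
required <- c("Rcpp", "ggplot2", "gridExtra", "dplyr")

# Hypothetical helper: return the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

to_install <- missing_packages(required)
to_install

# Uncomment to install whatever is missing and then attach all packages:
# if (length(to_install) > 0) install.packages(to_install)
# invisible(lapply(required, library, character.only = TRUE))
```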

Function curtailed_rrt()

Input

curtailed_rrt() accepts/requires the following arguments:

str(formals(curtailed_rrt))

Dotted pair list of 8
 $ pi0  : symbol 
 $ pi1  : symbol 
 $ alpha: num 0.05
 $ power: NULL
 $ nMax : NULL
 $ model: symbol 
 $ p    : symbol 
 $ q    : NULL

Arguments
pi0, pi1	expected prevalence of the sensitive attribute under the null (pi0) and the alternative hypothesis (pi1).
alpha	(upper bound) Type I error probability of the test. The default is 0.05.
power	(lower bound) power (1 – Type II error probability) of the test. If NULL (default), the resulting power for the specified maximum number of observations is computed.
nMax	maximum number of observations required by the test. If NULL (default), the number of observations will be computed based on the other parameters.
model	a character string indicating the used RRT model. Must be one of ‘Warner’ (original Warner model), ‘UQM’ (Unrelated Question Model), or ‘CWM’ (Crosswise Model).
p	randomization probability.
q	prevalence of the neutral attribute in the Unrelated Question Model. If model is one of {Warner, CWM}, the argument is ignored.

Either power or nMax must be specified in the function call. If power is specified, the function computes the required maximum sample size and threshold values of the sequential procedure for the required error probabilities. Threshold values will be given in terms of number of affirmative responses (c) required to accept \(\mathcal{H}_1\) as well as the number of negative responses (\(N_{max} - c + 1\)) required to accept \(\mathcal{H}_0\).

Note that the probability distribution of the observed data is discrete (i.e., a binomial distribution); hence, the specified error rates (alpha and 1 – power) typically cannot be satisfied exactly. However, rather than relying on a normal approximation, curtailed_rrt() computes the smallest possible \(N_{max}\) and \(c\) such that the resulting curtailed procedure has exact error probabilities closest to, but never larger than, the specified probabilities.
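
The exact error rates of a given plan can be checked by hand with the binomial distribution. The sketch below does this for the plan reported in Example 1 below (\(N_{max} = 290\), \(c = 74\)). It assumes the standard Unrelated Question Model response probability λ = pπ + (1 − p)q, which is not taken from the script itself; lambda_uqm() is a hypothetical helper name.

```r
# Assumed UQM response model (standard form, not copied from curtailed_rrt.R):
# probability of an affirmative response for sensitive prevalence pi
lambda_uqm <- function(pi, p, q) p * pi + (1 - p) * q

p <- 0.75; q <- 0.7
n_max <- 290   # maximum number of observations (Example 1)
c_h1  <- 74    # affirmative responses required to accept H1 (Example 1)

lambda0 <- lambda_uqm(0.05, p, q)  # response probability under H0
lambda1 <- lambda_uqm(0.15, p, q)  # response probability under H1

# Curtailment only stops sampling once the decision can no longer change,
# so the error rates equal those of the fixed-N binomial test that accepts
# H1 when at least c_h1 of n_max responses are affirmative.
type1 <- pbinom(c_h1 - 1, n_max, lambda0, lower.tail = FALSE)
power <- pbinom(c_h1 - 1, n_max, lambda1, lower.tail = FALSE)
round(c(type1 = type1, power = power), 3)
```

The two values can be compared with the "Resulting error rates" printed for Example 1.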

This computation is based on a numerical search algorithm which searches for the optimal parameters in a specified interval. By default, this interval is defined as \([1; 1,000,000]\). If the required maximum sample size is outside of this interval, the function will return NaN for the relevant parameters. The range of the search space can be adjusted in the curtailed_rrt.R script, but as this will hardly be of practical interest, it was not included as an argument in the main function.
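
To make the idea of such a search concrete, here is a deliberately simple linear search in base R. It is a sketch only: the algorithm in curtailed_rrt.R is not reproduced here and may proceed differently, so the plan found by this sketch need not coincide with the one returned by curtailed_rrt(). The function find_plan() and its argument n_limit are hypothetical names, and the response probabilities are those of the UQM under the assumption λ = pπ + (1 − p)q.

```r
# Hypothetical brute-force search: smallest N for which some threshold c
# keeps the exact Type I error <= alpha and the exact power >= power
find_plan <- function(lambda0, lambda1, alpha = 0.05, power = 0.9, n_limit = 1000) {
  for (n in seq_len(n_limit)) {
    # Smallest c whose exact Type I error does not exceed alpha
    c_h1 <- qbinom(1 - alpha, n, lambda0) + 1
    while (pbinom(c_h1 - 1, n, lambda0, lower.tail = FALSE) > alpha) {
      c_h1 <- c_h1 + 1  # guard against rounding at the boundary
    }
    pow <- pbinom(c_h1 - 1, n, lambda1, lower.tail = FALSE)
    if (pow >= power) {
      return(list(nMax = n, c_h1 = c_h1,
                  type1 = pbinom(c_h1 - 1, n, lambda0, lower.tail = FALSE),
                  power = pow))
    }
  }
  NULL  # nothing found within the search interval
}

# UQM with p = .75, q = .7: lambda = 0.2125 under pi0 = .05, 0.2875 under pi1 = .15
plan <- find_plan(lambda0 = 0.2125, lambda1 = 0.2875)
str(plan)
```

Because the binomial distribution is discrete, exact power is not monotone in N, which is one reason a dedicated search is needed instead of a closed-form sample size formula.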

Output

curtailed_rrt() returns a list of class c_rrt with the following components:

Component
model	the RRT model specified in the function call (will be one of {Warner’s Model; Crosswise Model; Unrelated Question Model})
rrt.parameters	a named vector containing the randomization probability p and (if applicable) the neutral prevalence q.
hypotheses	a named vector containing the expected prevalence pi of the sensitive attribute under the null and the alternative hypothesis.
nMax	the maximum number of observations required, either provided by the user or computed from the input.
thresholds	a named vector containing the number of affirmative responses (\(c\)) required to accept the alternative hypothesis, and the number of negative responses (\(N_{max} - c + 1)\) required to accept the null hypothesis.
parameters	a named vector denoting the exact Type I error probability as well as the power of the curtailed procedure.

Printing

In addition to the main function, the script defines generic print and plot functions for objects of class c_rrt. Thus, printing the output of curtailed_rrt() will return a user-friendly overview of the results. See the following examples for details on the function:

Example 1: Compute nMax and c to reach certain level of power (in UQM)

ex1_uqm <- curtailed_rrt(pi0 = .05, pi1 = .15, alpha = .05, power = .9, nMax = NULL,
                         model = "UQM", p = .75, q = .7)
ex1_uqm # same as print(ex1_uqm)


Curtailed RRT Design: Unrelated Question Model

Hypotheses on sensitive prevalence:
H0: pi = 0.05
H1: pi = 0.15
Randomization probability p = 0.75
Neutral prevalence q = 0.7
Maximum number of observations: N =  290
Decision thresholds:
H1: No of affirmative responses    H0: No of negative responses 
                             74                             217 
Resulting error rates:
Type I error        Power 
       0.046        0.901

Example 2: Compute power for a given nMax (in CWM)

ex2_cwm <- curtailed_rrt(pi0 = .1, pi1 = .3, alpha = .05, power = NULL, nMax = 400,
                         model = "CWM", p = 2/3)
ex2_cwm


Curtailed RRT Design: Crosswise Model

Hypotheses on sensitive prevalence:
H0: pi = 0.1
H1: pi = 0.3
Randomization probability p = 0.67
Maximum number of observations: N =  400
Decision thresholds:
H1: No of affirmative responses    H0: No of negative responses 
                            164                             237 
Resulting error rates:
Type I error        Power 
       0.041        0.839

Plotting

The object returned by curtailed_rrt() can also be passed on to the plot() function. This will, by default, return the Operating-Characteristic (OC) and the Average-Sample-Number (ASN) function. When plotting an object returned by curtailed_rrt(), the function accepts two additional arguments:

str(formals(plot.c_rrt))

Dotted pair list of 3
 $ x  : symbol 
 $ asn: logi TRUE
 $ oc : logi TRUE

Both oc and asn are TRUE by default. By setting one of {oc; asn} to FALSE, the function will only plot the other one. Note that specifying one argument as TRUE without explicitly setting the other as FALSE will always return both plots.
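
For readers who want to see what the OC function represents, the sketch below computes the probability of accepting \(\mathcal{H}_0\) as a function of the true prevalence for the plan from Example 1. It assumes the UQM response probability λ = pπ + (1 − p)q and is not the plotting code of the script; oc_curve() is a hypothetical name.

```r
# Hypothetical helper: probability of accepting H0 for true prevalence pi
# in a curtailed UQM plan with maximum sample size n_max and threshold c_h1
oc_curve <- function(pi, n_max, c_h1, p, q) {
  lambda <- p * pi + (1 - p) * q   # assumed UQM response probability
  pbinom(c_h1 - 1, n_max, lambda)  # fewer than c_h1 affirmative responses
}

pi_grid <- seq(0, 0.4, by = 0.05)
oc <- oc_curve(pi_grid, n_max = 290, c_h1 = 74, p = 0.75, q = 0.7)
data.frame(pi = pi_grid, accept_H0 = round(oc, 3))
```

At pi0 the curve equals one minus the Type I error, and at pi1 it equals one minus the power.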

Example 3: Plot OC and ASN function for Example 1

plot(ex1_uqm)

Example 4: Plot either OC or ASN function for Example 2

plot(ex2_cwm, oc = TRUE) # will return both

plot(ex2_cwm, asn = FALSE) # only return OC function

plot(ex2_cwm, oc = FALSE) # only return ASN function

Function asn()

asn() accepts/requires the following arguments:

str(formals(asn))

Dotted pair list of 3
 $ plan: symbol 
 $ pi  : NULL
 $ asn : NULL

Arguments
plan	a list of class `c_rrt` returned from `curtailed_rrt()`.
pi	the sensitive prevalence for which the expected sample size in the given sampling plan is computed. Must be in [0,1]. The default is NULL.
asn	the expected sample size (average sample number) for which the corresponding prevalence of the sensitive attribute is computed. The default is NULL.

The function requires the argument plan, which must be an object of class c_rrt (as returned from curtailed_rrt()). If both pi and asn are NULL (default), the function returns the maximum expected sample size and the corresponding prevalence of the sensitive attribute (\(\pi\)) for the given sampling plan. If pi is specified, the function will compute the expected sample size when pi is the true prevalence. If asn is specified, the function computes the prevalence for which asn denotes the expected sample size. Note that only one of {pi; asn} can be specified per function call (asn() will return an error otherwise).
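
The expected sample size of a curtailed plan can also be written down directly: sampling stops either at the c-th affirmative response or at the (\(N_{max} - c + 1\))-th negative response, and both stopping times follow truncated negative binomial distributions. The sketch below applies this to the Example 1 plan. It assumes the UQM response probability λ = pπ + (1 − p)q and is not the implementation used by asn(); asn_curtailed() is a hypothetical name.

```r
# Hypothetical helper: expected sample size of a curtailed UQM plan
asn_curtailed <- function(pi, n_max, c_h1, p, q) {
  lambda <- p * pi + (1 - p) * q  # assumed UQM response probability
  r_h0   <- n_max - c_h1 + 1      # negative responses required to accept H0
  x <- 0:(r_h0 - 1)               # "no" answers seen before the c-th "yes"
  y <- 0:(c_h1 - 1)               # "yes" answers seen before the r-th "no"
  sum((c_h1 + x) * dnbinom(x, size = c_h1, prob = lambda)) +
    sum((r_h0 + y) * dnbinom(y, size = r_h0, prob = 1 - lambda))
}

# Expected sample size if the true prevalence is 0.3
asn_curtailed(0.3, n_max = 290, c_h1 = 74, p = 0.75, q = 0.7)

# Inverse problem: prevalence (to the right of the ASN peak) at which
# the expected sample size equals 200
root <- uniroot(function(pi) asn_curtailed(pi, 290, 74, 0.75, 0.7) - 200,
                interval = c(0.1, 1))
root$root
```

Both results can be compared with the output of Examples 6 and 7 below. The search interval for uniroot() starts at 0.1 because the ASN function rises and then falls, so a given expected sample size can correspond to more than one prevalence.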

Example 5: Compute maximum ASN for Example 1

asn(ex1_uqm)


The maximum average sample number is 278.53 for pi = 0.081.

Example 6: Compute ASN for a specific \(\pi\) in Example 1

asn(ex1_uqm, pi = .3)


The maximum average sample number is 278.53 for pi = 0.081.

The average sample number is 185 for pi = 0.3.

Example 7: Compute \(\pi\) associated with a specific ASN in Example 1

asn(ex1_uqm, asn = 200)


The maximum average sample number is 278.53 for pi = 0.081.

The average sample number is 200 for pi = 0.26.

Function curtailed_rrt_result()

Input

curtailed_rrt_result() accepts/requires the following arguments:

str(formals(curtailed_rrt_result))

Dotted pair list of 4
 $ plan      : symbol 
 $ data      : NULL
 $ N         : NULL
 $ conf.level: num 0.95

Arguments
plan	a list of class `c_rrt` returned from `curtailed_rrt()`.
data	a data vector containing 0 and 1 denoting failures and successes. If NULL (default), the computations will be based on N.
N	a vector of length 2 containing the observed number of successes and failures. If NULL (default), the computations will be based on data.
conf.level	confidence level of the Clopper-Pearson interval. The default is 0.95.

Either data or N must be specified in the function call. The decision and the subsequent estimate, including its standard error and confidence interval, can be computed from either specification. For a plottable output object, data must be specified.

If N is specified, exactly one decision bound must have been reached. Calculations based on N are not possible if fewer or more observations than necessary have been collected. If more observations than necessary were collected, provide the data argument instead.

If data is specified, at least one decision bound must have been reached. Calculations based on data are not possible if fewer observations than necessary have been collected. If more observations than necessary have been collected, the results will be based on the data available when the first bound was reached. In this case, curtailed_rrt_result() returns a warning.
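
The stopping rule described above can be sketched in a few lines of base R. This is an illustration of the logic only, not the code in curtailed_rrt_result.R; first_bound() is a hypothetical name, and the toy plan (maximum of 10 observations, c = 4) is chosen purely to keep the example readable.

```r
# Hypothetical helper: find the observation at which the first bound is reached
first_bound <- function(data, n_max, c_h1) {
  r_h0 <- n_max - c_h1 + 1   # negative responses required to accept H0
  yes  <- cumsum(data == 1)  # running count of affirmative responses
  no   <- cumsum(data == 0)  # running count of negative responses
  hit  <- which(yes >= c_h1 | no >= r_h0)
  if (length(hit) == 0) stop("No decision bound reached: more data are needed.")
  n_stop <- hit[1]
  list(N        = c(Yes = yes[n_stop], No = no[n_stop], Total = n_stop),
       decision = if (yes[n_stop] >= c_h1) "rej" else "acc",
       surplus  = length(data) > n_stop)  # TRUE if data continue past the bound
}

# Toy plan: at most 10 observations, 4 "yes" to accept H1, 7 "no" to accept H0
res <- first_bound(c(1, 0, 1, 1, 0, 1, 0, 0), n_max = 10, c_h1 = 4)
res
```

In the toy data the fourth "yes" occurs at the sixth observation, so the last two responses are surplus and would be ignored.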

Output

curtailed_rrt_result() returns a list of class c_rrt_result with the following components:

Component
model	the RRT model specified in the function call (will be one of {Warner’s Model; Crosswise Model; Unrelated Question Model})
rrt.parameters	a named vector containing the randomization probability p and (if applicable) the neutral prevalence q.
hypotheses	a named vector containing the expected prevalence pi of the sensitive attribute under the null and the alternative hypothesis.
nMax	the maximum number of observations required.
thresholds	a named vector containing the number of affirmative responses (\(c\)) required to accept the alternative hypothesis, and the number of negative responses (\(N_{max} - c + 1)\) required to accept the null hypothesis.
parameters	a named vector denoting the exact Type I error probability as well as the power of the curtailed procedure.
data	a data frame containing the cumulative sum of successes, failures and total responses. NULL if data was not specified.
N	a named vector containing the number of successes, the number of failures and the total number of responses when reaching the first bound.
decision	a character string, either ‘rej’, if the horizontal bound was reached, indicating that H1 should be accepted, or ‘acc’, if the vertical bound was reached, indicating that H0 should be accepted.
estimate	subsequent prevalence estimate.
se	standard error of the prevalence estimate.
ci	Clopper-Pearson confidence interval of the subsequent prevalence estimate.
cl	confidence level of Clopper-Pearson confidence interval.

Printing

In addition to the main function, the script curtailed_rrt_result.R defines generic print and plot functions for objects of class c_rrt_result. Thus, printing the output of curtailed_rrt_result() will return a user-friendly overview of the results. See the following examples for details on the function:

Example 8: Compute results based on raw data (using the sampling plan created in Example 1)

dat <- rbinom(300, 1, 0.4) # simulated responses; no seed is set, so results vary between runs
ex5_res_dat <- curtailed_rrt_result(plan = ex1_uqm, data = dat)

Warning in curtailed_rrt_result(plan = ex1_uqm, data = dat): More than necessary data collected.
  Results are based on the data when the first bound was reached.

ex5_res_dat


Curtailed RRT Design: Unrelated Question Model

Hypotheses on sensitive prevalence:
H0: pi = 0.05
H1: pi = 0.15
Randomization probability p = 0.75
Neutral prevalence q = 0.7
Maximum number of observations: N =  290
Decision thresholds:
H1: No of affirmative responses    H0: No of negative responses 
                             74                             217 
Resulting error rates:
Type I error        Power 
       0.046        0.901 
Observed responses when reaching the first bound:
  Yes    No Total 
   74    96   170 

Decision:
Bound c of 74 affirmative responses was reached. Reject H0!

Subsequent estimate:  0.343 
Clopper-Pearson  95 % confidence interval:  0.249  to  0.447

Note the warning indicating that dat contains more data than necessary. The results are based on the data available at the moment when the first bound (here c) was reached.

Example 9: Compute results based on N (using the sampling plan created in Example 1)

exN <- c(74,109)
ex6_res_N <- curtailed_rrt_result(plan = ex1_uqm, N = exN, conf.level = .90)
ex6_res_N


Curtailed RRT Design: Unrelated Question Model

Hypotheses on sensitive prevalence:
H0: pi = 0.05
H1: pi = 0.15
Randomization probability p = 0.75
Neutral prevalence q = 0.7
Maximum number of observations: N =  290
Decision thresholds:
H1: No of affirmative responses    H0: No of negative responses 
                             74                             217 
Resulting error rates:
Type I error        Power 
       0.046        0.901 
Observed responses when reaching the first bound:
  Yes    No Total 
   74   109   183 

Decision:
Bound c of 74 affirmative responses was reached. Reject H0!

Subsequent estimate:  0.301 
Clopper-Pearson  90 % confidence interval:  0.224  to  0.387

Note that the confidence level was changed from the default (95%) to 90%.
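
The point estimates printed in Examples 8 and 9 can be reproduced by hand under two assumptions that are not confirmed by the script: the UQM response probability is λ = pπ + (1 − p)q, and the proportion of affirmative responses is estimated with the usual unbiased estimator for inverse binomial sampling, (c − 1)/(n − 1), when the bound c was reached. The branch for the other bound uses the analogous estimator and has not been checked against any output shown here. estimate_uqm() is a hypothetical name, and the confidence interval is not reproduced.

```r
# Hypothetical helper: prevalence estimate after curtailed sampling in the UQM
estimate_uqm <- function(yes, no, p, q, bound = c("H1", "H0")) {
  bound <- match.arg(bound)
  n <- yes + no
  # Assumed estimator of the response probability lambda after stopping
  lambda_hat <- if (bound == "H1") (yes - 1) / (n - 1) else yes / (n - 1)
  (lambda_hat - (1 - p) * q) / p  # invert lambda = p * pi + (1 - p) * q
}

round(estimate_uqm(yes = 74, no = 96,  p = 0.75, q = 0.7), 3)  # counts from Example 8
round(estimate_uqm(yes = 74, no = 109, p = 0.75, q = 0.7), 3)  # counts from Example 9
```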

Plotting

The object returned by curtailed_rrt_result() can also be passed on to the plot() function. However, a plot can only be returned if the object was created using the data argument.

Example 10: Plot the results created in Example 8 (using raw data)

plot(ex5_res_dat)

Example 11: Plot the results created in Example 9 (using N)

plot(ex6_res_N) # will return an error

Error in plot.c_rrt_result(ex6_res_N): c_rrt_result object does not contain data to be plotted.

Concluding Remarks

curtailed_rrt.R, curtailed_rrt_result.R, and curtailed_rrt_helpers.cpp may be used for non-commercial purposes free of charge. Although considerable effort was put into developing and testing the functions provided herein, there is no warranty whatsoever. Please refer to Reiber et al. (2020) for further information on the RRT and curtailed sampling. We are grateful for comments, questions, or suggestions. Please address communication concerning the R and C++ scripts provided herein to Fabiola Reiber or Martin Schnuerch.