Topic: Methods in Analysis of Large Population Surveys
Instructor: Dr. Saifuddin Ahmed
Overview
This lecture focuses on variance estimation in complex survey
designs. Standard statistical methods assume simple
random sampling (SRS) and independent, identically
distributed (IID) data. Complex survey designs often violate
the IID assumptions, so analyses that ignore the design produce
incorrect standard errors and confidence intervals.
Why IID Assumptions Fail in Complex Surveys:
- Clustering (Intraclass correlation): Individuals
within clusters (e.g., households, communities) tend to be more similar
to each other.
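- Design Effect (DEFF/DEFT): Variance inflation due
to survey design. DEFF is the ratio of the variance under the actual
design to the variance under SRS of the same size; DEFT is its square
root, so it applies to the standard error.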
- Finite Population Correction (FPC): When you sample
without replacement and the sample is a sizable fraction of the
population, each selection slightly changes the remaining group, making
your estimate more precise. Since most standard error formulas assume
sampling with replacement, the FPC adjusts the standard error downward
to account for this.
- Multistage Sampling: Involves stratification,
clustering, and unequal probabilities.
- Weights: Sampling weights affect both point
estimates and variance estimation.
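As a rough illustration of how clustering and the FPC move a standard error, here is a minimal Python sketch. It assumes Kish's approximation DEFF = 1 + (m - 1)ρ for equal-sized clusters and the usual factor sqrt(1 - n/N) for the FPC; neither formula is given in the lecture, and the function names and numbers are hypothetical.

```python
import math


def design_effect(cluster_size, icc):
    """Kish's approximation (an assumption here): DEFF = 1 + (m - 1) * rho.

    cluster_size: average number of sampled units per cluster (m)
    icc: intraclass correlation (rho)
    """
    return 1.0 + (cluster_size - 1) * icc


def fpc_factor(sample_size, population_size):
    """Multiplier for the SE when sampling without replacement: sqrt(1 - n/N)."""
    return math.sqrt(1.0 - sample_size / population_size)


# Hypothetical numbers, for illustration only.
srs_se = 0.02                                    # SE computed as if the data were SRS
deff = design_effect(cluster_size=20, icc=0.05)
deft = math.sqrt(deff)                           # DEFT works on the SE scale
clustered_se = srs_se * deft                     # clustering inflates the SE
corrected_se = srs_se * fpc_factor(100, 400)     # FPC shrinks the SE

print(f"DEFF={deff:.2f}, DEFT={deft:.3f}")
print(f"SE with clustering={clustered_se:.4f}, SE with FPC={corrected_se:.4f}")
```

Even a small intraclass correlation inflates the variance noticeably when clusters are large, while the FPC only matters when n/N is not negligible.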
Two Procedures for Variance Estimation in Complex Surveys
Variance estimation methods fall into two broad categories:
1. Non-Parametric Procedures
These methods do not assume a specific distributional
form and are applicable to both linear and non-linear
statistics.
Types of Estimates:
- Linear: Mean, total
- Non-linear: Proportions, ratios, regression
coefficients, medians, etc.
Replication-Based Methods
These methods generate multiple pseudoreplicates and
compute variance based on the variability across replicates.
Simple Replication
Also known as the random groups method.
- Divide the sample into r independent groups.
- Estimate the statistic in each group.
- Compute variance as:
\[
Var(\hat{\theta}) = \frac{1}{r(r-1)}\sum_{i=1}^{r}(\hat{\theta}_i -
\bar{\theta})^2
\]
- Limitation: Not precise if r is small; rarely used in practice.
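The random groups formula above can be sketched in a few lines of Python. This is a simplified illustration, not a full survey implementation: `split_into_groups` is a hypothetical helper that ignores strata, clusters, and weights, whereas in a real survey each group should mirror the full sample design.

```python
import random
from statistics import mean


def split_into_groups(values, r, seed=0):
    """Randomly split the sample into r groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    return [shuffled[i::r] for i in range(r)]


def random_groups_variance(group_estimates):
    """Var = 1 / (r (r - 1)) * sum_i (theta_i - theta_bar)^2."""
    r = len(group_estimates)
    if r < 2:
        raise ValueError("need at least two groups")
    center = mean(group_estimates)
    return sum((t - center) ** 2 for t in group_estimates) / (r * (r - 1))


# Hypothetical data: estimate the mean within each of 4 random groups.
sample = [3.1, 4.7, 5.0, 2.8, 6.2, 4.4, 5.5, 3.9, 4.1, 5.8, 3.3, 4.9]
groups = split_into_groups(sample, r=4)
estimates = [mean(g) for g in groups]
print(random_groups_variance(estimates))
```

With only r group estimates feeding the sum, the variance estimate itself is noisy when r is small, which is the limitation noted above.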
Balanced Repeated Replication (BRR)
- Designed for samples with 2 PSUs per stratum.
- Divides each stratum into two halves; each replicate (half-sample)
keeps one half from every stratum.
- Constructs balanced half-samples using a
Hadamard matrix:
  - A square matrix of +1 and -1 entries whose rows are mutually
  orthogonal.
  - Ensures each PSU is included in a balanced way across
  replicates.
Formula:
\[
Var_{BRR}(\hat{\theta}) = \frac{1}{K} \sum_{k=1}^K (\hat{\theta}_k -
\hat{\theta})^2
\]
- Used extensively in National Center for Health Statistics
(NCHS) data.
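A minimal Python sketch of BRR for an estimated total follows, assuming exactly two PSUs per stratum. The Sylvester construction used to build the Hadamard matrix, the function names, and the data are choices made for this illustration, not taken from the lecture. Each row of the matrix defines one half-sample: +1 keeps the first PSU in a stratum, -1 keeps the second, and the kept PSU is doubled to stand in for the whole stratum.

```python
def hadamard(k):
    """Sylvester construction of a k x k Hadamard matrix (k a power of 2)."""
    h = [[1]]
    while len(h) < k:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def brr_variance_total(strata):
    """BRR variance of an estimated total.

    strata: list of (t1, t2) pairs, the weighted totals of the two PSUs
            in each stratum.
    Returns (full-sample total, BRR variance).
    """
    n_strata = len(strata)
    k = 1
    while k <= n_strata:        # need k > n_strata so column 0 can be skipped
        k *= 2
    h = hadamard(k)
    full = sum(t1 + t2 for t1, t2 in strata)
    replicates = []
    for row in h:
        est = 0.0
        for s, (t1, t2) in enumerate(strata):
            # Column s + 1 of the matrix decides which PSU represents stratum s.
            est += 2 * (t1 if row[s + 1] == 1 else t2)
        replicates.append(est)
    variance = sum((r - full) ** 2 for r in replicates) / k
    return full, variance


# Hypothetical weighted PSU totals for three strata.
total, var = brr_variance_total([(10, 14), (20, 17), (5, 9)])
print(total, var)
```

For a linear statistic such as a total, balanced half-samples reproduce the usual paired-PSU variance, the sum over strata of (t1 - t2)^2, with only K replicates instead of all 2^H possible half-samples.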
Jackknife Replication
- Systematically drops one PSU at a time.
- For each replicate:
  - Recalculate the estimate after omitting one PSU.
  - Weight the remaining data accordingly.
Formula:
\[
Var_{jack}(\hat{\theta}) = \frac{n - 1}{n} \sum_{i=1}^n
(\hat{\theta}_{(i)} - \bar{\theta})^2
\]
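- One replicate per PSU, where n is the number of PSUs: 500 PSUs → 500 replicates (if each observation is its own PSU, n is the sample size).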
- Can be applied to any data.
- Very flexible and widely supported.
- Default choice if unsure.
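The delete-one jackknife formula above can be sketched in Python as follows. This is a simplified illustration that treats every observation as its own PSU in a single stratum and ignores weights; the function name and data are hypothetical.

```python
from statistics import mean, variance


def jackknife_variance(values, estimator=mean):
    """Delete-one jackknife: (n - 1) / n * sum_i (theta_(i) - theta_bar)^2."""
    n = len(values)
    # Re-estimate the statistic n times, leaving out one unit each time.
    leave_one_out = [estimator(values[:i] + values[i + 1:]) for i in range(n)]
    center = mean(leave_one_out)
    return (n - 1) / n * sum((t - center) ** 2 for t in leave_one_out)


# Hypothetical data.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
print(jackknife_variance(data))       # jackknife variance of the mean
print(variance(data) / len(data))     # textbook s^2 / n, for comparison
```

For the sample mean the two printed values agree, which is a quick sanity check before passing a different `estimator`.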
Bootstrap Methods
- Resample from the original sample with
replacement.
- You choose the number of replicates (e.g., 100–1000).
- Used for both linear and non-linear estimates.
Bootstrap Variance Formula:
\[
Var_{boot}(\hat{\theta}) = \frac{1}{B-1} \sum_{b=1}^B (\hat{\theta}_b -
\bar{\theta})^2
\]
- Standard deviation of replicate estimates = bootstrap SE.
- Based on Efron (1987), use at least 100 replicates
(preferably 500+ with modern computing).
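A minimal Python sketch of the bootstrap variance formula above. It resamples individual observations, which amounts to treating each observation as a PSU in a single stratum; a survey bootstrap would instead resample PSUs within strata and adjust the weights. The function name, seed, and data are hypothetical.

```python
import random
from statistics import mean, median, variance


def bootstrap_variance(values, estimator=mean, n_reps=500, seed=0):
    """Var_boot = 1 / (B - 1) * sum_b (theta_b - theta_bar)^2."""
    rng = random.Random(seed)          # fixed seed so results are reproducible
    n = len(values)
    # Each replicate: draw n units with replacement and re-estimate.
    reps = [estimator(rng.choices(values, k=n)) for _ in range(n_reps)]
    center = mean(reps)
    return sum((t - center) ** 2 for t in reps) / (n_reps - 1)


# Hypothetical data.
data = [float(x) for x in range(1, 31)]
boot_var = bootstrap_variance(data, n_reps=2000, seed=1)
print("bootstrap SE of the mean:", boot_var ** 0.5)
print("SRS SE of the mean:      ", (variance(data) / len(data)) ** 0.5)
print("bootstrap SE of median:  ", bootstrap_variance(data, estimator=median) ** 0.5)
```

The same function handles a non-linear statistic such as the median simply by swapping the `estimator` argument.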
Linearization-Based Methods
Also known as Taylor series linearization or the
delta method.
- Used to approximate variance of non-linear
statistics by reducing them to linear ones.
- Expands a function \(G(X)\) around
the mean \(\mu\):
\[
Var(G(X)) \approx [G'(\mu)]^2 \cdot Var(X)
\]
- In matrix form, this gives the sandwich estimator.
- Default in Stata and in R’s survey package.
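To make the linearization idea concrete, here is a minimal Python sketch for a ratio of two means, R = ȳ / x̄. A first-order Taylor expansion replaces R by the mean of the linearized variable z_i = (y_i - R x_i) / x̄, so Var(R) ≈ s_z² / n. The sketch assumes SRS with replacement and no weights, strata, or clusters; the function name and data are hypothetical.

```python
from statistics import mean, variance


def ratio_linearized_variance(y, x):
    """Taylor-linearized variance of R = mean(y) / mean(x).

    Returns (ratio, approximate variance of the ratio).
    """
    n = len(y)
    x_bar = mean(x)
    ratio = mean(y) / x_bar
    # Linearized variable: each unit's first-order contribution to R.
    z = [(yi - ratio * xi) / x_bar for yi, xi in zip(y, x)]
    return ratio, variance(z) / n


# Hypothetical data.
y = [2.0, 3.0, 7.0]
x = [1.0, 2.0, 3.0]
ratio, var_ratio = ratio_linearized_variance(y, x)
print(ratio, var_ratio)
```

Once the statistic is reduced to the mean of z, any variance formula for a mean can be applied to it, which is what makes linearization convenient for complex designs.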
2. Parametric Variance Estimation
- Based on model-based assumptions, e.g., linear
regression assumptions.
- Less preferred for complex survey data where design features must be
honored.
Decision-Making Guide
- Use Linearization for most standard analyses
(means, proportions, regression).
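- Use Jackknife or Bootstrap when:
  - You have complex parameters (e.g., medians, ratios). For medians and
  other quantiles, prefer the bootstrap or BRR, because the delete-one
  jackknife is unreliable for non-smooth statistics.
  - You need a replication method that works the same way for any
  statistic, without deriving a linearization.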
- Go with defaults unless you have specific
needs.
When Writing a Paper
- The standard t-test is not valid for complex surveys; instead, use the
“adjusted Wald test” to test for a difference in means.
- The standard chi-square test is not valid for complex surveys; instead,
use the “Rao and Scott (1984) second-order corrected chi-square”.