Course Information
Course: Biostat 655 – Analysis of Multilevel and Longitudinal Data
Instructors: Dr. Zheyu Wang, Dr. Ji Soo Kim
Essential Concepts from Lecture 1
- Decomposition of the association between X and Y into cross-sectional and longitudinal components:
\[ \begin{aligned} Y_{ij} &= \beta_0 + \beta_1 x_{ij} + \epsilon_{ij} \\ &= \beta_0 + \beta_1 (x_{ij} - x_{i1} + x_{i1}) + \epsilon_{ij} \\ &= \beta_0 + \beta_1 x_{i1} + \beta_1 (x_{ij} - x_{i1}) + \epsilon_{ij} \end{aligned} \]
Allowing the baseline term and the change-from-baseline term to have different coefficients gives the more general model:
\[ Y_{ij} = \beta_0 + \beta_C x_{i1} + \beta_L (x_{ij} - x_{i1}) + \epsilon_{ij} \]
This model implies two key identities:
- Cross-sectional effect (C), obtained by setting \(j = 1\): \(Y_{i1} = \beta_0 + \beta_C x_{i1} + \epsilon_{i1}\)
- Longitudinal effect (L): \(Y_{ij} - Y_{i1} = \beta_L (x_{ij} - x_{i1}) + (\epsilon_{ij} - \epsilon_{i1})\)
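The decomposition can be checked numerically. The sketch below simulates data in which the cross-sectional and longitudinal slopes differ, then regresses the outcome on the baseline covariate and the change from baseline. The sample sizes, the true coefficient values, and the independent-error assumption are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_subjects, n_visits = 500, 4
beta0, beta_c, beta_l = 1.0, 2.0, 0.5   # hypothetical true values

# Baseline covariate x_i1 varies between subjects; later visits drift within subject.
x_base = rng.normal(50, 10, size=(n_subjects, 1))
steps = rng.normal(1.0, 1.0, size=(n_subjects, n_visits - 1))
drift = np.hstack([np.zeros((n_subjects, 1)), steps.cumsum(axis=1)])
x = x_base + drift                       # x_ij, with x[:, 0] equal to x_i1

eps = rng.normal(0.0, 1.0, size=x.shape)
y = beta0 + beta_c * x_base + beta_l * (x - x_base) + eps

# Design matrix: intercept, baseline x_i1, change from baseline x_ij - x_i1.
X = np.column_stack([
    np.ones(x.size),
    np.repeat(x_base.ravel(), n_visits),
    (x - x_base).ravel(),
])
coef, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
est_beta0, est_beta_c, est_beta_l = coef
print(f"beta_C estimate: {est_beta_c:.3f}, beta_L estimate: {est_beta_l:.3f}")
```

Fitting a single slope on \(x_{ij}\) instead would force \(\beta_C = \beta_L\), which this simulated data set violates by construction.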
- Intraclass Correlation Coefficient (ICC), also written \(\rho\):
We model the outcome as: \(Y_{ij} = \beta_0 + b_i + \epsilon_{ij}\)
\(\beta_0\) : fixed effect (overall intercept)
\(b_i\): random effect (random intercept) with \(\text{Var}(b_i) = \tau^2\), independent of \(\epsilon_{ij}\)
\(\epsilon_{ij} \sim N(0, \sigma^2)\): residual error
Variance of the Outcome:
\[ \begin{aligned} \text{Var}(Y_{ij}) &= \text{Var}(b_i + \epsilon_{ij}) \\ &= \text{Var}(b_i) + \text{Var}(\epsilon_{ij}) \\ &= \tau^2 + \sigma^2 \end{aligned} \]
Covariance Within a Cluster:
\[ \begin{aligned} \text{Cov}(Y_{ij}, Y_{ik}) &= \text{Cov}(b_i + \epsilon_{ij}, b_i + \epsilon_{ik}) \\ &= \text{Cov}(b_i, b_i) + \text{Cov}(\epsilon_{ij}, \epsilon_{ik}) \\ &= \tau^2 \end{aligned} \]
Intraclass Correlation Coefficient (ICC)
\[ \begin{aligned} \rho &= \text{Corr}(Y_{ij}, Y_{ik}) \\ &= \frac{\text{Cov}(Y_{ij}, Y_{ik})}{\sqrt{\text{Var}(Y_{ij}) \cdot \text{Var}(Y_{ik})}} \\ &= \frac{\tau^2}{\tau^2 + \sigma^2} \\ &= \frac{\text{Total Variance} - \text{Within-Cluster Variance}}{\text{Total Variance}} \end{aligned} \]
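As a quick numerical check of the ICC formula, the sketch below compares \(\tau^2 / (\tau^2 + \sigma^2)\) with the empirical correlation between two observations from the same cluster in simulated random-intercept data. The variance components, the intercept, and the number of clusters are hypothetical values chosen for illustration.

```python
import numpy as np

def icc(tau2: float, sigma2: float) -> float:
    """ICC for a random-intercept model: tau^2 / (tau^2 + sigma^2)."""
    return tau2 / (tau2 + sigma2)

rng = np.random.default_rng(1)
tau2, sigma2 = 4.0, 1.0      # hypothetical variance components
beta0 = 10.0                 # hypothetical overall intercept
n_clusters = 20000

# Two observations per cluster share the same random intercept b_i.
b = rng.normal(0.0, np.sqrt(tau2), size=n_clusters)
y1 = beta0 + b + rng.normal(0.0, np.sqrt(sigma2), size=n_clusters)
y2 = beta0 + b + rng.normal(0.0, np.sqrt(sigma2), size=n_clusters)

rho_theory = icc(tau2, sigma2)
rho_sim = float(np.corrcoef(y1, y2)[0, 1])
print(f"theoretical ICC: {rho_theory:.3f}, simulated correlation: {rho_sim:.3f}")
```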
Consequences of ignoring clustering in data
- Regression estimates remain unbiased if missing data is minimal
- Standard errors, confidence intervals, and tests may be incorrect
- Estimates may be inefficient (i.e., more variable than necessary)
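A small simulation can illustrate the first two points. The sketch below repeatedly draws clustered data from a random-intercept model, estimates the overall mean, and compares the actual sampling variability of that estimate with the standard error computed as if all observations were independent. The design (50 clusters of size 10) and the variance components are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_clusters, cluster_size = 50, 10   # hypothetical design
tau2, sigma2 = 1.0, 1.0             # hypothetical variance components (ICC = 0.5)
n_reps = 2000

means, naive_ses = [], []
for _ in range(n_reps):
    b = rng.normal(0.0, np.sqrt(tau2), size=(n_clusters, 1))
    y = b + rng.normal(0.0, np.sqrt(sigma2), size=(n_clusters, cluster_size))
    means.append(y.mean())
    # Naive SE treats all n_clusters * cluster_size observations as independent.
    naive_ses.append(y.std(ddof=1) / np.sqrt(y.size))

avg_estimate = float(np.mean(means))          # true mean is 0
empirical_sd = float(np.std(means, ddof=1))   # actual variability of the estimate
avg_naive_se = float(np.mean(naive_ses))
print(f"average estimate: {avg_estimate:.3f}")
print(f"empirical SD: {empirical_sd:.3f}, average naive SE: {avg_naive_se:.3f}")
```

In this setup the point estimate stays centered on the truth, while the naive standard error understates the real variability, so naive confidence intervals would be too narrow.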
Autocorrelation
- “Past is prologue”
- “No two people are alike”
- Repeated measurements on a person tend to be more similar to each other than to observations from another person!
Variance-Covariance Structure
- Expressed in terms of common variance and correlation:
- \(Corr(Y_{ij}, Y_{ik}) = Corr(\epsilon_{ij}, \epsilon_{ik}) = \rho_{jk} = \frac{Cov(\epsilon_{ij}, \epsilon_{ik})}{\sqrt{var(\epsilon_{ij}) \cdot var(\epsilon_{ik})}}\)
- \(Cov(\epsilon_{ij}, \epsilon_{ik}) = \rho_{jk}\sigma^2\)
Empirical Correlation Matrix for Residuals
| \(t_{ik}\) (rows) / \(t_{ij}\) (columns) | 2003 | 2004 | 2005 | 2006 |
|---|---|---|---|---|
| 2004 | 0.983 | | | |
| 2005 | 0.971 | 0.986 | | |
| 2006 | 0.970 | 0.977 | 0.983 | |
| 2007 | 0.970 | 0.965 | 0.963 | 0.984 |
Lag Formula: \(\text{LAG} = |t_{ij} - t_{ik}|\)
Autocorrelation function: \(\rho(u) = Corr(r_{ij}, r_{ik}), k = j + u\), where \(r_{ij}\) is the residual for subject \(i\) at occasion \(j\)
The standard error of the estimate of \(\rho(u)\) is roughly \(\frac{1}{\sqrt{N(u)}}\), where \({N(u)}\) is the number of pairs of observations at lag \(u\).
Lag 1 correlation estimates: 0.983, 0.986, 0.983, 0.984 -> Pooled Lag 1 correlation estimate: 0.984 (average)
Lag 2 correlation estimates: 0.971, 0.977, 0.963 -> Pooled Lag 2 correlation estimate: 0.970 (average)
Lag 3 correlation estimates: 0.970, 0.965 -> Pooled Lag 3 correlation estimate: 0.968 (average)
Lag 4 correlation estimates: 0.970
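The pooling step amounts to averaging the entries on each sub-diagonal of the correlation matrix, since all entries on the same sub-diagonal share a lag. The sketch below does this for the tabulated values, using a simple unweighted average; weighting by the number of pairs behind each entry would be an alternative that the table alone does not support.

```python
import numpy as np

years = [2003, 2004, 2005, 2006, 2007]

# Lower triangle of the empirical residual correlation matrix,
# keyed by (row index, column index) into `years`.
lower = {
    (1, 0): 0.983,
    (2, 0): 0.971, (2, 1): 0.986,
    (3, 0): 0.970, (3, 1): 0.977, (3, 2): 0.983,
    (4, 0): 0.970, (4, 1): 0.965, (4, 2): 0.963, (4, 3): 0.984,
}
corr = np.eye(len(years))
for (row, col), value in lower.items():
    corr[row, col] = corr[col, row] = value

# The u-th sub-diagonal holds every correlation at lag u.
pooled = {u: float(np.diag(corr, k=-u).mean()) for u in range(1, len(years))}
for u, rho_u in pooled.items():
    print(f"lag {u}: pooled correlation = {rho_u:.4f}")
```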
Estimating ACF
- In Stata: `autocor`
- In SAS: see `autocorr.sas`
- Use tolerance limits to assess significance
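For readers working outside Stata or SAS, the following is a minimal sketch of the same idea in Python. It is not a port of `autocor` or `autocorr.sas`. The function name `empirical_acf` is hypothetical, the residuals are simulated, and the code assumes equally spaced visits stored as a subjects-by-times array with NaN for missing values. It pools all pairs \((r_{ij}, r_{ik})\) with \(k = j + u\) and reports \(N(u)\) and the rough standard error \(1/\sqrt{N(u)}\).

```python
import numpy as np

def empirical_acf(resid: np.ndarray) -> dict:
    """Pooled autocorrelation by lag for an (n_subjects, n_times) residual array.

    Returns {lag u: (rho_hat(u), N(u), approximate SE 1/sqrt(N(u)))}.
    Assumes equally spaced visits; NaN marks a missing residual.
    """
    n_times = resid.shape[1]
    out = {}
    for u in range(1, n_times):
        left = resid[:, :-u].ravel()    # r_ij
        right = resid[:, u:].ravel()    # r_ik with k = j + u
        keep = ~(np.isnan(left) | np.isnan(right))
        n_pairs = int(keep.sum())
        rho = float(np.corrcoef(left[keep], right[keep])[0, 1])
        out[u] = (rho, n_pairs, 1.0 / np.sqrt(n_pairs))
    return out

# Simulated AR(1)-type residuals with unit variance, purely for illustration.
rng = np.random.default_rng(3)
n_subjects, n_times, rho_true = 2000, 5, 0.6
resid = np.empty((n_subjects, n_times))
resid[:, 0] = rng.normal(size=n_subjects)
for j in range(1, n_times):
    innovation = rng.normal(scale=np.sqrt(1 - rho_true**2), size=n_subjects)
    resid[:, j] = rho_true * resid[:, j - 1] + innovation

acf = empirical_acf(resid)
for u, (rho, n_pairs, se) in acf.items():
    print(f"lag {u}: rho = {rho:.3f}, N(u) = {n_pairs}, SE ~ {se:.3f}")
```

An estimate well outside a band of a couple of standard errors around zero would suggest real autocorrelation at that lag; the exact width of the tolerance band is a convention the notes do not specify.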
Autocovariance Matrix
- Use it to check whether the residuals have roughly the same variance at each time point
- Use the `cov` option in R or Stata
- Norms:
  - Administrative databases (e.g. Medicare data): large correlation in the outcomes across time, even at large lags.
  - Individual patients: correlation decays as the time lag increases but does not appear to go to zero even at large lags.
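A minimal Python sketch of the variance check: compute the empirical covariance matrix of the residuals across time points and inspect its diagonal. The residuals below are simulated with a standard deviation that grows over time; the sample size and the standard deviations are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical residuals: 1000 subjects, 4 time points, SD growing over time.
sds = np.array([1.0, 1.0, 1.5, 2.0])
resid = rng.normal(size=(1000, 4)) * sds

autocov = np.cov(resid, rowvar=False)   # 4 x 4 empirical autocovariance matrix
variances = np.diag(autocov)            # residual variance at each time point
print(np.round(variances, 2))
```

Diagonal entries that differ markedly, as they do here by construction, point toward a heteroskedastic covariance structure.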
Common Covariance Structures
1. Unstructured (Raw)
- Estimate all variances and covariances
- Requires \(\frac{n(n+1)}{2}\) parameters for \(n\) time points; with \(n = 4\), that is \(4 \cdot 5 / 2 = 10\)
| | Time1 | Time2 | Time3 | Time4 |
|---|---|---|---|---|
| Time1 | Var₁ | Cov(1,2) | Cov(1,3) | Cov(1,4) |
| Time2 | Cov(2,1) | Var₂ | Cov(2,3) | Cov(2,4) |
| Time3 | Cov(3,1) | Cov(3,2) | Var₃ | Cov(3,4) |
| Time4 | Cov(4,1) | Cov(4,2) | Cov(4,3) | Var₄ |
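The parameter count follows from symmetry: only the diagonal and one triangle of the matrix are free. The short sketch below enumerates them for four time points; the helper name `n_unstructured_params` is hypothetical.

```python
import numpy as np

def n_unstructured_params(n_times: int) -> int:
    """Distinct variances and covariances in a symmetric n x n matrix."""
    return n_times * (n_times + 1) // 2

n_times = 4
rows, cols = np.tril_indices(n_times)   # lower triangle including the diagonal
labels = [
    f"Var{r + 1}" if r == c else f"Cov({r + 1},{c + 1})"
    for r, c in zip(rows, cols)
]
n_params = n_unstructured_params(n_times)
print(n_params, labels)
```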
2. Compound Symmetry (Exchangeable)
- Same variance, same correlation
- Only 2 parameters needed
- Assumes the correlation does not decay as the time lag increases
| | Time1 | Time2 | Time3 | Time4 |
|---|---|---|---|---|
| Time1 | Var | Var*\(\rho\) | Var*\(\rho\) | Var*\(\rho\) |
| Time2 | Var*\(\rho\) | Var | Var*\(\rho\) | Var*\(\rho\) |
| Time3 | Var*\(\rho\) | Var*\(\rho\) | Var | Var*\(\rho\) |
| Time4 | Var*\(\rho\) | Var*\(\rho\) | Var*\(\rho\) | Var |
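The compound symmetry matrix can be built from its two parameters as follows. This is an illustrative sketch; the function name and the parameter values are hypothetical.

```python
import numpy as np

def compound_symmetry(n_times: int, var: float, rho: float) -> np.ndarray:
    """Covariance matrix with a common variance and a common correlation."""
    corr = np.full((n_times, n_times), rho)
    np.fill_diagonal(corr, 1.0)
    return var * corr

sigma_cs = compound_symmetry(n_times=4, var=2.0, rho=0.5)
print(sigma_cs)
```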
3. Autoregressive (AR-1)
- The correlation at lag \(k\) is \(\rho^{|k|}\)
- Only 2 parameters needed
- Suitable for discrete time
| | Time1 | Time2 | Time3 | Time4 |
|---|---|---|---|---|
| Time1 | Var | Var*\(\rho\) | Var*\(\rho^2\) | Var*\(\rho^3\) |
| Time2 | Var*\(\rho\) | Var | Var*\(\rho\) | Var*\(\rho^2\) |
| Time3 | Var*\(\rho^2\) | Var*\(\rho\) | Var | Var*\(\rho\) |
| Time4 | Var*\(\rho^3\) | Var*\(\rho^2\) | Var*\(\rho\) | Var |
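A sketch of the AR(1) matrix, built by raising \(\rho\) to the absolute lag between each pair of time points. The function name and the parameter values are hypothetical, and equally spaced time points are assumed.

```python
import numpy as np

def ar1_cov(n_times: int, var: float, rho: float) -> np.ndarray:
    """AR(1) covariance: var * rho^|j - k| for equally spaced time points."""
    idx = np.arange(n_times)
    lags = np.abs(idx[:, None] - idx[None, :])   # matrix of |j - k|
    return var * rho ** lags

sigma_ar1 = ar1_cov(n_times=4, var=1.0, rho=0.5)
print(sigma_ar1)
```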
4. Toeplitz
- One variance; a separate correlation for each lag
- Requires \(n\) parameters
| | Time1 | Time2 | Time3 | Time4 |
|---|---|---|---|---|
| Time1 | Var | Var*\(\rho\) | Var*\(\rho_2\) | Var*\(\rho_3\) |
| Time2 | Var*\(\rho\) | Var | Var*\(\rho\) | Var*\(\rho_2\) |
| Time3 | Var*\(\rho_2\) | Var*\(\rho\) | Var | Var*\(\rho\) |
| Time4 | Var*\(\rho_3\) | Var*\(\rho_2\) | Var*\(\rho\) | Var |
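A sketch of the Toeplitz matrix, where the entry for each pair of time points is looked up by its lag. The function name and the correlation values are hypothetical.

```python
import numpy as np

def toeplitz_cov(var: float, rhos: list) -> np.ndarray:
    """Toeplitz covariance from one variance and one correlation per lag.

    rhos[u - 1] is the correlation at lag u, so the matrix has
    len(rhos) + 1 time points and len(rhos) + 1 parameters in total.
    """
    corr_by_lag = np.concatenate([[1.0], rhos])   # lag 0 has correlation 1
    idx = np.arange(len(corr_by_lag))
    lags = np.abs(idx[:, None] - idx[None, :])
    return var * corr_by_lag[lags]

sigma_toep = toeplitz_cov(var=2.0, rhos=[0.8, 0.6, 0.5])
print(sigma_toep)
```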
6. Banded Structures
- Set correlations to zero beyond lag \(k\) (e.g., Banded Toeplitz 2)
| | Time1 | Time2 | Time3 | Time4 |
|---|---|---|---|---|
| Time1 | Var | Var*\(\rho\) | 0 | 0 |
| Time2 | Var*\(\rho\) | Var | Var*\(\rho\) | 0 |
| Time3 | 0 | Var*\(\rho\) | Var | Var*\(\rho\) |
| Time4 | 0 | 0 | Var*\(\rho\) | Var |
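A sketch of a banded Toeplitz matrix: correlations are supplied for the first few lags and set to zero beyond them. The function name and parameter values are hypothetical; with a single correlation supplied, the result matches the pattern in the table above.

```python
import numpy as np

def banded_toeplitz(n_times: int, var: float, rhos: list) -> np.ndarray:
    """Toeplitz covariance with correlations set to zero beyond lag len(rhos)."""
    corr_by_lag = np.zeros(n_times)
    corr_by_lag[0] = 1.0
    corr_by_lag[1:len(rhos) + 1] = rhos   # lags past len(rhos) stay at zero
    idx = np.arange(n_times)
    lags = np.abs(idx[:, None] - idx[None, :])
    return var * corr_by_lag[lags]

sigma_band = banded_toeplitz(n_times=4, var=1.0, rhos=[0.4])
print(sigma_band)
```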
7. Heteroskedastic Models
- Allow variance to vary by time
- Combine with AR, Toeplitz, or CS structures
- Heteroskedastic Toeplitz: \(n + (n-1) = 2n - 1\) parameters (\(n\) variances plus \(n-1\) lag correlations)
| | Time1 | Time2 | Time3 | Time4 |
|---|---|---|---|---|
| Time1 | Var1 | \(\sqrt{Var_1}*\sqrt{Var_2}*\rho\) | \(\sqrt{Var_1}*\sqrt{Var_3}*\rho_2\) | \(\sqrt{Var_1}*\sqrt{Var_4}*\rho_3\) |
| Time2 | \(\sqrt{Var_1}*\sqrt{Var_2}*\rho\) | Var2 | \(\sqrt{Var_2}*\sqrt{Var_3}*\rho\) | \(\sqrt{Var_2}*\sqrt{Var_4}*\rho_2\) |
| Time3 | \(\sqrt{Var_1}*\sqrt{Var_3}*\rho_2\) | \(\sqrt{Var_2}*\sqrt{Var_3}*\rho\) | Var3 | \(\sqrt{Var_3}*\sqrt{Var_4}*\rho\) |
| Time4 | \(\sqrt{Var_1}*\sqrt{Var_4}*\rho_3\) | \(\sqrt{Var_2}*\sqrt{Var_4}*\rho_2\) | \(\sqrt{Var_3}*\sqrt{Var_4}*\rho\) | Var4 |
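Each entry of the heteroskedastic Toeplitz matrix is the product of two standard deviations and a lag-specific correlation. The sketch below builds it that way; the function name, the variances, and the correlations are hypothetical values for illustration.

```python
import numpy as np

def heteroskedastic_toeplitz(variances: list, rhos: list) -> np.ndarray:
    """Covariance with time-specific variances and one correlation per lag.

    Entry (j, k) is sqrt(Var_j) * sqrt(Var_k) * rho_{|j - k|}.
    """
    sd = np.sqrt(np.asarray(variances, dtype=float))
    corr_by_lag = np.concatenate([[1.0], rhos])   # lag 0 has correlation 1
    idx = np.arange(len(sd))
    lags = np.abs(idx[:, None] - idx[None, :])
    return np.outer(sd, sd) * corr_by_lag[lags]

variances = [1.0, 4.0, 9.0, 16.0]
rhos = [0.8, 0.6, 0.5]
sigma_het = heteroskedastic_toeplitz(variances, rhos)
n_params = len(variances) + len(rhos)   # n + (n - 1)
print(sigma_het)
print("parameters:", n_params)
```

Replacing the lag-specific correlations with a single \(\rho\) or with \(\rho^{|j-k|}\) would give heteroskedastic compound symmetry or heteroskedastic AR(1) in the same way.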