Lecture 2: Autocorrelation

Understanding autocorrelation and variance-covariance structures in longitudinal data
Biostat 655
Published: July 22, 2025

Course Information

Course: Biostat 655 – Analysis of Multilevel and Longitudinal Data
Instructors: Dr. Zheyu Wang, Dr. Ji Soo Kim


Essential Concepts from Lecture 1

  • Decomposition of the association between X and Y into cross-sectional and longitudinal components:

\[ \begin{align} Y_{ij} &= \beta_0 + \beta_1 x_{ij} + \epsilon_{ij} \\ &= \beta_0 + \beta_1 (x_{ij} - x_{i1} + x_{i1}) + \epsilon_{ij} \\ &= \beta_0 + \beta_1 x_{i1} + \beta_1 (x_{ij} - x_{i1}) + \epsilon_{ij} \\ &= \beta_0 + \beta_C x_{i1} + \beta_L (x_{ij} - x_{i1}) + \epsilon_{ij} \end{align} \]

This model implies two key identities:

  1. Cross-sectional effect (C), at baseline \(j = 1\): \(Y_{i1} = \beta_0 + \beta_C x_{i1} + \epsilon_{i1}\)
  2. Longitudinal effect (L), for \(j > 1\): \(Y_{ij} - Y_{i1} = \beta_L (x_{ij} - x_{i1}) + (\epsilon_{ij} - \epsilon_{i1})\)
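
The two identities can be checked numerically. The sketch below is only an illustration: the parameter values, the distribution of the covariate, and the helper names `slope` and `simulate` are all assumptions, not part of the lecture. It simulates data from the decomposed model, then regresses baseline outcomes on baseline covariates (recovering \(\beta_C\)) and within-person changes on within-person covariate changes (recovering \(\beta_L\)).

```python
import random


def slope(x, y):
    """Ordinary least-squares slope of y on x (model includes an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate(n_subjects=2000, n_times=4, beta0=1.0, beta_c=2.0,
             beta_l=0.5, sd=0.1, seed=655):
    """Simulate Y_ij = b0 + bC*x_i1 + bL*(x_ij - x_i1) + e_ij (all values hypothetical)."""
    rng = random.Random(seed)
    x1, y1, dx, dy = [], [], [], []
    for _ in range(n_subjects):
        x_base = rng.gauss(50.0, 10.0)                  # baseline covariate x_i1
        y_base = beta0 + beta_c * x_base + rng.gauss(0.0, sd)
        x1.append(x_base)
        y1.append(y_base)
        for j in range(1, n_times):                     # follow-up visits
            x_ij = x_base + j + rng.gauss(0.0, 1.0)
            y_ij = (beta0 + beta_c * x_base
                    + beta_l * (x_ij - x_base) + rng.gauss(0.0, sd))
            dx.append(x_ij - x_base)                    # change in x from baseline
            dy.append(y_ij - y_base)                    # change in Y from baseline
    # Baseline regression estimates beta_C; change-score regression estimates beta_L.
    return slope(x1, y1), slope(dx, dy)


est_c, est_l = simulate()
print(f"cross-sectional slope: {est_c:.3f}, longitudinal slope: {est_l:.3f}")
```

With these assumed settings the two estimated slopes should land close to the simulated \(\beta_C = 2\) and \(\beta_L = 0.5\), even though they differ from each other.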
  • Intraclass correlation coefficient (ICC), denoted \(\rho\)

We model the outcome as: \(Y_{ij} = \beta_0 + b_i + \epsilon_{ij}\)

\(\beta_0\) : fixed effect (overall intercept)
\(b_i\): random effect (random intercept) with \(\text{Var}(b_i) = \tau^2\), independent of \(\epsilon_{ij}\)
\(\epsilon_{ij} \sim N(0, \sigma^2)\): residual error

Variance of the Outcome:

\[ \begin{aligned} \text{Var}(Y_{ij}) &= \text{Var}(b_i + \epsilon_{ij}) \\ &= \text{Var}(b_i) + \text{Var}(\epsilon_{ij}) \\ &= \tau^2 + \sigma^2 \end{aligned} \]

Covariance Within a Cluster (for \(j \neq k\)):

\[ \begin{aligned} \text{Cov}(Y_{ij}, Y_{ik}) &= \text{Cov}(b_i + \epsilon_{ij}, b_i + \epsilon_{ik}) \\ &= \text{Cov}(b_i, b_i) + \text{Cov}(\epsilon_{ij}, \epsilon_{ik}) \\ &= \tau^2 \end{aligned} \]

Intraclass Correlation Coefficient (ICC)

\[ \begin{aligned} \rho &= \text{Corr}(Y_{ij}, Y_{ik}) \\ &= \frac{\text{Cov}(Y_{ij}, Y_{ik})}{\sqrt{\text{Var}(Y_{ij}) \cdot \text{Var}(Y_{ik})}} \\ &= \frac{\tau^2}{\tau^2 + \sigma^2} \\ &= \frac{\text{Total Variance} - \text{Within-Cluster Variance}}{\text{Total Variance}} \end{aligned} \]
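
As a rough numerical check of the ICC formula, the sketch below simulates pairs of observations from the random-intercept model and compares their sample correlation with \(\tau^2 / (\tau^2 + \sigma^2)\). The variance values, the normal distribution for \(b_i\), and the function names are assumptions chosen for illustration.

```python
import random


def icc(tau2, sigma2):
    """ICC implied by the random-intercept model: tau^2 / (tau^2 + sigma^2)."""
    return tau2 / (tau2 + sigma2)


def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulated_icc(tau2, sigma2, n_clusters=20000, seed=655):
    """Correlation between two observations that share the same b_i."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_clusters):
        b_i = rng.gauss(0.0, tau2 ** 0.5)      # shared random intercept
        # beta_0 is omitted: a constant shift does not change the correlation.
        first.append(b_i + rng.gauss(0.0, sigma2 ** 0.5))
        second.append(b_i + rng.gauss(0.0, sigma2 ** 0.5))
    return pearson(first, second)


print("theoretical ICC:", icc(3.0, 1.0))
print("simulated ICC:  ", round(simulated_icc(3.0, 1.0), 3))
```

With \(\tau^2 = 3\) and \(\sigma^2 = 1\), three quarters of the total variance is between clusters, so the theoretical ICC is 0.75 and the simulated correlation should be close to it.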

  • Consequences of ignoring clustering in data

    • Regression estimates remain unbiased if missing data is minimal
    • Standard errors, confidence intervals, and tests may be incorrect
    • Estimates may be inefficient (i.e., more variable than necessary)

Autocorrelation

  • “Past is prologue”
  • “No two people are alike”
  • Repeated measurements on a person tend to be more similar to each other than to observations from another person!

Variance-Covariance Structure

  • Expressed in terms of common variance and correlation:
    • \(Corr(Y_{ij}, Y_{ik}) = Corr(\epsilon_{ij}, \epsilon_{ik}) = \rho_{jk} = \frac{Cov(\epsilon_{ij}, \epsilon_{ik})}{\sqrt{Var(\epsilon_{ij}) \cdot Var(\epsilon_{ik})}}\)
    • With a common variance \(Var(\epsilon_{ij}) = \sigma^2\): \(Cov(\epsilon_{ij}, \epsilon_{ik}) = \rho_{jk}\sigma^2\)

Empirical Correlation Matrix for Residuals

Columns: \(t_{ij}\); rows: \(t_{ik}\)

        2003     2004     2005     2006
2004    0.983
2005    0.971    0.986
2006    0.970    0.977    0.983
2007    0.970    0.965    0.963    0.984

  • Lag Formula: \(\text{LAG} = |t_{ij} - t_{ik}|\)

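  • Autocorrelation function: \(\rho(u) = Corr(r_{ij}, r_{ik})\), the correlation between residuals \(r_{ij}\) and \(r_{ik}\) that are \(u\) time units apart (\(k = j + u\) for equally spaced visits)

  • The standard error of the estimate \(\hat{\rho}(u)\) is roughly \(\frac{1}{\sqrt{N(u)}}\), where \(N(u)\) is the number of pairs of observations at lag \(u\)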

  • Lag 1 correlation estimates: 0.983, 0.986, 0.983, 0.984. Pooled Lag 1 correlation estimate: 0.984 (average)

  • Lag 2 correlation estimates: 0.971, 0.977, 0.963. Pooled Lag 2 correlation estimate: 0.970 (average)

  • Lag 3 correlation estimates: 0.970, 0.965. Pooled Lag 3 correlation estimate: 0.968 (average; 0.9675 before rounding)

  • Lag 4 correlation estimate: 0.970 (a single pair of years, so nothing to pool)
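
The pooling step can be reproduced directly from the empirical correlation matrix. The sketch below assumes, as the "(average)" labels suggest, that pooling is a simple unweighted mean of all correlations sharing the same lag; the dictionary layout and the function name `pooled_by_lag` are illustrative choices.

```python
from collections import defaultdict

# Empirical residual correlations, keyed by the pair of years (from the table above).
corr = {
    (2003, 2004): 0.983,
    (2003, 2005): 0.971, (2004, 2005): 0.986,
    (2003, 2006): 0.970, (2004, 2006): 0.977, (2005, 2006): 0.983,
    (2003, 2007): 0.970, (2004, 2007): 0.965, (2005, 2007): 0.963,
    (2006, 2007): 0.984,
}


def pooled_by_lag(pairs):
    """Group correlations by lag |t_ij - t_ik| and take the unweighted mean per lag."""
    by_lag = defaultdict(list)
    for (t_j, t_k), r in pairs.items():
        by_lag[abs(t_j - t_k)].append(r)
    return {lag: sum(v) / len(v) for lag, v in sorted(by_lag.items())}


pooled = pooled_by_lag(corr)
for lag, value in pooled.items():
    print(f"lag {lag}: pooled correlation {value:.4f}")
```

A weighted average (for example, weighting each correlation by its number of contributing pairs) would be a reasonable alternative when the number of subjects differs across years.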


Estimating ACF

  • In Stata: autocor
  • In SAS: see autocorr.sas
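  • Use tolerance limits to assess significance: estimates outside roughly \(\pm \frac{2}{\sqrt{N(u)}}\) are unlikely if the true autocorrelation at lag \(u\) is zero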

Autocovariance Matrix

  • Purpose: check whether the residuals have roughly the same variance at each time point
  • Request covariances instead of correlations in R or Stata (e.g., cov() in R, or correlate with the covariance option in Stata)
  • Typical patterns
    • Administrative databases (e.g., Medicare data): outcomes remain highly correlated across time, even at large lags.
    • Individual patients: correlation decays as the time lag increases but does not appear to reach zero, even at large lags.
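
To make the idea concrete, here is a minimal sketch of an empirical autocovariance matrix computed from residuals in wide format (one row per subject, one column per time point). The residual values are made up for illustration, and the function name is an assumption; the diagonal holds the variance at each time point and the off-diagonals hold the covariances.

```python
def covariance_matrix(rows):
    """Sample covariance matrix (n - 1 denominator) across columns (time points)."""
    n = len(rows)
    t = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(t)]
    return [
        [
            sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows) / (n - 1)
            for k in range(t)
        ]
        for j in range(t)
    ]


# Hypothetical residuals: 3 subjects (rows) measured at 3 time points (columns).
residuals = [
    [-1.0, -2.0, 1.0],
    [0.0, 0.0, 0.0],
    [1.0, 2.0, -1.0],
]

cov = covariance_matrix(residuals)
for row in cov:
    print(row)
```

Comparing the diagonal entries across time points is the check described above: if they differ substantially, a structure with a single common variance may be a poor fit.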

Common Covariance Structures

1. Unstructured (Raw)

  • Estimate all variances and covariances
  • Requires \(\frac{n(n+1)}{2}\) parameters for \(n\) time points; with 4 time points, \(\frac{4 \times 5}{2} = 10\)

         Time1       Time2       Time3       Time4
Time1    Var₁        Cov(1,2)    Cov(1,3)    Cov(1,4)
Time2    Cov(2,1)    Var₂        Cov(2,3)    Cov(2,4)
Time3    Cov(3,1)    Cov(3,2)    Var₃        Cov(3,4)
Time4    Cov(4,1)    Cov(4,2)    Cov(4,3)    Var₄

2. Compound Symmetry (Exchangeable)

  • Same variance at every time point, same correlation for every pair of time points
  • Only 2 parameters needed (Var and \(\rho\))
  • Assumes the autocorrelation does not decay as the time lag increases

         Time1           Time2           Time3           Time4
Time1    Var             Var*\(\rho\)    Var*\(\rho\)    Var*\(\rho\)
Time2    Var*\(\rho\)    Var             Var*\(\rho\)    Var*\(\rho\)
Time3    Var*\(\rho\)    Var*\(\rho\)    Var             Var*\(\rho\)
Time4    Var*\(\rho\)    Var*\(\rho\)    Var*\(\rho\)    Var
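
The compound symmetry matrix is exactly what the random-intercept model from Lecture 1 implies: the common variance is \(\tau^2 + \sigma^2\) and the common correlation is the ICC. The sketch below builds the matrix both ways; the function names and numeric values are illustrative assumptions.

```python
def compound_symmetry(n, var, rho):
    """n x n covariance matrix with common variance `var` and common correlation `rho`."""
    return [[var if j == k else var * rho for k in range(n)] for j in range(n)]


def cs_from_random_intercept(n, tau2, sigma2):
    """Compound symmetry implied by Y_ij = beta_0 + b_i + e_ij."""
    total = tau2 + sigma2                 # Var(Y_ij)
    return compound_symmetry(n, total, tau2 / total)   # rho = ICC


# Hypothetical variance components: tau^2 = 3, sigma^2 = 1.
for row in cs_from_random_intercept(4, 3.0, 1.0):
    print(row)
```

Every off-diagonal entry equals \(\tau^2\), which matches the within-cluster covariance derived earlier.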

3. Autoregressive (AR-1)

  • The correlation at lag \(k\) is \(\rho^{|k|}\)
  • Only 2 parameters needed
  • Suitable for equally spaced, discrete measurement times

         Time1             Time2             Time3             Time4
Time1    Var               Var*\(\rho\)      Var*\(\rho^2\)    Var*\(\rho^3\)
Time2    Var*\(\rho\)      Var               Var*\(\rho\)      Var*\(\rho^2\)
Time3    Var*\(\rho^2\)    Var*\(\rho\)      Var               Var*\(\rho\)
Time4    Var*\(\rho^3\)    Var*\(\rho^2\)    Var*\(\rho\)      Var
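
A short sketch of the AR(1) structure, where entry \((j, k)\) is Var times \(\rho^{|j-k|}\). The function name and the example values are assumptions for illustration.

```python
def ar1(n, var, rho):
    """n x n AR(1) covariance matrix: var * rho ** |j - k|."""
    return [[var * rho ** abs(j - k) for k in range(n)] for j in range(n)]


# Hypothetical values: variance 2 and lag-1 correlation 0.5.
for row in ar1(4, 2.0, 0.5):
    print(row)
```

Because the correlation shrinks geometrically, AR(1) forces a specific decay pattern. For instance, a lag-1 correlation of 0.984 would imply a lag-4 correlation of \(0.984^4\), roughly 0.94, which is lower than the lag-4 estimate of 0.970 in the empirical matrix shown earlier.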

4. Toeplitz

  • One variance; a separate correlation for each lag
  • Requires \(n\) parameters: 1 variance and \(n - 1\) lag correlations

         Time1             Time2             Time3             Time4
Time1    Var               Var*\(\rho\)      Var*\(\rho_2\)    Var*\(\rho_3\)
Time2    Var*\(\rho\)      Var               Var*\(\rho\)      Var*\(\rho_2\)
Time3    Var*\(\rho_2\)    Var*\(\rho\)      Var               Var*\(\rho\)
Time4    Var*\(\rho_3\)    Var*\(\rho_2\)    Var*\(\rho\)      Var
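
The Toeplitz structure can be sketched as a lookup of the correlation by lag. In the illustrative function below (the name and argument layout are assumptions), `lag_rhos[u - 1]` holds the correlation at lag \(u\), so a list of \(n - 1\) correlations plus one variance gives the \(n\) parameters.

```python
def toeplitz_cov(var, lag_rhos):
    """Toeplitz covariance matrix: one variance, one correlation per lag.

    lag_rhos[u - 1] is the correlation at lag u; the matrix is n x n
    with n = len(lag_rhos) + 1.
    """
    n = len(lag_rhos) + 1
    rho = [1.0] + list(lag_rhos)          # rho[0] = 1 on the diagonal
    return [[var * rho[abs(j - k)] for k in range(n)] for j in range(n)]


# Hypothetical values: variance 2, correlations 0.5, 0.25, 0.125 at lags 1 to 3.
for row in toeplitz_cov(2.0, [0.5, 0.25, 0.125]):
    print(row)
```

Compound symmetry and AR(1) are both special cases: setting every lag correlation to the same \(\rho\) gives compound symmetry, and setting the lag-\(u\) correlation to \(\rho^u\) gives AR(1).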

6. Banded Structures

  • Set correlations to zero beyond a chosen lag (e.g., Banded Toeplitz 2 keeps two bands, the diagonal and lag 1, and sets all longer-lag correlations to zero)

         Time1           Time2           Time3           Time4
Time1    Var             Var*\(\rho\)    0               0
Time2    Var*\(\rho\)    Var             Var*\(\rho\)    0
Time3    0               Var*\(\rho\)    Var             Var*\(\rho\)
Time4    0               0               Var*\(\rho\)    Var
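
A banded structure can be sketched by supplying correlations only for the lags that are kept and filling every longer lag with zero. The function name and arguments below are assumptions; reading the table above, "Banded Toeplitz 2" is taken to mean two bands (the diagonal plus lag 1).

```python
def banded_toeplitz(n, var, lag_rhos):
    """n x n banded Toeplitz covariance matrix.

    lag_rhos[u - 1] is the correlation at lag u for the retained lags;
    correlations at lags beyond len(lag_rhos) are set to zero.
    """
    def rho(lag):
        if lag == 0:
            return 1.0
        return lag_rhos[lag - 1] if lag <= len(lag_rhos) else 0.0

    return [[var * rho(abs(j - k)) for k in range(n)] for j in range(n)]


# Hypothetical values: variance 1, lag-1 correlation 0.4, zero beyond lag 1.
for row in banded_toeplitz(4, 1.0, [0.4]):
    print(row)
```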

7. Heteroskedastic Models

  • Allow variance to vary by time
  • Combine with AR, Toeplitz, or CS structures
  • Heteroskedastic Toeplitz: \(n + (n-1)\) parameters (\(n\) variances and \(n - 1\) lag correlations)

         Time1                                   Time2                                   Time3                                   Time4
Time1    \(Var_1\)                               \(\sqrt{Var_1}*\sqrt{Var_2}*\rho\)      \(\sqrt{Var_1}*\sqrt{Var_3}*\rho_2\)    \(\sqrt{Var_1}*\sqrt{Var_4}*\rho_3\)
Time2    \(\sqrt{Var_1}*\sqrt{Var_2}*\rho\)      \(Var_2\)                               \(\sqrt{Var_2}*\sqrt{Var_3}*\rho\)      \(\sqrt{Var_2}*\sqrt{Var_4}*\rho_2\)
Time3    \(\sqrt{Var_1}*\sqrt{Var_3}*\rho_2\)    \(\sqrt{Var_2}*\sqrt{Var_3}*\rho\)      \(Var_3\)                               \(\sqrt{Var_3}*\sqrt{Var_4}*\rho\)
Time4    \(\sqrt{Var_1}*\sqrt{Var_4}*\rho_3\)    \(\sqrt{Var_2}*\sqrt{Var_4}*\rho_2\)    \(\sqrt{Var_3}*\sqrt{Var_4}*\rho\)      \(Var_4\)
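
Each entry of the heteroskedastic Toeplitz matrix is the product of the two standard deviations and the correlation for that lag. The sketch below builds the matrix from \(n\) variances and \(n - 1\) lag correlations; the function name and the numeric values are assumptions for illustration.

```python
import math


def het_toeplitz(variances, lag_rhos):
    """Heteroskedastic Toeplitz covariance: sqrt(Var_j) * sqrt(Var_k) * rho_|j-k|.

    variances: one variance per time point (n values).
    lag_rhos:  lag_rhos[u - 1] is the correlation at lag u (n - 1 values).
    """
    n = len(variances)
    if len(lag_rhos) != n - 1:
        raise ValueError("need exactly n - 1 lag correlations")
    rho = [1.0] + list(lag_rhos)          # lag 0 correlation is 1
    sd = [math.sqrt(v) for v in variances]
    return [[sd[j] * sd[k] * rho[abs(j - k)] for k in range(n)] for j in range(n)]


# Hypothetical values: variances grow over time; correlations shrink with lag.
variances = [1.0, 4.0, 9.0, 16.0]
lag_rhos = [0.5, 0.25, 0.125]
for row in het_toeplitz(variances, lag_rhos):
    print(row)
print("parameters:", len(variances) + len(lag_rhos))   # n + (n - 1)
```

Swapping the lag correlations for a single \(\rho\) or for \(\rho^u\) would give heteroskedastic compound symmetry or heteroskedastic AR(1), respectively.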