Limitations

Statistical Power

The TLC trial may have been underpowered for two independent reasons: the post-hoc 96% power claim requires statistically implausible assumptions, and the 3 IQ point detectable difference was at the upper bound of what the literature supported for this population. Together, these issues suggest the trial had substantially less ability to detect a real treatment effect than the investigators reported.


Protocol vs. Published Assumptions

The power assumptions changed across three documents, each reporting different values for the same trial:

| Document | Assumed SD | Assumed n at 36 months | Covariate adjustment | Power |
| --- | --- | --- | --- | --- |
| Protocol (Version 10, Section 10.1) | 15 points | 1,040 | Unadjusted | Design: 90% (or 98% with SD = 12) |
| TLC Group (1998) | 14 points | 608 (78% of 780) | ANCOVA with R² = 0.16 | Design: 82% |
| Rogan et al. (2001) | Not stated | “Higher than expected” | “Better than expected” | Actual: 96% |

The protocol (Version 10, Section 10.1) assumed 1,040 children at follow-up; actual enrollment was 780 (Rogan et al. 2001). Measured against the protocol, the sample was smaller than assumed, not larger; the “higher than expected” wording is consistent only with the 608 assumed by the TLC Group (1998). This discrepancy is not reconciled in any publication.
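
The two design-stage power figures can be roughly reproduced from the tabulated inputs. The following is a minimal sketch, not the investigators' own calculation: it assumes a two-arm trial with equal allocation, a two-sided α of 0.05, a 3 IQ point difference, and a normal approximation in which covariate adjustment shrinks the outcome SD by a factor of sqrt(1 − R²). The function name `approx_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_total, delta, sd, r2=0.0, alpha=0.05):
    """Normal-approximation power for a two-arm, equal-allocation comparison.

    r2 is the share of outcome variance explained by baseline covariates
    (0.0 means an unadjusted comparison). All inputs are assumptions.
    """
    z = NormalDist()
    resid_sd = sd * sqrt(1.0 - r2)          # SD remaining after adjustment
    se = resid_sd * sqrt(4.0 / n_total)     # SE of the difference, n_total/2 per arm
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)   # two-sided critical value
    return z.cdf(delta / se - z_crit)


# Inputs as tabulated above; delta = 3 IQ points throughout.
protocol = approx_power(1040, 3, 15)
protocol_sd12 = approx_power(1040, 3, 12)
design_1998 = approx_power(608, 3, 14, r2=0.16)

for label, value in [
    ("Protocol, SD = 15, unadjusted", protocol),
    ("Protocol, SD = 12, unadjusted", protocol_sd12),
    ("TLC Group (1998), SD = 14, R2 = 0.16", design_1998),
]:
    print(f"{label}: {value:.1%}")
```

Under these assumptions the sketch lands close to the 90%, 98%, and 82% figures, which suggests the documents used a calculation of this general form.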


The 96% Power Claim

The NEJM paper states:

“Our study was designed to have 82 percent power to detect a difference of three points between the treatment groups in IQ scores at three years of follow-up. The actual power of the study was 96 percent, because the number of children with data at 36 months of follow-up was higher than expected and the correlation between base-line and follow-up psychometric tests was better than expected.”

—Rogan WJ, Dietrich KN, Ware JH, et al. N Engl J Med. 2001;344(19):1421–1426.

We back-calculated the covariate R² required to reach 96% power using the standard ANCOVA power formula with the published parameters (n ≈ 780, δ = 3 points, α = 0.05, two-sided). The result: R² ≈ 0.45 — nearly triple the R² = 0.16 assumed by the TLC Group (1998, p. 320). This is our calculation derived from the investigators’ own published numbers.

The achieved baseline-outcome R² was never reported in any TLC publication. The 96% power claim cannot be independently verified from the published data.
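
A back-calculation of this kind can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of the figure above: it uses a normal approximation, equal allocation, and treats the outcome SD as a free parameter because the required R² is sensitive to it. The name `required_r2` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def required_r2(n_total, delta, sd, power, alpha=0.05):
    """Covariate R^2 needed to reach a target power (normal approximation).

    Assumes two equal arms and that adjustment shrinks the outcome SD
    by sqrt(1 - R^2). The sd argument is an assumption, not a published value.
    """
    z = NormalDist()
    z_sum = z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(power)
    se_needed = delta / z_sum                     # SE that yields the target power
    resid_sd = se_needed / sqrt(4.0 / n_total)    # residual SD implied by that SE
    return 1.0 - (resid_sd / sd) ** 2


# n = 780, delta = 3, target power = 96%; SD varied across the design values.
for sd in (14, 15):
    print(f"SD = {sd}: required R^2 = {required_r2(780, 3, sd, 0.96):.2f}")
```

The output depends on the SD and on the number of children actually analyzed at 36 months; a smaller analyzed n pushes the required R² higher. Under every combination tried here, the required R² is more than double the 0.16 design assumption.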


BSID-to-WPPSI Correlations in the Literature

An R² of 0.45 requires a correlation of r ≈ 0.67 between baseline developmental scores (BSID-II at 18–24 months) and the 36-month IQ outcome (WPPSI-R). Published correlations between early BSID scores and later IQ fall well below that value:

| Study | Age at BSID | Age at IQ | Correlation |
| --- | --- | --- | --- |
| Kvestad et al. 2022 (Nepal, n=529) | 18–23 months | 4 years | 0.20 |
| Kvestad et al. 2022 (Nepal, n=529) | 30–35 months | 4 years | 0.36 |
| Koshy et al. 2024 (India, n=251) | 24 months | 5 years | 0.16–0.26 |
| Koshy et al. 2024 (India, n=251) | 36 months | 5 years | 0.23–0.29 |

The R² = 0.16 assumed by the TLC Group (1998) was appropriate and consistent with developmental literature. The “better than expected” correlation claimed by Rogan et al. (2001) yielding 96% power is not supported by typical BSID-to-WPPSI correlations.
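
The comparison between R² and r can be made explicit with a few lines of arithmetic. This sketch assumes the tabulated literature values are simple correlations (r) and, for the two ranges, uses the upper bound; it also treats a single baseline score as the only covariate, which is a simplification.

```python
from math import sqrt

# Correlation implied by each R^2 if one baseline score were the only covariate.
required_r = sqrt(0.45)   # R^2 needed for the 96% power claim
design_r = sqrt(0.16)     # R^2 assumed by the TLC Group (1998)

# Upper-bound correlations from the table above (assumed to be r, not R^2).
literature_r = {
    "Kvestad 2022, 18-23 months": 0.20,
    "Kvestad 2022, 30-35 months": 0.36,
    "Koshy 2024, 24 months": 0.26,
    "Koshy 2024, 36 months": 0.29,
}
implied_r2 = {study: r ** 2 for study, r in literature_r.items()}

print(f"r required for R^2 = 0.45: {required_r:.2f}")
print(f"r implied by R^2 = 0.16:  {design_r:.2f}")
for study, r2 in implied_r2.items():
    print(f"{study}: R^2 = {r2:.2f}")
```

Squaring the literature correlations puts every implied R² at or below the 0.16 design assumption, and far from 0.45.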


The 0.83 Correlation Is Misleading

Rogan et al. (2001) report a correlation of 0.83 between 18-month and 36-month WPPSI-R scores. This figure may appear to support the high-power claim, but it is a same-instrument correlation over a short interval — used for imputation in 32 children who had WPPSI scores at 18 months but not 36 months.

This is not the baseline BSID-II to 36-month WPPSI-R correlation that would drive power in the ANCOVA model. Same-instrument test-retest correlations are expected to be high and are not informative about cross-instrument predictive validity.


Paternal IQ Was Not Measured

The ANCOVA model adjusted for maternal IQ, but paternal IQ was never collected. This matters because:

  • Spouse IQ correlations average r ≈ 0.40 — substantially higher than for personality (~0.10) or physical traits (~0.20)
  • Maternal and paternal IQ associations with offspring outcomes are similar in magnitude, and both independently predict child outcomes
  • IQ heritability is approximately 40–50% in childhood

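Because treatment was randomized, the missing paternal IQ does not bias the treatment comparison, but adjusting for maternal IQ alone leaves part of the heritable variance in child IQ unexplained. That further undermines the plausibility of achieving R² ≈ 0.45 with the available covariates.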


The 3 IQ Point Target

Independent of the power claim, the TLC Group (1998, p. 320) powered the trial to detect a 3 IQ point difference. Three key meta-analyses were available when TLC was designed:

Needleman & Gatsonis (1990): Reported partial correlation coefficients (r = −0.15, 95% CI ±0.05) but explicitly stated: “Neither approach provides an overall estimate of the raw effect size, ie, of the average change in IQ units per unit change in lead exposure.”

Pocock et al. (1994): Found that a doubling of blood lead from 10 to 20 µg/dL was associated with a mean deficit of 1–2 IQ points.

Schwartz (1994): Estimated a loss of 2.6 IQ points for an increase in blood lead from 10 to 20 µg/dL. Critically, Schwartz found that the effect varied by population:

| Subgroup | IQ Points Lost per 10 µg/dL |
| --- | --- |
| Studies with mean BLL ≤15 µg/dL | 3.23 ± 1.26 |
| Studies with mean BLL >15 µg/dL | 2.32 ± 0.40 |
| Non-disadvantaged populations | 2.89 ± 0.50 |
| Disadvantaged populations | 1.85 ± 0.92 |

For the TLC population — baseline BLL 26.2 µg/dL (Rogan et al. 2001, Table 1), predominantly low-income and minority (TLC Group 1998, Table 2) — the expected IQ gain using Schwartz (1994) for disadvantaged populations is approximately 2.1 points, and using Pocock et al. (1994) approximately 1.1–2.2 points. The 3 IQ point target was at the upper bound of what the literature supported.


Nonlinear Dose-Response

Post-TLC research confirmed the supralinear dose-response relationship — the greatest IQ loss per unit of lead occurs at the lowest blood lead levels:

  • Canfield et al. (2003): 7.4 IQ points lost per 10 µg/dL at BLL 1–10 µg/dL, but only 4.6 points per 10 µg/dL across the full range
  • Lanphear et al. (2005): The largest incremental deficit (6 IQ points) occurred with the first 10 µg/dL increase

The TLC trial, operating in the 20–44 µg/dL range, was attempting to detect an effect in the portion of the dose-response curve where effects per unit change are smallest.


Why This Matters

The 96% power claim serves a specific rhetorical function: it forecloses the possibility that the null result was due to insufficient statistical power. If the trial had 96% power, the argument goes, a true 3-point effect would almost certainly have been detected. But the claim rests on an unreported R² that is nearly triple the design assumption and well outside the range established by the developmental literature.

Compounding this, the 3 IQ point target was itself optimistic for this population. The dose-response literature available in 1994 — particularly Schwartz (1994) — predicted smaller effects in disadvantaged populations at higher BLLs. If the true expected effect was closer to 2 IQ points, the trial’s actual power to detect it drops substantially below 96%, regardless of the covariate adjustment.
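
The size of that drop can be sketched by holding the standard error fixed at whatever value yields a given power for a 3-point effect, then asking what power the same standard error gives for a 2-point effect. This is a minimal illustration under a normal approximation with a two-sided α of 0.05; the function name `power_at_other_effect` is hypothetical, and the 2-point effect is the illustrative value from the paragraph above, not an estimate.

```python
from statistics import NormalDist


def power_at_other_effect(power_ref, delta_ref, delta_true, alpha=0.05):
    """Power for delta_true, given the power a design has for delta_ref.

    The standard error implied by (power_ref, delta_ref) is held fixed,
    so sample size, SD, and covariate R^2 do not need to be specified.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    se = delta_ref / (z_crit + z.inv_cdf(power_ref))   # SE implied by the reference design
    return z.cdf(delta_true / se - z_crit)


# If the 96% claim were accurate for a 3-point effect:
p_claimed = power_at_other_effect(0.96, 3, 2)
# If the 82% design power were accurate for a 3-point effect:
p_design = power_at_other_effect(0.82, 3, 2)

print(f"Power for a 2-point effect, taking 96% at 3 points: {p_claimed:.0%}")
print(f"Power for a 2-point effect, taking 82% at 3 points: {p_design:.0%}")
```

Under this approximation, power for a 2-point effect is roughly 70% even if the 96% claim is taken at face value, and slightly below 50% if the 82% design figure is the accurate one.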

A trial that was underpowered to detect a real effect would be expected to produce exactly the result that TLC produced: a null finding. That null finding was then used to conclude that chelation does not work — a conclusion that requires adequate power to support. See also: BLL Separation Collapsed, Treatment Endpoint.

Source documents referenced on this page are available in the TLC Reference Library.