The TLC trial may have been underpowered for two independent reasons: the post-hoc 96% power claim requires statistically implausible assumptions, and the 3 IQ point detectable difference was at the upper bound of what the literature supported for this population. Together, these issues suggest the trial had substantially less ability to detect a real treatment effect than the investigators reported.
Protocol vs. Published Assumptions
The power assumptions changed across three documents, each reporting different values for the same trial:
The protocol (Version 10, Section 10.1) assumed 1,040 children at follow-up; actual enrollment was 780 (Rogan et al. 2001). The sample was therefore smaller than the protocol assumed, not larger, and no publication reconciles the discrepancy.
"Our study was designed to have 82 percent power to detect a difference of three points between the treatment groups in IQ scores at three years of follow-up. The actual power of the study was 96 percent, because the number of children with data at 36 months of follow-up was higher than expected and the correlation between base-line and follow-up psychometric tests was better than expected."
— Rogan WJ, Dietrich KN, Ware JH, et al. N Engl J Med. 2001;344(19):1421–1426.
We back-calculated the covariate R² required to reach 96% power using the standard ANCOVA power formula with the published parameters (n ≈ 780, δ = 3 points, α = 0.05, two-sided). The result: R² ≈ 0.45 — nearly triple the R² = 0.16 assumed by the TLC Group (1998, p. 320). This is our calculation derived from the investigators' own published numbers.
The achieved baseline-outcome R² was never reported in any TLC publication. The 96% power claim cannot be independently verified from the published data.
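The back-calculation described above can be sketched with a normal approximation to two-arm ANCOVA power. This is an illustration, not the investigators' method: the outcome SD of 15 IQ points, the equal split of 780 children between arms, and the function names are all assumptions introduced here, and the resulting R² shifts with the SD and sample size that are plugged in.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def ancova_power(delta, sd, n_total, r2, alpha=0.05):
    """Approximate two-sided power for a two-arm ANCOVA.

    Assumes equal allocation and that the covariates explain a fraction r2
    of the outcome variance, so the residual SD is sd * sqrt(1 - r2).
    The negligible contribution of the opposite tail is ignored.
    """
    n_per_arm = n_total / 2
    se = sd * (1 - r2) ** 0.5 * (2 / n_per_arm) ** 0.5
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(delta / se - z_crit)


def required_r2(target_power, delta, sd, n_total, alpha=0.05):
    """Invert the approximation above for the R² needed to reach target_power."""
    n_per_arm = n_total / 2
    z_sum = _STD_NORMAL.inv_cdf(1 - alpha / 2) + _STD_NORMAL.inv_cdf(target_power)
    # Fraction of outcome variance that may remain unexplained.
    residual_var_ratio = delta ** 2 * n_per_arm / (2 * sd ** 2 * z_sum ** 2)
    return 1 - residual_var_ratio


if __name__ == "__main__":
    SD_ASSUMED = 15.0  # assumption: conventional IQ-scale SD, not taken from TLC
    r2 = required_r2(0.96, delta=3.0, sd=SD_ASSUMED, n_total=780)
    print(f"R² needed for 96% power: {r2:.2f}")
```

With these assumed inputs the required R² comes out in the low-to-mid 0.4 range, in the neighborhood of the figure quoted above; a somewhat smaller analyzed sample or a larger SD pushes it higher.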
BSID-to-WPPSI Correlations in the Literature
An R² of 0.45 requires a correlation of r ≈ 0.67 between baseline developmental scores (BSID-II at 18–24 months) and the 36-month IQ outcome (WPPSI-R). Published studies demonstrate this is implausibly high:
| Study | Age at BSID | Age at IQ | R² |
| --- | --- | --- | --- |
| Kvestad et al. 2022 (Nepal, n=529) | 18–23 months | 4 years | 0.20 |
| Kvestad et al. 2022 (Nepal, n=529) | 30–35 months | 4 years | 0.36 |
| Koshy et al. 2024 (India, n=251) | 24 months | 5 years | 0.16–0.26 |
| Koshy et al. 2024 (India, n=251) | 36 months | 5 years | 0.23–0.29 |
The R² = 0.16 assumed by the TLC Group (1998) was appropriate and consistent with developmental literature. The "better than expected" correlation claimed by Rogan et al. (2001) yielding 96% power is not supported by typical BSID-to-WPPSI correlations.
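To put the comparison on the correlation scale, the R² values in the table can be converted to r and set against the r ≈ 0.67 implied by R² ≈ 0.45. The sketch below assumes a single baseline covariate, so that r is simply the square root of R²; the dictionary labels and the function name are illustrative.

```python
from math import sqrt

REQUIRED_R2 = 0.45  # back-calculated value discussed above

# R² values as listed in the table; ranges are kept as (low, high).
LITERATURE_R2 = {
    "Kvestad 2022, BSID at 18-23 months": (0.20, 0.20),
    "Kvestad 2022, BSID at 30-35 months": (0.36, 0.36),
    "Koshy 2024, BSID at 24 months": (0.16, 0.26),
    "Koshy 2024, BSID at 36 months": (0.23, 0.29),
}


def r2_to_r(r2):
    """Correlation implied by R² when there is one covariate (assumption)."""
    return sqrt(r2)


if __name__ == "__main__":
    print(f"required: r = {r2_to_r(REQUIRED_R2):.2f}")
    for label, (low, high) in LITERATURE_R2.items():
        print(f"{label}: r = {r2_to_r(low):.2f} to {r2_to_r(high):.2f}")
```

Every tabulated value, including the upper end of each range, falls below the required R².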
The 0.83 Correlation Is Misleading
Rogan et al. (2001) report a correlation of 0.83 between 18-month and 36-month WPPSI-R scores. This figure may appear to support the high-power claim, but it is a same-instrument correlation over a short interval, and it was used to impute 36-month scores for 32 children who had WPPSI-R scores at 18 months but not at 36 months.
This is not the baseline BSID-II to 36-month WPPSI-R correlation that would drive power in the ANCOVA model. Same-instrument test-retest correlations are expected to be high and are not informative about cross-instrument predictive validity.
Paternal IQ Was Not Measured
The ANCOVA model adjusted for maternal IQ, but paternal IQ was never collected. This matters because:
Spouse IQ correlations average r ≈ 0.40 — substantially higher than for personality (~0.10) or physical traits (~0.20)
Maternal and paternal IQ associations with offspring outcomes are similar in magnitude, and both independently predict child outcomes
IQ heritability is approximately 40–50% in childhood
Controlling only for maternal IQ leaves residual genetic confounding and further undermines the plausibility of achieving R² ≈ 0.45 with the available covariates.
The 3 IQ Point Target
Independent of the power claim, the TLC Group (1998, p. 320) powered the trial to detect a 3 IQ point difference. Three key meta-analyses were available when TLC was designed:
Needleman & Gatsonis (1990): Reported partial correlation coefficients (r = −0.15, 95% CI ±0.05) but explicitly stated: "Neither approach provides an overall estimate of the raw effect size, ie, of the average change in IQ units per unit change in lead exposure."
Pocock et al. (1994): Found that a doubling of blood lead from 10 to 20 µg/dL was associated with a mean deficit of 1–2 IQ points.
Schwartz (1994): Found 2.6 IQ points lost per 10 µg/dL increase from 10 to 20 µg/dL. Critically, Schwartz found the effect varied by population:
| Subgroup | IQ Points Lost per 10 µg/dL |
| --- | --- |
| Studies with mean BLL ≤15 µg/dL | 3.23 ± 1.26 |
| Studies with mean BLL >15 µg/dL | 2.32 ± 0.40 |
| Non-disadvantaged populations | 2.89 ± 0.50 |
| Disadvantaged populations | 1.85 ± 0.92 |
For the TLC population — baseline BLL 26.2 µg/dL (Rogan et al. 2001, Table 1), predominantly low-income and minority (TLC Group 1998, Table 2) — the expected IQ gain using Schwartz (1994) for disadvantaged populations is approximately 2.1 points, and using Pocock et al. (1994) approximately 1.1–2.2 points. The 3 IQ point target was at the upper bound of what the literature supported.
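The arithmetic behind such estimates is a linear scaling of a published slope by the blood lead reduction attributed to treatment. The page does not state the reduction it used, so the value below is a placeholder: a reduction of roughly 11 µg/dL would be consistent with the figures quoted above, but that is an inference made here, not a number taken from the source. Linear extrapolation over this range is also an assumption.

```python
# Slopes in IQ points per 10 µg/dL, as quoted above.
SCHWARTZ_DISADVANTAGED = 1.85
POCOCK_LOW, POCOCK_HIGH = 1.0, 2.0

# Hypothetical treatment-related BLL reduction in µg/dL (assumption).
HYPOTHETICAL_BLL_REDUCTION = 11.0


def expected_iq_gain(slope_per_10, bll_reduction):
    """IQ points gained if the slope applies linearly to the BLL reduction."""
    return slope_per_10 * bll_reduction / 10.0


if __name__ == "__main__":
    d = HYPOTHETICAL_BLL_REDUCTION
    print(f"Schwartz, disadvantaged: {expected_iq_gain(SCHWARTZ_DISADVANTAGED, d):.1f}")
    print(f"Pocock: {expected_iq_gain(POCOCK_LOW, d):.1f} to "
          f"{expected_iq_gain(POCOCK_HIGH, d):.1f}")
```

Under this scaling, any reduction below about 16 µg/dL leaves the Schwartz disadvantaged-population estimate short of the 3-point target.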
Nonlinear Dose-Response
Post-TLC research confirmed the supralinear dose-response relationship — the greatest IQ loss per unit of lead occurs at the lowest blood lead levels:
Canfield et al. (2003): 7.4 IQ points lost per 10 µg/dL at BLL 1–10 µg/dL, but only 4.6 points across the full range
Lanphear et al. (2005): The largest incremental deficit (6 IQ points) occurred with the first 10 µg/dL increase
The TLC trial, operating in the 20–44 µg/dL range, was attempting to detect an effect in the portion of the dose-response curve where effects per unit change are smallest.
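One way to see why the per-unit effect shrinks at higher blood lead levels is to assume a log-linear curve, IQ = a − β·ln(BLL). That functional form and the slope β are assumptions used only for illustration, not fitted values from the studies cited; the point of the sketch is that the ratio of per-unit losses between two ranges does not depend on β.

```python
from math import log


def loss_per_unit(bll_low, bll_high, beta=1.0):
    """Average IQ loss per µg/dL between two BLLs under IQ = a - beta*ln(BLL).

    beta is a hypothetical slope in IQ points per natural-log unit.
    """
    total_loss = beta * log(bll_high / bll_low)
    return total_loss / (bll_high - bll_low)


def per_unit_ratio(range_a, range_b):
    """Ratio of per-unit losses for two BLL ranges; beta cancels out."""
    return loss_per_unit(*range_a) / loss_per_unit(*range_b)


if __name__ == "__main__":
    low_range = (1.0, 10.0)    # range discussed for Canfield et al. (2003)
    tlc_range = (20.0, 44.0)   # TLC operating range named above
    ratio = per_unit_ratio(low_range, tlc_range)
    print(f"per-unit loss, low range vs TLC range: {ratio:.1f}x")
```

Under this assumed shape, a given reduction in blood lead buys several times less IQ in the 20–44 µg/dL range than it would below 10 µg/dL.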
Why This Matters
The 96% power claim serves a specific rhetorical function: it forecloses the possibility that the null result was due to insufficient statistical power. If the trial had 96% power, the argument goes, a true 3-point effect would almost certainly have been detected. But the claim rests on an unreported R² that is nearly triple the design assumption and well outside the range established by the developmental literature.
Compounding this, the 3 IQ point target was itself optimistic for this population. The dose-response literature available in 1994 — particularly Schwartz (1994) — predicted smaller effects in disadvantaged populations at higher BLLs. If the true expected effect was closer to 2 IQ points, the trial's actual power to detect it drops substantially below 96%, regardless of the covariate adjustment.
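The sensitivity of power to the true effect size can be illustrated with the same kind of normal approximation used for ANCOVA power. As before, the SD of 15, the equal allocation of 780 children, and the function name are assumptions made for this sketch rather than values reported by TLC.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()
SD_ASSUMED = 15.0  # assumption: conventional IQ-scale SD
N_TOTAL = 780      # enrollment cited above, assumed split equally


def approx_power(delta, r2, sd=SD_ASSUMED, n_total=N_TOTAL, alpha=0.05):
    """Two-sided power for a two-arm ANCOVA under a normal approximation."""
    n_per_arm = n_total / 2
    se = sd * (1 - r2) ** 0.5 * (2 / n_per_arm) ** 0.5
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(delta / se - z_crit)


if __name__ == "__main__":
    for r2 in (0.16, 0.45):
        for delta in (3.0, 2.0):
            print(f"R² = {r2:.2f}, true effect = {delta:.0f} points: "
                  f"power = {approx_power(delta, r2):.2f}")
```

Under these assumptions, power for a 2-point effect is roughly 0.7 even if R² really was 0.45, and only slightly better than a coin flip at the design value of R² = 0.16.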
A trial that was underpowered to detect a real effect would be expected to produce exactly the result that TLC produced: a null finding. That null finding was then used to conclude that chelation does not work, a conclusion that requires adequate power to support.
See also: BLL Separation Collapsed, Treatment Endpoint.
Source documents referenced on this page are available in the TLC Reference Library.