Statistical Power
The TLC trial may have been underpowered for two independent reasons: the post-hoc 96% power claim requires statistically implausible assumptions, and the 3 IQ point detectable difference was at the upper bound of what the literature supported for this population. Together, these issues suggest the trial had substantially less ability to detect a real treatment effect than the investigators reported.
Protocol vs. Published Assumptions
The power assumptions changed across three documents, each reporting different values for the same trial:
| Parameter | Protocol (Version 10, Section 10.1) | TLC Group (1998) | Rogan et al. (2001) |
|---|---|---|---|
| Assumed SD | 15 points | 14 points | Not stated |
| Assumed n at 36 months | 1,040 | 608 (78% of 780) | “Higher than expected” |
| Covariate adjustment | Unadjusted | ANCOVA with R² = 0.16 | “Better than expected” |
| Power | 90% design (or 98% with SD=12) | 82% design | 96% actual |
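The protocol (Version 10, Section 10.1) assumed 1,040 children at follow-up; actual enrollment was 780 (Rogan et al. 2001). Relative to the protocol, the sample was smaller, not larger. The “higher than expected” follow-up described by Rogan et al. (2001) is only possible relative to the 608 children assumed by the TLC Group (1998). This discrepancy between the documents is not reconciled in any publication.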
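The design figures above can be cross-checked with the usual normal-approximation power formula for a two-arm comparison of means, in which covariate adjustment shrinks the residual SD by a factor of √(1 − R²). The following is a minimal sketch, not the investigators' code: it assumes equal allocation to the two arms, a 3-point difference, and a two-sided α of 0.05, and the function name `ancova_power` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def ancova_power(delta, sd, n_total, r2=0.0, alpha=0.05):
    """Approximate power of a two-arm trial (normal approximation).

    Assumes equal allocation; r2 is the share of outcome variance
    explained by baseline covariates (0.0 means an unadjusted analysis).
    """
    std_normal = NormalDist()
    n_per_arm = n_total / 2
    residual_sd = sd * sqrt(1.0 - r2)          # SD left after adjustment
    se = residual_sd * sqrt(2.0 / n_per_arm)   # SE of the difference in means
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # The far-tail rejection probability is negligible here and is ignored.
    return std_normal.cdf(delta / se - z_crit)


# Inputs taken from the design documents summarised above.
protocol_sd15 = ancova_power(delta=3, sd=15, n_total=1040)
protocol_sd12 = ancova_power(delta=3, sd=12, n_total=1040)
tlc_1998 = ancova_power(delta=3, sd=14, n_total=608, r2=0.16)

print(f"Protocol, SD=15:  {protocol_sd15:.2f}")
print(f"Protocol, SD=12:  {protocol_sd12:.2f}")
print(f"TLC Group (1998): {tlc_1998:.2f}")
```

Under these assumptions the three calls come out close to 0.90, 0.98, and 0.82, which agrees with the design power stated in each document.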
The 96% Power Claim
The NEJM paper states:
“Our study was designed to have 82 percent power to detect a difference of three points between the treatment groups in IQ scores at three years of follow-up. The actual power of the study was 96 percent, because the number of children with data at 36 months of follow-up was higher than expected and the correlation between base-line and follow-up psychometric tests was better than expected.”
We back-calculated the covariate R² required to reach 96% power using the standard ANCOVA power formula with the published parameters (n ≈ 780, δ = 3 points, α = 0.05, two-sided). The result: R² ≈ 0.45 — nearly triple the R² = 0.16 assumed by the TLC Group (1998, p. 320). This is our calculation derived from the investigators’ own published numbers.
The achieved baseline-outcome R² was never reported in any TLC publication. The 96% power claim cannot be independently verified from the published data.
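A back-calculation of this kind can be sketched by inverting the normal-approximation power formula and solving for R². This is an illustration under stated assumptions rather than the exact calculation used here: it assumes equal allocation and a two-sided test, and because the outcome SD used in the back-calculation is not stated above, `sd` is left as an explicit input. The function name `required_r2` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def required_r2(target_power, delta, sd, n_total, alpha=0.05):
    """Covariate R² needed to reach target_power (normal approximation).

    Assumes equal allocation. A negative result means the target is
    already met without any covariate adjustment.
    """
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1.0 - alpha / 2.0) + std_normal.inv_cdf(target_power)
    se_needed = delta / z_sum                      # SE that yields the target power
    se_unadjusted = sd * sqrt(2.0 / (n_total / 2))  # SE with no covariates
    return 1.0 - (se_needed / se_unadjusted) ** 2


# Sanity check against the TLC Group (1998) design: 82% power, SD 14, n 608.
print(f"R² implied by the 1998 design: {required_r2(0.82, 3, 14, 608):.2f}")

# sd=15 is the protocol's value, used here only as an illustrative assumption.
print(f"R² needed for 96% power: {required_r2(0.96, 3, sd=15, n_total=780):.2f}")
```

The first call recovers an R² close to the 0.16 design assumption, which suggests the formula matches the one behind the published design. The second result is sensitive to the assumed SD and to the number of children actually analyzed: under this approximation a smaller SD or a larger n lowers the R² required.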
BSID-to-WPPSI Correlations in the Literature
An R² of 0.45 requires a correlation of r ≈ 0.67 between baseline developmental scores (BSID-II at 18–24 months) and the 36-month IQ outcome (WPPSI-R). Published studies of how well early BSID scores predict later IQ suggest this is implausibly high:
| Study | Age at BSID | Age at IQ | R² |
|---|---|---|---|
| Kvestad et al. 2022 (Nepal, n=529) | 18–23 months | 4 years | 0.20 |
| Kvestad et al. 2022 (Nepal, n=529) | 30–35 months | 4 years | 0.36 |
| Koshy et al. 2024 (India, n=251) | 24 months | 5 years | 0.16–0.26 |
| Koshy et al. 2024 (India, n=251) | 36 months | 5 years | 0.23–0.29 |
The R² = 0.16 assumed by the TLC Group (1998) was appropriate and consistent with developmental literature. The “better than expected” correlation claimed by Rogan et al. (2001) yielding 96% power is not supported by typical BSID-to-WPPSI correlations.
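For a single predictor, R² is the square of the correlation, so the comparison can be restated in terms of r. The short sketch below converts the figures from the table, taking the upper end of each reported range. It treats each R² as if it came from one baseline predictor, which is a simplifying assumption.

```python
from math import sqrt

# R² required for 96% power, as back-calculated in the text.
r_needed = sqrt(0.45)

# Upper end of each R² reported in the table above.
literature_r2 = {
    "Kvestad 2022, BSID at 18-23 months": 0.20,
    "Kvestad 2022, BSID at 30-35 months": 0.36,
    "Koshy 2024, BSID at 24 months": 0.26,
    "Koshy 2024, BSID at 36 months": 0.29,
}
literature_r = {study: sqrt(r2) for study, r2 in literature_r2.items()}

print(f"r needed for R² = 0.45: {r_needed:.2f}")
for study, r in literature_r.items():
    print(f"{study}: r = {r:.2f}")
```

Even the largest value in the table, measured at 30–35 months rather than at the TLC baseline age, corresponds to r = 0.60, below the roughly 0.67 that the 96% claim would require.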
The 0.83 Correlation Is Misleading
Rogan et al. (2001) report a correlation of 0.83 between 18-month and 36-month WPPSI-R scores. This figure may appear to support the high-power claim, but it is a same-instrument correlation over a short interval — used for imputation in 32 children who had WPPSI scores at 18 months but not 36 months.
This is not the baseline BSID-II to 36-month WPPSI-R correlation that would drive power in the ANCOVA model. Same-instrument test-retest correlations are expected to be high and are not informative about cross-instrument predictive validity.
Paternal IQ Was Not Measured
The ANCOVA model adjusted for maternal IQ, but paternal IQ was never collected. This matters because:
- Spouse IQ correlations average r ≈ 0.40 — substantially higher than for personality (~0.10) or physical traits (~0.20)
- Maternal and paternal IQ associations with offspring outcomes are similar in magnitude, and both independently predict child outcomes
- IQ heritability is approximately 40–50% in childhood
Adjusting only for maternal IQ leaves the paternal contribution to child IQ unexplained in the model, which further undermines the plausibility of achieving R² ≈ 0.45 with the available covariates.
The 3 IQ Point Target
Independent of the power claim, the TLC Group (1998, p. 320) powered the trial to detect a 3 IQ point difference. Three key meta-analyses were available when TLC was designed:
Needleman & Gatsonis (1990): Reported partial correlation coefficients (r = −0.15, 95% CI ±0.05) but explicitly stated: “Neither approach provides an overall estimate of the raw effect size, ie, of the average change in IQ units per unit change in lead exposure.”
Pocock et al. (1994): Found that a doubling of blood lead from 10 to 20 µg/dL was associated with a mean deficit of 1–2 IQ points.
Schwartz (1994): Found 2.6 IQ points lost per 10 µg/dL increase from 10 to 20 µg/dL. Critically, Schwartz found the effect varied by population:
| Subgroup | IQ Points Lost per 10 µg/dL |
|---|---|
| Studies with mean BLL ≤15 µg/dL | 3.23 ± 1.26 |
| Studies with mean BLL >15 µg/dL | 2.32 ± 0.40 |
| Non-disadvantaged populations | 2.89 ± 0.50 |
| Disadvantaged populations | 1.85 ± 0.92 |
For the TLC population — baseline BLL 26.2 µg/dL (Rogan et al. 2001, Table 1), predominantly low-income and minority (TLC Group 1998, Table 2) — the expected IQ gain using Schwartz (1994) for disadvantaged populations is approximately 2.1 points, and using Pocock et al. (1994) approximately 1.1–2.2 points. The 3 IQ point target was at the upper bound of what the literature supported.
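The blood lead difference behind these expected gains is not stated above. The sketch below shows how a per-10-µg/dL slope converts into an expected IQ gain, and what difference in BLL a given gain would imply. It assumes the effect scales linearly with the BLL difference, which is a simplification of the meta-analytic estimates, and the function names are illustrative.

```python
def expected_gain(slope_per_10, bll_difference):
    """IQ points gained for a given BLL difference (µg/dL), assuming linearity."""
    return slope_per_10 * bll_difference / 10.0


def implied_bll_difference(iq_gain, slope_per_10):
    """BLL difference (µg/dL) that would produce iq_gain, assuming linearity."""
    return 10.0 * iq_gain / slope_per_10


# Schwartz (1994) slope for disadvantaged populations, from the table above.
SCHWARTZ_DISADVANTAGED = 1.85

implied = implied_bll_difference(2.1, SCHWARTZ_DISADVANTAGED)
needed_for_target = implied_bll_difference(3.0, SCHWARTZ_DISADVANTAGED)

print(f"BLL difference implied by a 2.1-point gain: {implied:.1f} µg/dL")
print(f"BLL difference needed for a 3-point gain:   {needed_for_target:.1f} µg/dL")
```

Under the linearity assumption, the 2.1-point figure corresponds to a BLL difference of a little over 11 µg/dL, while a 3-point gain at the disadvantaged-population slope would need a difference of about 16 µg/dL.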
Nonlinear Dose-Response
Post-TLC research confirmed the supralinear dose-response relationship — the greatest IQ loss per unit of lead occurs at the lowest blood lead levels:
- Canfield et al. (2003): 7.4 IQ points lost per 10 µg/dL at BLL 1–10 µg/dL, but only 4.6 points across the full range
- Lanphear et al. (2005): The largest incremental deficit (6 IQ points) occurred with the first 10 µg/dL increase
The TLC trial, operating in the 20–44 µg/dL range, was attempting to detect an effect in the portion of the dose-response curve where effects per unit change are smallest.
Why This Matters
The 96% power claim serves a specific rhetorical function: it forecloses the possibility that the null result was due to insufficient statistical power. If the trial had 96% power, the argument goes, a true 3-point effect would almost certainly have been detected. But the claim rests on an unreported R² that is nearly triple the design assumption and well outside the range established by the developmental literature.
Compounding this, the 3 IQ point target was itself optimistic for this population. The dose-response literature available in 1994 — particularly Schwartz (1994) — predicted smaller effects in disadvantaged populations at higher BLLs. If the true expected effect was closer to 2 IQ points, the trial’s actual power to detect it drops substantially below 96%, regardless of the covariate adjustment.
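The size of that drop can be estimated without knowing the SD, sample size, or R², because a stated power at 3 points fixes the ratio of effect to standard error. The sketch below rescales that ratio to a smaller true effect. It uses a normal approximation with a two-sided α of 0.05, and the function name `rescaled_power` is illustrative.

```python
from statistics import NormalDist


def rescaled_power(power_at_design, design_delta, true_delta, alpha=0.05):
    """Power for true_delta, given the power stated for design_delta."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # Ratio design_delta / SE implied by the stated power.
    design_ratio = z_crit + std_normal.inv_cdf(power_at_design)
    true_ratio = design_ratio * true_delta / design_delta
    return std_normal.cdf(true_ratio - z_crit)


power_if_96 = rescaled_power(0.96, design_delta=3, true_delta=2)
power_if_82 = rescaled_power(0.82, design_delta=3, true_delta=2)

print(f"Power for 2 points if 96% at 3 points: {power_if_96:.2f}")
print(f"Power for 2 points if 82% at 3 points: {power_if_82:.2f}")
```

Taking the 96% claim at face value, power for a 2-point effect would be roughly 70%. Starting from the 82% design figure instead, it would be a little under 50%.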
A trial that was underpowered to detect a real effect would be expected to produce exactly the result that TLC produced: a null finding. That null finding was then used to conclude that chelation does not work, a conclusion that requires adequate power to support.
See also: BLL Separation Collapsed, Treatment Endpoint.
Source documents referenced on this page are available in the TLC Reference Library.