Panic Headlines: Alcohol Changes Face Shape, Induction Lowers IQ
Data to the rescue
Welcome to another entry in our new Panic Headlines series. Today we’re going to tackle prenatal alcohol usage and child face shape, and induced labor and IQ.
Both headlines this week are from the U.K. press. U.K.! Do better!
Reminder: please send us any panic headlines you need unpacked — ask@parentdata.org.
“Wine before pregnancy ‘changes baby’s face’”
A few weeks ago we had this headline in The Telegraph. The first sentence of the article does not disappoint: “Drinking just one small glass of wine a week in the three months before pregnancy may alter the face of your child, a study suggests.” The author follows up with this: “Scientists caution that the face acts as a ‘health mirror’ and the findings could indicate some deeper health issues…”
The study on which this is based appeared in the journal Human Reproduction. In short, the authors use data from a study called the Generation R Study, run in the Netherlands. In it, about 10,000 women are asked many pregnancy questions — including about alcohol consumption before and during pregnancy — and there are a lot of measurements taken of their children. For the purposes of this particular study, they use 3-D face images of children at ages 9 and 13.
The authors generate 200 face measurements (“traits”). They then analyze the relationship between these traits and prenatal alcohol exposure, including both drinking during pregnancy and before pregnancy. They find that, at age 9, several traits are significantly associated with prenatal alcohol exposure, including with pre-pregnancy exposure.
To understand what I see as the key problem with this paper, it is important to understand what we mean by “significant.” You’ll often see papers talk about a result being “significant at the 5% level.” Colloquially, people think of this as meaning the result is real or correct — this is how you know the result is to be believed.
However, there is a formal meaning here. A p-value of 5% means that if the true effect were zero, you would expect to see an effect at least this large only 5% of the time. Put differently: imagine a setting in which a treatment had no effect on the outcome. If you analyzed the relationship in 100 different samples, you’d expect about 5 of them to show a significant relationship, just by chance. This is because of how sampling works.
This is relevant here. Imagine that, in reality, there were no differences across groups in their facial features. Even so, if we study 200 different features, we’d expect about 10 of them (5% of 200) to show a relationship that is significant at the 5% level. Just by chance.
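To make the arithmetic concrete, here is a minimal Python sketch. It assumes the 200 tests are independent and that no trait has any real relationship with alcohol exposure, so every p-value is just a uniform random draw. Real facial traits are correlated with each other, so treat this as an illustration rather than a model of the paper. The function names are mine.

```python
import random

N_TRAITS = 200   # number of facial traits tested, as in the article
ALPHA = 0.05     # the 5% significance level


def expected_false_positives(n_tests, alpha):
    """Average number of 'significant' results when nothing is real."""
    return n_tests * alpha


def prob_at_least_one(n_tests, alpha):
    """Chance of one or more false positives (assumes independent tests)."""
    return 1 - (1 - alpha) ** n_tests


def simulate_false_positives(n_tests, alpha, n_studies, seed=0):
    """Count false positives in many imaginary studies with no true effects.

    Under the null hypothesis a p-value is uniform on [0, 1], so each test
    comes out 'significant' with probability alpha.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_studies):
        hits = sum(1 for _ in range(n_tests) if rng.random() < alpha)
        counts.append(hits)
    return counts


counts = simulate_false_positives(N_TRAITS, ALPHA, n_studies=2000)
average = sum(counts) / len(counts)

print("Expected false positives:", expected_false_positives(N_TRAITS, ALPHA))
print("Chance of at least one:", prob_at_least_one(N_TRAITS, ALPHA))
print("Simulated average per study:", average)
```

The simulated average lands close to 10, and the chance of finding at least one "significant" trait is essentially 100%.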
The authors know this, of course, and they run a standard adjustment for multiple hypothesis testing. But such adjustments are somewhat ad hoc, and there remains the real concern that when you look at so many different outcomes, across multiple age groups, you’re bound to find something.
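For a sense of what such an adjustment does, here is a sketch of two widely used ones, Bonferroni and Benjamini-Hochberg. I am not claiming either is the one these authors used, and the p-values below are made up purely for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni: compare each p-value with alpha divided by the test count."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg: controls the share of false discoveries.

    Sort the p-values, find the largest rank k whose p-value is at most
    (k / m) * alpha, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject


# Hypothetical p-values, not taken from the study.
example_p = [0.001, 0.008, 0.02, 0.045, 0.30]

unadjusted = [p < 0.05 for p in example_p]
bonferroni = bonferroni_reject(example_p)
bh = benjamini_hochberg_reject(example_p)

print("Unadjusted:", unadjusted)
print("Bonferroni:", bonferroni)
print("Benjamini-Hochberg:", bh)
```

The three approaches flag different numbers of results from the same p-values, which is part of why the choice of adjustment can feel ad hoc.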
Given this concern, it’s important to look for evidence of consistency in the results. If we were consistently seeing one facial feature jump out, across age groups and variations in the analysis, this would provide added confidence.
In fact, the results are all over the place. A few examples…
They run one analysis that relies on a binary measure of alcohol exposure and another that relies on a continuous measure of exposure. The two analyses pick up different facial features, making it look like drinking at all affects one thing, but drinking more affects something else (and not the first thing).
The effects only show up at age 9, not at age 13.
The data includes multiple ethnic groups. When they limit to only those of Dutch nationality, the effects are not only less significant but also different facial features show up.
There is a set of facial traits that are associated with exposure before pregnancy. However, some of these do not show up for babies who are also exposed during pregnancy. For this to hang together, it must be that the exposure during pregnancy somehow cancels out the exposure before.
This is all challenged as well by the fact that there is no theory behind it. There are reasons to think that heavy alcohol exposure during pregnancy could change face shape, but it is unclear by what biological mechanism exposure before pregnancy would matter.
In the end, this paper feels like a data-mining exercise. There is nothing consistent to hang our hat on. It’s scare-mongering by way of a version of p-hacking. Bad!
“Children born after induced labour ‘may score lower in tests at 12’”
This one is from The Guardian. The paper is in the journal Acta Obstetricia et Gynecologica Scandinavica.
The data for this paper is a great example of what is possible in some European countries with large and comprehensive data registries. It comes from the Netherlands. The authors have information on effectively all children born during the period from 2003 through 2008, linked to their school test scores at age 12. They see a lot of details about each birth: gestational age at birth, birth weight, other birth complications, and, importantly, whether the birth was induced. There are over 200,000 children analyzed. The scope of the data is really cool!
The authors analyze what they define as uncomplicated births — births with a head-down presentation, not complicated by hypertensive disorders, diabetes, or very small birth weight. They also excluded children with congenital abnormalities. And they excluded children who were not white (“to improve homogeneity of the data,” as the authors note), which both adds bias and is emblematic of a problematic focus in many research projects on non-diverse populations.
With this sample of births, they estimate the relationship between induction of labor and child test score at age 12 for each week of gestational age. So, for example, they take children born at 39 weeks of gestational age and compare those who were born after induced labor with those who were born after spontaneous labor.
The authors find that children born after induced labor have slightly lower test scores than those born after spontaneous labor.
There are two problems. First: there are demographic differences in induced and non-induced groups. The authors do not report that directly, but we can see that it must be true based on their results. Below, I show (for the babies born at 38 weeks of gestation) the relationship between induced labor and test scores. I show the raw relationship in the data; the relationship when they adjust for maternal age and socioeconomic status; and the relationship when they further adjust for maternal education.
As controls are added, the relationship gets smaller. Even very basic controls here — maternal education is in just three categories — make a huge difference in the effects. This tells us that there must be large differences in maternal education (and all the other variables) across induced and non-induced births. It also should make us concerned that there are other differences across these births (things like household income) that might be driving the results, but which the authors cannot adjust for.
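Here is a minimal simulated sketch of how this kind of pattern can arise. Every number in it is invented: I assume maternal education (three categories) affects both the chance of induction and the child's test score, and that induction itself has zero effect. None of this comes from the paper's data.

```python
import random
from statistics import mean


def simulate_births(n, seed=0):
    """Simulate (education, induced, score) rows with NO true induction effect."""
    rng = random.Random(seed)
    # Assumption: induction is more common at lower education levels.
    p_induced_by_education = {0: 0.40, 1: 0.25, 2: 0.10}
    rows = []
    for _ in range(n):
        education = rng.choice([0, 1, 2])  # 0 = low, 1 = medium, 2 = high
        induced = rng.random() < p_induced_by_education[education]
        # Assumption: score depends on education and noise, not on induction.
        score = 50 + 4 * education + rng.gauss(0, 10)
        rows.append((education, induced, score))
    return rows


def score_gap(rows):
    """Mean score of induced births minus mean score of spontaneous births."""
    induced = [score for _, was_induced, score in rows if was_induced]
    spontaneous = [score for _, was_induced, score in rows if not was_induced]
    return mean(induced) - mean(spontaneous)


def education_adjusted_gap(rows):
    """Average the gap within each education level, weighted by group size."""
    total = len(rows)
    adjusted = 0.0
    for level in (0, 1, 2):
        stratum = [row for row in rows if row[0] == level]
        adjusted += len(stratum) / total * score_gap(stratum)
    return adjusted


births = simulate_births(60000)
raw_gap = score_gap(births)
adjusted_gap = education_adjusted_gap(births)

print("Raw gap (induced minus spontaneous):", round(raw_gap, 2))
print("Gap after adjusting for education:", round(adjusted_gap, 2))
```

In this made-up world the raw comparison shows induced children scoring lower, and the gap shrinks toward zero once we compare within education groups. The real data may well have confounders we cannot see, which is exactly the worry.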
The other problem: although the authors exclude births with certain complications, in many cases there is a reason that a birth is induced. And especially at 37 or 38 weeks of pregnancy, that reason may well be something about complications in pregnancy. Without controlling for these complications — indeed, without observing them — we introduce another source of bias in the results.
Bottom line: this is a pretty standard correlation-isn’t-causation problem. Sigh. It’s a good example of how even great data cannot rescue a deeply flawed premise.
Concluding thoughts
Lessons for the day:
With enough statistical tests you can find your way to a significant result, but that doesn’t mean there is a true causal effect.
Correlation is still not causation.
The British press is no more data-literate than the American press.
Until next time!
Thank you for taking down the statistical meddling in the fetal alcohol study. As an AI researcher in the social sciences, I'll note that I also found their use of a (as far as I could tell) previously unvalidated AI algorithm to make these judgments extremely problematic as well. When based entirely on photographs with no additional information about FAS symptoms among this population of children, this AI can only possibly be as good as the researchers who trained it at recognizing so-called facial characteristics of FAS. Masking this fact by pretending the algorithm is uncovering something that humans did not train it to see in the first place is an excellent example of why we need to be extremely wary of uncritical technological solutionism.
All in all, as a fellow researcher, I am mildly disgusted that these authors chose to add their voices to the existing medical paternalism around alcohol and pregnancy. They could have presented this study more honestly for what it was: an interesting pilot for a technology that may help recognize cases of FAS that might otherwise go undiagnosed, in the context of other symptoms, but that pretty clearly shouldn't be used on its own to diagnose FAS.
I sometimes wonder if the way we teach scientific and intellectual history (“Everyone thought a certain way. Such and such a genius said something different and was widely persecuted/ridiculed. The lone dissenter turned out to be correct. How foolish the scientific/intellectual community was then, unlike us enlightened folk.”) has something to do with exactly how bad the general public is with scientific and media literacy. People are primed to think that every ill-conceived fringe view that is widely rejected by the scientific community is the work of the next Galileo. It’s almost like the more maligned the idea is, the more likely they are to think the person behind it is a misunderstood genius or being actively suppressed for speaking “the truth.” Is trying to teach people to look for authoritative sources or scholarly consensus counterproductive when what makes these ideas so appealing to a subset of people is that they *are* controversial? What tools do we have for teaching people to sort out the novel from the “horseshit”?