Abstract
Using a single pair or a small number of pairs of composite stimuli to assess the relationship between morphology and other traits is a common method in face research. Here, I use data simulation to demonstrate how this method inevitably leads to a high false positive rate, and how this problem is made worse by using a larger number of raters. I conclude by suggesting alternative methods for assessing the relationship between face morphology and individual traits.
Composite images (also called average or prototype images) can be created by morphing with software such as Psychomorph (Benson & Perrett, 1991a, 1991b, 1993; Tiddeman et al., 2001) or WebMorph (DeBruine, 2018, 2022). They can be a useful way to visualise the differences between groups of images (Figure 1).
Many studies have used composite images to investigate the link between face morphology and various traits, such as cooperation in an economic game (Little et al., 2013), sexual strategies (Boothroyd et al., 2008), voice pitch (Feinberg et al., 2005), ability to elicit gaze-cueing (Jones et al., 2011), or dark triad personality traits (Alper et al., 2021; Holtzman, 2011). Typically, a pair of face composites is created from two groups, such as cooperators and defectors in a prisoners’ dilemma game, or faces are rank-ordered on a continuous trait, such as score on a narcissism questionnaire, and some proportion of the top and bottom scorers are averaged together. Then, raters either rate the individual composites or assess which composite in the pair appears higher on some judgement of interest. This judgement can be related to the difference between the composites, such as cooperativeness or narcissism, or it can be another judgement that is hypothesised to be associated with the difference, such as attractiveness or dominance. If one image in the pair is rated significantly higher than the other image on the judgement in question, this is taken as evidence that the trait is associated with face morphology eliciting that judgement.
Despite its common use in face research, my own past research included, here I argue that this method produces extremely high false positive rates. Under not-unusual conditions, the false positive rate can approach 50% for directional hypotheses and 100% for non-directional hypotheses.
To explain why, I’ll start with an analogy that has nothing to do with faces (bear with me). Imagine a researcher predicts that women born on odd days are taller than women born on even days. Ridiculous, right? So let’s simulate some data assuming that isn’t true (see https://github.com/debruine/composites for the code used to create the examples in this paper). We will sample 20 women from a population with a mean height of 162 cm and a standard deviation of 7 cm (values for women in Scotland). Half are born on odd days and half on even days.
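For readers who want to follow along, here is a minimal sketch of this step (the seed and variable names are my own; the repository linked above has the code actually used):

```r
set.seed(42)  # arbitrary seed for reproducibility

n <- 20
height   <- rnorm(n, mean = 162, sd = 7)         # heights in cm
birthday <- rep(c("odd", "even"), each = n / 2)  # arbitrary group labels
```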
A t-test shows no significant difference (\(t_{13.42}\) = 1.23, \(p\) = .121, \(d\) = 0.55), which is unsurprising. We simulated the data from the same distribution, so we know for sure there is no real difference here.
Now we’re going to average the height of the women with odd and even birthdays. So if we create a full-body composite of women born on odd days, she would be 165.8 cm tall, and a composite of women born on even days would be 160.9 cm tall. If we ask raters to look at these two composites, side-by-side, and judge which one looks taller, what do you imagine would happen? It’s likely that nearly all of them would judge the odd-birthday composite as taller.
But let’s say that raters have to judge the composites independently, and they are pretty bad with height estimation, so their estimates for each composite have error with a standard deviation of 10 cm. We can simulate such ratings from 50 raters and then compare the estimates for the odd-birthday composite with the estimates for the even-birthday composite.
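One way to simulate this rating stage, under the assumptions just stated (each rater estimates both composites, with independent normal error of SD 10 cm):

```r
# the composite "height" is just the mean height of each birthday group
composite <- tapply(height, birthday, mean)

n_raters <- 50
est_odd  <- rnorm(n_raters, mean = composite["odd"],  sd = 10)
est_even <- rnorm(n_raters, mean = composite["even"], sd = 10)

# paired, directional test: do raters judge the odd composite as taller?
t.test(est_odd, est_even, paired = TRUE, alternative = "greater")
```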
Now the women with odd birthdays are significantly taller than the women with even birthdays (\(t_{49}\) = 2.61, \(p\) = .006, \(d\) = 0.53)! What changed? Essentially, we’re no longer testing whether women born on odd days are taller than those born on even days, but whether raters can perceive the chance difference in height between the pair of composites. As long as there is any difference between the composites that exceeds the perceptual threshold for detection, we can find a significant result with enough raters. The effect has a 50% chance of being in the predicted direction, and whatever result we find with this face pair is likely to be highly replicable in a new set of raters rating the same face pair.
Maybe this is just a fluke of the original sample? We can repeat the procedure above 10,000 times and compare the p-values from the individual method with those from the composite method. The individual method has the expected uniform distribution of p-values (Figure 2), as there is no difference between the two groups; the proportion of false positives is 4.97%, close to the alpha criterion of 0.05. However, the composite method produced a false positive rate of 18.4% with a directional hypothesis, and 28.1% with a non-directional hypothesis. And as we’ll see later, you can push the false positive rate to near 50% for directional hypotheses and 100% for non-directional hypotheses by increasing the number of raters.
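A sketch of that simulation (the helper function and object names are mine; the linked repository has the code actually used):

```r
# one replicate: p-values from the individual-level test and from the
# composite rating method (directional)
one_rep <- function(n = 20, n_raters = 50) {
  height   <- rnorm(n, 162, 7)
  birthday <- rep(c("odd", "even"), each = n / 2)
  p_ind <- t.test(height ~ birthday)$p.value         # individual method

  composite <- tapply(height, birthday, mean)
  est_odd  <- rnorm(n_raters, composite["odd"], 10)
  est_even <- rnorm(n_raters, composite["even"], 10)
  p_comp <- t.test(est_odd, est_even, paired = TRUE,
                   alternative = "greater")$p.value  # composite method
  c(individual = p_ind, composite = p_comp)
}

p <- replicate(10000, one_rep())
mean(p["individual", ] < .05)  # close to .05, as expected
mean(p["composite", ] < .05)   # inflated false positive rate
```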
A recent paper by Alper et al. (2021) used faces from the Faceaurus database (Holtzman, 2011) to test whether dark triad personality traits (Machiavellianism, narcissism, and psychopathy) are visible in the face. “Holtzman (2011) standardized the assessment scores, computed average scores of self- and peer-reports, and ranked the face images based on the resulting scores. Then, prototypes for each of the personality dimensions were created by digitally combining 10 faces with the highest, and 10 faces with the lowest scores on the personality trait in question (Holtzman, 2011).” This was done separately for male and female faces.
With 105 raters, Holtzman found that the ability to detect the composite higher in a dark triad trait was greater than chance for all three traits for both genders investigated. Alper and colleagues replicated these findings in three studies with 160, 318, and 402 raters; the two larger studies were pre-registered.
While I commend both Holtzman (2011) and Alper et al. (2021) for their transparency, data sharing, and material sharing, I argue that the original test has an effective N of 2, not 105, and that further replications using these images, regardless of number of raters or preregistered status, lend no further weight of evidence to the assertion that dark triad traits are visible in physical appearance.
To explain why, let’s simulate 100 datasets of self- and peer-assessed dark triad scores with the same structure as the original study. Each simulated dataset will have 48 women and 33 men whose Machiavellianism, narcissism, NPD, and psychopathy scores are correlated in the same way as in Holtzman (2011).
Next, we calculate the average dark triad score for each subject and create a “dark triad face morphology” score to represent the extent to which each subject’s face is perceived as high in dark triad traits. Importantly, in this simulation, the face morphology score has zero correlation with the average dark triad score. Individual samples will, of course, show non-zero correlations between facial morphology and dark triad traits by chance alone, but these will tend to be small and non-directional.
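A sketch of one such simulated dataset, using faux (DeBruine, 2023); the uniform correlation of 0.5 below is a placeholder for the actual correlation matrix taken from Holtzman (2011):

```r
library(faux)

sim_subjects <- function(n, r = 0.5) {
  dat <- rnorm_multi(
    n = n, vars = 4, mu = 0, sd = 1, r = r,
    varnames = c("mach", "narc", "npd", "psych")
  )
  dat$dt_avg <- rowMeans(dat)  # average dark triad score
  dat$morph  <- rnorm(n)       # face morphology: zero true correlation
  dat
}

females <- sim_subjects(48)
males   <- sim_subjects(33)
```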
Now we pick the 10 images with the highest and lowest scores for each trait for each gender and create composites of these images. However, since scores on the three dark triad traits are positively correlated, the three pairs of composite faces are not independent. Indeed, Holtzman (2011) states that five individuals were in all three low composites for the male faces, while the overlap was less extreme in other cases.
Here, we will assume that the face morphology that leads to perceptions of dark triad traits can be linearly combined. Even though this face morphology is totally unrelated to the dark triad personality scores, the composites will still differ in this morphology, some more than others, and half the time in the predicted direction.
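Under that linearity assumption, the morphology of each composite is simply the mean morphology score of its 10 constituent faces; a sketch for one trait and gender:

```r
library(dplyr)

high <- females |> slice_max(mach, n = 10)
low  <- females |> slice_min(mach, n = 10)

# chance difference in composite morphology, in SD units
mean(high$morph) - mean(low$morph)
```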
Following Holtzman, we will simulate raters for each replicate giving -5 to +5 ratings for which face in each pair looks more Machiavellian, narcissistic, or psychopathic. Each pairing will be rated twice by each rater.
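One way to sketch this rating stage; the mapping from composite morphology difference to mean rating and the error SD are illustrative assumptions, and the dependency between a rater’s two ratings is ignored for brevity:

```r
sim_ratings <- function(n_raters, morph_diff, error_sd = 2) {
  n_trials <- n_raters * 2                     # each rater rates each pairing twice
  rating <- rnorm(n_trials, mean = morph_diff, sd = error_sd)
  round(pmax(pmin(rating, 5), -5))             # clamp to the -5 to +5 scale
}

# e.g., test whether the mean rating differs from zero for one pair
t.test(sim_ratings(105, morph_diff = 0.31))
```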
By chance alone, some of these comparisons will be significant in the predicted direction.
People tend to show high agreement on stereotypical social perceptions from the physical appearance of faces, even when physical appearance is not meaningfully associated with the traits being judged (Jones et al., 2021; Todorov et al., 2008; Zebrowitz & Montepare, 2008). We can be sure that by chance alone, our two composites will be at least slightly different on any measure, even if they are drawn from identical populations.
A naive solution to this problem is to increase the number of raters, which should produce more accurate results, right? Actually, this makes the problem even worse. As you increase the number of raters, the power to detect even small (chance) differences in composites rises (Figure 8). Consequently, you can virtually guarantee significant results, even for tiny differences or traits that people are very bad at estimating.
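You can see this with a quick power calculation; assuming a chance composite difference that translates to an effect of d = 0.3 for raters:

```r
library(pwr)

# power of a directional paired test as the number of raters grows
sapply(c(25, 50, 100, 200, 400, 800), function(n) {
  pwr.t.test(n = n, d = 0.3, type = "paired",
             alternative = "greater")$power
})
```

With 200 raters, the power to “detect” this purely chance difference already exceeds 99%.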
How likely is it that there will be chance differences in the composites big enough to be a problem? More likely than you probably think, especially when there are a small number of stimuli in each composite. The smaller the number of stimuli that go into each composite, the larger the median (unsigned) size of this difference (Figure 9). With only 10 stimuli per composite (like the Faceaurus composites), the median unsigned effect size of the difference between composites from populations with no real difference is 0.31 (in units of SD of the original trait distribution). If our raters are accurate enough at perceiving this difference, or we run a very large number of raters, we are virtually guaranteed to find significant results every time. There is a 50% chance that these results will be in the predicted direction, and this direction will be replicable across different samples of raters for the same image set.
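This figure is easy to check by simulation (a sketch, drawing both composites from the same standard normal population):

```r
# median unsigned difference between two composite means when both are
# built from n_per_comp stimuli drawn from the same population
median_diff <- function(n_per_comp, reps = 1e4) {
  d <- replicate(reps, mean(rnorm(n_per_comp)) - mean(rnorm(n_per_comp)))
  median(abs(d))
}

median_diff(10)  # ~0.3 with 10 stimuli per composite
```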
So what does this mean for studies of the link between personality traits and facial appearance? The analogy with birth date and height holds. As long as there are facial morphologies that are even slightly consistently associated with the perception of a trait, then composites will not be identical in that morphology. Thus, even if that morphology is totally unassociated with the trait as measured by, e.g., personality scales or peer report (which is often the case), using the composite rating method will inflate the false positive rate for concluding a difference.
The smaller the number of stimuli that go into each composite, the greater the chance that they will be visibly different in morphology related to the judgement of interest, just by chance alone. The larger the number of raters, or the better raters are at detecting small differences in this morphology, the more likely it is that “detection” will be significantly above chance. Repeating this with a new set of raters does not increase the amount of evidence you have for the association between the face morphology and the measured trait: you’ve only measured it once in one population of faces. If raters are your unit of analysis, you are drawing conclusions about whether the population of raters can detect the difference between your stimuli; you cannot generalise this to new stimulus sets.
So how should researchers test for differences in facial appearance between groups? Here I discuss two alternative methods for investigating the relationship between traits and face morphology.
Assessment of individual face images, combined with analysis using mixed effects models, can allow you to simultaneously account for variance in both raters and stimuli, avoiding the inflated false positives of the composite method and the similar problem that occurs when ratings of individual stimuli are averaged before analysis (Barr, 2007). People often use the composite method when they have too many images for any one rater to rate, but cross-classified mixed models can analyse data from counterbalanced trials or randomised subset allocation.
Here we simulate data from a design where 200 faces from two trait groups are rated by 200 raters, in 10 counterbalanced batches, such that each rater only rates 10 faces from each trait group.
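A sketch of such a dataset (the variance components and the 0.5 trait effect are illustrative assumptions; the code actually used is in the linked repository):

```r
library(dplyr)

faces <- data.frame(
  face  = 1:200,
  trait = rep(c("low", "high"), each = 100),
  batch = rep(1:10, times = 20),
  f_i   = rnorm(200, sd = 1)   # random intercept for each face
)
raters <- data.frame(
  rater = 1:200,
  batch = rep(1:10, times = 20),
  r_i   = rnorm(200, sd = 1)   # random intercept for each rater
)

# each rater rates only the 20 faces (10 per trait group) in their batch
mixed_df <- inner_join(faces, raters, by = "batch",
                       relationship = "many-to-many") |>
  mutate(rating = 0.5 * (trait == "high") + f_i + r_i + rnorm(n(), sd = 2))
```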
The following mixed effects analysis accounts for the structure of the data above. Each rater does not have to rate each face in order for random effects of face and rater to be accounted for. See DeBruine & Barr (2021) for further discussion of the benefits of mixed effects models for this type of experimental design.
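The model call that produces output like the following (lmerTest supplies the Satterthwaite tests; the published analysis may code the trait contrast differently, hence the “high-low” label below):

```r
library(lmerTest)

mod <- lmer(rating ~ trait + (1 | face) + (1 + trait | rater),
            data = mixed_df)
summary(mod)
```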
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: rating ~ trait + (1 | face) + (1 + trait | rater)
## Data: mixed_df
##
## REML criterion at convergence: 17640.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.3893 -0.6484 0.0019 0.6382 3.5144
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## face (Intercept) 0.9566 0.9781
## rater (Intercept) 0.9121 0.9550
## trait.high-low 1.0014 1.0007 0.63
## Residual 3.8608 1.9649
## Number of obs: 4000, groups: face, 200; rater, 200
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) -0.06356 0.12672 285.93326 -0.502 0.61635
## trait.high-low 0.51804 0.16733 246.90590 3.096 0.00219 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## trat.hgh-lw -0.400
Another reason to use the composite rating method is when you are not ethically permitted to use individual faces in research, but are ethically permitted to use non-identifiable composite images. In this case, you can generate a large number of random composite pairs to construct the chance distribution. The equivalent of a p-value for this method is the proportion of randomly paired composites that show a result at least as extreme as that of your target pair. While this method is too tedious to use when constructing composite faces manually, scripting with webmorphR (DeBruine, 2022) allows you to automate such a task.
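The logic of that chance distribution is simple, whatever tool generates the composites (a sketch; `target_effect` and `random_effects` stand in for effect sizes computed from ratings of the target pair and of many randomly assigned pairs):

```r
# empirical p-value: proportion of random-pair effects at least as
# extreme as the effect for the target composite pair
empirical_p <- function(target_effect, random_effects) {
  mean(abs(random_effects) >= abs(target_effect))
}
```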
Face images are from the open-source, CC-BY licensed image set, the Face Research Lab London Set (DeBruine & Jones, 2017b). All software is available open source. The code to reproduce this paper can be found at https://github.com/debruine/composites.
We used R (Version 4.3.1; R Core Team, 2023) and the R-packages broom (Version 1.0.5; Robinson et al., 2023), dplyr (Version 1.1.2; Wickham, François, et al., 2023), faux (Version 1.2.1; DeBruine, 2023), ggplot2 (Version 3.4.2; Wickham, 2016), glue (Version 1.6.2; Hester & Bryan, 2022), kableExtra (Version 1.3.4; Zhu, 2021), lme4 (Version 1.1.34; Bates et al., 2015), lmerTest (Version 3.1.3; Kuznetsova et al., 2017), Matrix (Version 1.6.0; Bates et al., 2023), papaja (Version 0.1.1; Aust & Barth, 2022), purrr (Version 1.0.1; Wickham & Henry, 2023), pwr (Version 1.3.0; Champely, 2020), tidyr (Version 1.3.0; Wickham, Vaughan, et al., 2023), tinylabels (Version 0.2.3; Barth, 2022), webmorphR (Version 0.1.1; DeBruine, 2022; DeBruine & Jones, 2017a), and webmorphR.stim (Version 0.0.0.9002; DeBruine & Jones, 2017a) to produce this manuscript.