Composite images (also called average or prototype images) can be created by morphing with software such as Psychomorph (Benson & Perrett, 1991a, 1991b, 1993; Tiddeman et al., 2001) or WebMorph (DeBruine, 2018, 2022). They can be a useful way to visualise the differences between groups of images (Figure 1).

Figure 1: Example composite images, each comprising 4 faces of a specific gender and ethnic group.

Many studies have used composite images to investigate the link between face morphology and various traits, such as cooperation in an economic game (Little et al., 2013), sexual strategies (Boothroyd et al., 2008), voice pitch (Feinberg et al., 2005), ability to elicit gaze-cueing (Jones et al., 2011), or dark triad personality traits (Alper et al., 2021; Holtzman, 2011). Typically, a pair of face composites is created from two groups, such as cooperators and defectors in a prisoner’s dilemma game, or faces are rank-ordered on a continuous trait, such as score on a narcissism questionnaire, and some proportion of the top and bottom scorers are averaged together. Then, raters either rate the individual composites or assess which composite in the pair appears higher on some judgement of interest. This judgement can be related to the difference between the composites, such as cooperativeness or narcissism, or it can be another judgement that is hypothesised to be associated with the difference, such as attractiveness or dominance. If one image in the pair is rated significantly higher than the other image on the judgement in question, this is taken as evidence that the trait is associated with face morphology eliciting that judgement.

Although this method is common in face research, my own past research included, I argue here that it produces extremely high false positive rates. Under not-unusual conditions, the false positive rate can approach 50% for directional hypotheses and 100% for non-directional hypotheses.

Birthdate and Height

To explain why, I’ll start with an analogy that has nothing to do with faces (bear with me). Imagine a researcher predicts that women born on odd days are taller than women born on even days. Ridiculous, right? So let’s simulate some data assuming that isn’t true (see https://github.com/debruine/composites for the code used to create the examples in this paper). We will sample 20 women from a population with a mean height of 162 cm and a standard deviation of 7 cm (values for women in Scotland). Half are born on odd days and half on even days.

A t-test shows no significant difference (\(t_{13.42}\) = 1.23, \(p\) = .121, \(d\) = 0.55), which is unsurprising. We simulated the data from the same distribution, so we know for sure there is no real difference here.
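
For readers who want to try this step themselves, here is a minimal sketch in base R. It is not the code from the repository linked above; the seed and the object names (`heights`, `individual_test`) are my own choices for illustration, and the fractional degrees of freedom reported above suggest a Welch test, which is what `t.test()` runs by default. Because the seed differs, the statistics will not match those reported above.

```r
# Simulate heights for 20 women drawn from ONE population,
# so there is no true difference between the birthday groups.
set.seed(8675309)  # arbitrary seed, used only for reproducibility

n_per_group <- 10
heights <- data.frame(
  birthday = rep(c("odd", "even"), each = n_per_group),
  height   = rnorm(2 * n_per_group, mean = 162, sd = 7)
)

# Put "odd" first so that alternative = "greater" tests odd - even > 0.
heights$birthday <- factor(heights$birthday, levels = c("odd", "even"))

# One-tailed Welch t-test of the (false) prediction that odd > even.
individual_test <- t.test(height ~ birthday, data = heights,
                          alternative = "greater")
individual_test
```

Re-running this with different seeds gives a feel for how often a sample of this size shows a sizeable chance difference in either direction.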

Now we’re going to average the height of the women with odd and even birthdays. So if we create a full-body composite of women born on odd days, she would be 165.8 cm tall, and a composite of women born on even days would be 160.9 cm tall. If we ask raters to look at these two composites, side-by-side, and judge which one looks taller, what do you imagine would happen? It’s likely that nearly all of them would judge the odd-birthday composite as taller.

But let’s say that raters have to judge the composites independently, and they are pretty bad with height estimation, so their estimates for each composite have error with a standard deviation of 10 cm. We can simulate such ratings from 50 raters and then compare the estimates for the odd-birthday composite with the estimates for the even-birthday composite.

Now the women with odd birthdays are significantly taller than the women with even birthdays (\(t_{49}\) = 2.61, \(p\) = .006, \(d\) = 0.53)! What changed? Essentially, we’re no longer testing whether women born on odd days are taller than those born on even days, but whether raters can perceive the chance difference in height between the pair of composites. As long as there is any difference between the composites that exceeds the perceptual threshold for detection, we can find a significant result with enough raters. The effect has a 50% chance of being in the predicted direction, and whatever result we find with this face pair is likely to be highly replicable in a new set of raters rating the same face pair.
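
The change in what is being tested can be made concrete with a short base R sketch. I assume here that a composite can be reduced to the mean height of its group, and that the 49 degrees of freedom above come from a paired test across raters; both are assumptions for illustration, as are the seed and the variable names.

```r
set.seed(8675309)  # arbitrary seed

# Heights come from one population, so any group difference is chance.
odd_heights  <- rnorm(10, mean = 162, sd = 7)
even_heights <- rnorm(10, mean = 162, sd = 7)

# ASSUMPTION: a full-body composite is as tall as the mean of its group.
odd_composite  <- mean(odd_heights)
even_composite <- mean(even_heights)

# Each of 50 raters estimates each composite's height with 10 cm SD error.
n_raters <- 50
odd_estimates  <- rnorm(n_raters, mean = odd_composite,  sd = 10)
even_estimates <- rnorm(n_raters, mean = even_composite, sd = 10)

# The raters are now the unit of analysis; the two composites are fixed,
# so the test asks whether raters can see the chance difference.
composite_test <- t.test(odd_estimates, even_estimates,
                         paired = TRUE, alternative = "greater")
composite_test
```

Note that the 20 original women never enter the test directly: only the two composite means do, however many raters are added.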

Maybe this is just a fluke of the original sample? We can repeat the procedure above 10000 times and check the p-values of the individual analysis versus the composite method. We can see that the individual method has the expected uniform distribution of p-values (Figure 2), as there is no difference between the two groups. The proportion of false positives is 4.97%, which is close to the alpha criterion of 0.05. However, the composite method produced a false positive rate of 18.4% with a directional hypothesis, and 28.1% with a non-directional hypothesis. And as we’ll see later, you can increase the false positive rate to near 50% for directional hypotheses and 100% for non-directional hypotheses by increasing the number of raters.
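
A sketch of this repetition in base R follows. The function name `simulate_once()` and its arguments are hypothetical, the composite is again treated as a group mean, and I use fewer repetitions than the 10000 above to keep the run time short, so the rates it prints will differ somewhat from those reported.

```r
set.seed(1)  # arbitrary seed

simulate_once <- function(n_per_group = 10, n_raters = 50,
                          pop_mean = 162, pop_sd = 7, rater_sd = 10) {
  odd  <- rnorm(n_per_group, pop_mean, pop_sd)
  even <- rnorm(n_per_group, pop_mean, pop_sd)

  # Individual method: compare the measured heights directly.
  p_individual <- t.test(odd, even, alternative = "greater")$p.value

  # Composite method: raters estimate the two group means with error.
  odd_est  <- rnorm(n_raters, mean(odd),  rater_sd)
  even_est <- rnorm(n_raters, mean(even), rater_sd)
  p_dir    <- t.test(odd_est, even_est, paired = TRUE,
                     alternative = "greater")$p.value
  p_nondir <- t.test(odd_est, even_est, paired = TRUE)$p.value

  c(individual = p_individual,
    composite_directional = p_dir,
    composite_nondirectional = p_nondir)
}

n_sims <- 2000  # fewer than in the text, for speed
p_values <- t(replicate(n_sims, simulate_once()))

# Proportion of simulations with p < .05 under each method.
false_positive_rate <- colMeans(p_values < 0.05)
round(false_positive_rate, 3)
```

Increasing `n_raters` in the call to `simulate_once()` is a quick way to see the composite rates climb while the individual rate stays near alpha.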

Figure 2: Individual versus composite method. The individual method shows the expected uniform distribution of p-values, while the composite method has an inflated false positive rate.

Simulating a Real Example

A recent paper by Alper et al. (2021) used faces from the Faceaurus database (Holtzman, 2011) to test whether dark triad personality traits (Machiavellianism, narcissism, and psychopathy) are visible in the face. “Holtzman (2011) standardized the assessment scores, computed average scores of self- and peer-reports, and ranked the face images based on the resulting scores. Then, prototypes for each of the personality dimensions were created by digitally combining 10 faces with the highest, and 10 faces with the lowest scores on the personality trait in question (Holtzman, 2011).” This was done separately for male and female faces.

With 105 raters, Holtzman found that the ability to detect the composite higher in a dark triad trait was greater than chance for all three traits for both genders investigated. Alper and colleagues replicated these findings in three studies with rater numbers of 160, 318, and 402, the larger two of which were pre-registered.

While I commend both Holtzman (2011) and Alper et al. (2021) for their transparency, data sharing, and material sharing, I argue that the original test has an effective N of 2, not 105, and that further replications using these images, regardless of number of raters or preregistered status, lend no further weight of evidence to the assertion that dark triad traits are visible in physical appearance.

Simulating Rating Data

To explain why, let’s simulate 100 datasets of self- and peer-assessed dark triad scores with the same structure as the original study. Each simulated dataset will have 48 women and 33 men whose Machiavellianism, narcissism, NPD, and psychopathy scores are correlated in the same way as Holtzman (2011).

Figure 3: Correlation structure of the original data and simulations.

Simulate No Relationship

Next, calculate the average dark triad score for each subject and create a “dark triad face morphology” score to represent the extent to which each subject’s face is perceived as high in dark triad traits. Importantly, in this simulation, the face morphology score will have zero correlation to the average dark triad score. Individual samples will, of course, show non-zero correlations between facial morphology and dark triad traits by chance alone, but these will tend to be small and non-directional.
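
One way to set up such a dataset in base R is sketched here. The correlation matrix is HYPOTHETICAL (a uniform 0.5), standing in for the observed correlations from Holtzman (2011), which are not reproduced in this text; the function `simulate_sample()` and the column names are likewise my own.

```r
set.seed(2011)  # arbitrary seed

# HYPOTHETICAL correlations among the four scores (not Holtzman's values).
traits <- c("mach", "narc", "npd", "psych")
rho <- matrix(0.5, nrow = 4, ncol = 4, dimnames = list(traits, traits))
diag(rho) <- 1

simulate_sample <- function(n, rho) {
  # Independent normals, then a Cholesky factor to induce the correlations.
  z <- matrix(rnorm(n * ncol(rho)), nrow = n)
  scores <- z %*% chol(rho)
  colnames(scores) <- colnames(rho)

  dat <- as.data.frame(scores)
  dat$dark_triad <- rowMeans(scores)  # average dark triad score
  dat$morphology <- rnorm(n)          # generated independently of the scores
  dat
}

women <- simulate_sample(48, rho)
men   <- simulate_sample(33, rho)

# The population correlation is zero, but each sample's is not.
cor(women$dark_triad, women$morphology)
cor(men$dark_triad, men$morphology)
```

The two sample correlations printed at the end are pure sampling noise, which is exactly what the composites will later pick up.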

Figure 4: The first 8 simulated replicates, showing no systematic relationship between dark triad trait scores and facial morphology.

Create Composites

Now pick the 10 images with the highest scores and the 10 images with the lowest scores for each trait for each gender and create composites of these images. Because scores on the three dark triad traits are positively correlated, the three pairs of composite faces are not independent. Indeed, Holtzman (2011) states that five individuals were in all three low composites for the male faces, while the overlap was less extreme in other cases.

Here, we will assume that the face morphology that leads to perceptions of dark triad traits can be linearly combined. Even though this face morphology is totally unrelated to the dark triad personality scores, the composites will still differ in this morphology, some more than others, and half the time in the predicted direction.
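
Under the linear-combination assumption, a composite's morphology is just the mean morphology of the faces in it, which makes this step easy to sketch in base R. The function `composite_difference()` is hypothetical, it uses a single trait with standard normal scores for simplicity, and the 48 faces correspond to the female sample size above.

```r
set.seed(42)  # arbitrary seed

composite_difference <- function(n_faces = 48, n_per_composite = 10) {
  trait      <- rnorm(n_faces)  # measured personality score
  morphology <- rnorm(n_faces)  # perceived morphology, unrelated to trait

  ranked <- order(trait)
  low  <- head(ranked, n_per_composite)  # lowest scorers on the trait
  high <- tail(ranked, n_per_composite)  # highest scorers on the trait

  # ASSUMPTION: composite morphology = mean morphology of its faces.
  mean(morphology[high]) - mean(morphology[low])
}

diffs <- replicate(1000, composite_difference())

mean(diffs > 0)       # share of differences in the "predicted" direction
median(abs(diffs))    # typical size of the chance difference
```

The differences are centred on zero but are rarely close to zero, which is the core of the problem.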

Figure 5: Differences in average dark triad face morphology between the high and low dark triad trait groups for the first 8 replicates.

Figure 6: The distribution of the difference between high and low composites in average dark triad face morphology, across the 100 replicates. The blue line shows the minimum effect size for which there is 80% power for 105 raters to detect the difference.

Simulate Composite Ratings

Following Holtzman, we will simulate raters for each replicate giving -5 to +5 ratings for which face in each pair looks more Machiavellian, narcissistic, or psychopathic. Each pairing will be rated twice by each rater.

By chance alone, some of the values will be significant in the predicted direction.
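
A rough sketch of this rating step in base R is given here. The response model is HYPOTHETICAL: I assume each rating is a normally distributed latent impression centred on the chance morphology difference, rounded and clipped to the -5 to +5 scale. I also treat the six pairs (three traits by two genders) as independent, which the overlap described above says they are not. The function `simulate_rater_means()` is my own name.

```r
set.seed(105)  # arbitrary seed

simulate_rater_means <- function(composite_diff, n_raters = 105,
                                 n_repeats = 2, rating_sd = 2) {
  # HYPOTHETICAL response model: latent impression, rounded and clipped.
  latent  <- rnorm(n_raters * n_repeats, mean = composite_diff, sd = rating_sd)
  ratings <- pmin(pmax(round(latent), -5), 5)

  # Average each rater's repeated ratings of the same pair.
  rowMeans(matrix(ratings, nrow = n_raters, ncol = n_repeats))
}

# Chance differences for six composite pairs of 10 faces each
# (difference of two means of 10 standard normal scores).
composite_diffs <- rnorm(6, mean = 0, sd = sqrt(2 / 10))

p_values <- sapply(composite_diffs, function(d) {
  t.test(simulate_rater_means(d), mu = 0, alternative = "greater")$p.value
})

sum(p_values < 0.05)  # number of pairs "detected" in the predicted direction
```

Changing `rating_sd` stands in for how good raters are at perceiving the difference; smaller values make chance differences easier to detect.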

Figure 7: Distribution of replicates with 0 to 6 significant results in the predicted direction (one-tailed one-sample t-tests with alpha = 0.05).

People tend to show high agreement on stereotypical social perceptions from the physical appearance of faces, even when physical appearance is not meaningfully associated with the traits being judged (Jones et al., 2021; Todorov et al., 2008; Zebrowitz & Montepare, 2008). We can be sure that by chance alone, our two composites will be at least slightly different on any measure, even if they are drawn from identical populations.

More Raters is Even Worse

A naive solution to this problem is to increase the number of raters, which should produce more accurate results, right? Actually, this makes the problem even worse. As you increase the number of raters, the power to detect even small (chance) differences in composites rises (Figure 8). Consequently, you can virtually guarantee significant results, even for tiny differences or traits that people are very bad at estimating.
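
The relationship between rater numbers and power can be checked with `power.t.test()` from base R's stats package. The rater counts and effect sizes in this sketch are illustrative values of my own choosing, not those used for the figure.

```r
rater_counts <- c(25, 50, 100, 200, 400, 800)
effect_sizes <- c(0.1, 0.2, 0.3)  # chance differences, in SD units

# Power of a one-tailed, one-sample t-test for each combination.
power_table <- sapply(effect_sizes, function(d) {
  sapply(rater_counts, function(n) {
    power.t.test(n = n, delta = d, sd = 1, sig.level = 0.05,
                 type = "one.sample", alternative = "one.sided")$power
  })
})
dimnames(power_table) <- list(raters = rater_counts, d = effect_sizes)

round(power_table, 2)
```

Reading down any column shows power rising with the number of raters while the difference between the composites stays fixed.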

Figure 8: Power curves for a one-tailed, one-sample t-test.

How likely is it that there will be chance differences in the composites big enough to be a problem? More likely than you probably think, especially when there is only a small number of stimuli in each composite. The smaller the number of stimuli that go into each composite, the larger the median (unsigned) size of this difference (Figure 9). With only 10 stimuli per composite (like the Faceaurus composites), the median unsigned effect size of the difference between composites from populations with no real difference is 0.31 (in units of SD of the original trait distribution). If our raters are accurate enough at perceiving this difference, or we run a very large number of raters, we are virtually guaranteed to find significant results every time. There is a 50% chance that these results will be in the predicted direction, and this direction will be replicable across different samples of raters for the same image set.
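
The size of these chance differences can be approximated in a few lines of base R. This sketch assumes a standard normal trait and treats a composite as the mean of its stimuli; the function name `median_unsigned_diff()` and the set of composite sizes are hypothetical. Under those assumptions the difference between two composites of n stimuli is normal with variance 2/n, so the median unsigned difference also has a closed form, which works out to roughly 0.30 for n = 10.

```r
set.seed(9)  # arbitrary seed

median_unsigned_diff <- function(n_per_composite, n_sims = 10000) {
  diffs <- replicate(
    n_sims,
    mean(rnorm(n_per_composite)) - mean(rnorm(n_per_composite))
  )
  median(abs(diffs))
}

n_values  <- c(5, 10, 20, 50, 100)
simulated <- sapply(n_values, median_unsigned_diff)

# Closed form: difference ~ N(0, 2/n), and the median of |X| for a
# zero-mean normal is qnorm(0.75) times its standard deviation.
analytic <- qnorm(0.75) * sqrt(2 / n_values)

round(data.frame(n = n_values, simulated, analytic), 3)
```

The square root in the closed form explains why adding stimuli helps only slowly: quadrupling the number of faces per composite merely halves the typical chance difference.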

Figure 9: Simulated data showing the distribution of effect sizes for the difference between pairs of composites sampled from the same distribution (i.e., no real effect). Points show the median unsigned effect size.

Implications for Face Research

So what does this mean for studies of the link between personality traits and facial appearance? The analogy with birth date and height holds. As long as there are facial morphologies that are even slightly consistently associated with the perception of a trait, then composites will not be identical in that morphology. Thus, even if that morphology is totally unassociated with the trait as measured by, e.g., personality scales or peer report (which is often the case), using the composite rating method will inflate the false positive rate for concluding a difference.

The smaller the number of stimuli that go into each composite, the greater the chance that they will be visibly different in morphology related to the judgement of interest, just by chance alone. The larger the number of raters or the better raters are at detecting small differences in this morphology, the more likely that “detection” will be significantly above chance. Repeating this with a new set of raters does not increase the amount of evidence you have for the association between the face morphology and the measured trait. You’ve only measured it once in one population of faces. If raters are your unit of analysis, you are drawing conclusions about whether the population of raters can detect the difference between your particular stimuli; you cannot generalise this to new stimulus sets.

So how should researchers test for differences in facial appearance between groups? Here I discuss two alternative methods for investigating the relationship between traits and face morphology.

Assessment of individual faces

Assessment of individual face images, combined with analysis using mixed effects models, can allow you to simultaneously account for variance in both raters and stimuli, avoiding the inflated false positives of the composite method and the similar problem that occurs when ratings of individual stimuli are averaged before analysis (Barr, 2007). People often use the composite method when they have too many images for any one rater to rate, but cross-classified mixed models can analyse data from counterbalanced trials or randomised subset allocation.

Here we simulate data from a design where 200 faces from two trait groups are rated by 200 raters, in 10 counterbalanced batches, such that each rater only rates 10 faces from each trait group.

The following mixed effects analysis accounts for the structure of the data above. Each rater does not have to rate each face in order for random effects of face and rater to be accounted for. See DeBruine & Barr (2021) for further discussion of the benefits of mixed effects models for this type of experimental design.

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: rating ~ trait + (1 | face) + (1 + trait | rater)
##    Data: mixed_df
## 
## REML criterion at convergence: 17640.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.3893 -0.6484  0.0019  0.6382  3.5144 
## 
## Random effects:
##  Groups   Name           Variance Std.Dev. Corr
##  face     (Intercept)    0.9566   0.9781       
##  rater    (Intercept)    0.9121   0.9550       
##           trait.high-low 1.0014   1.0007   0.63
##  Residual                3.8608   1.9649       
## Number of obs: 4000, groups:  face, 200; rater, 200
## 
## Fixed effects:
##                 Estimate Std. Error        df t value Pr(>|t|)   
## (Intercept)     -0.06356    0.12672 285.93326  -0.502  0.61635   
## trait.high-low   0.51804    0.16733 246.90590   3.096  0.00219 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## trat.hgh-lw -0.400
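
For readers who want to experiment with this design, here is a sketch of how a counterbalanced dataset of the same shape could be simulated in base R and passed to a model with the formula shown in the output above. It is not the code that produced that output: all generative values are HYPOTHETICAL, the true trait effect is set to zero, and the trait factor uses R's default treatment coding rather than the high-low contrast in the output. The model call runs only if the lmerTest package is installed.

```r
set.seed(200)  # arbitrary seed

n_faces <- 200; n_raters <- 200; n_batches <- 10

faces <- data.frame(
  face  = seq_len(n_faces),
  trait = rep(c("low", "high"), each = n_faces / 2),
  # 10 faces from each trait group go into each batch
  batch = rep(rep(seq_len(n_batches), each = 10), times = 2),
  face_intercept = rnorm(n_faces, 0, 1)
)

raters <- data.frame(
  rater = seq_len(n_raters),
  batch = rep(seq_len(n_batches), each = n_raters / n_batches),
  rater_intercept = rnorm(n_raters, 0, 1),
  rater_slope     = rnorm(n_raters, 0, 1)
)

# Joining on batch means each rater sees only the 20 faces in their batch.
mixed_df <- merge(raters, faces, by = "batch")
mixed_df$trait <- factor(mixed_df$trait, levels = c("low", "high"))
trait_code <- ifelse(mixed_df$trait == "high", 0.5, -0.5)

# HYPOTHETICAL generative model with NO fixed effect of trait.
mixed_df$rating <- mixed_df$rater_intercept + mixed_df$face_intercept +
  mixed_df$rater_slope * trait_code + rnorm(nrow(mixed_df), 0, 2)

nrow(mixed_df)  # 4000 rows: 200 raters x 20 faces each

if (requireNamespace("lmerTest", quietly = TRUE)) {
  fit <- lmerTest::lmer(rating ~ trait + (1 | face) + (1 + trait | rater),
                        data = mixed_df)
  print(summary(fit))
}
```

Because the random intercept for face absorbs chance differences between the particular faces sampled, the fixed effect of trait is tested against stimulus variability as well as rater variability.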

Random Face Pairs

Another reason to use the composite rating method is when you are not ethically permitted to use individual faces in research, but are ethically permitted to use non-identifiable composite images. In this case, you can generate a large number of random composite pairs to construct the chance distribution. The equivalent to a p-value for this method is the proportion of the randomly paired composites that produce a result at least as extreme as that of your target pair. While this method is too tedious to use when constructing composite faces manually, scripting with webmorphR (DeBruine, 2022) allows you to automate such a task.
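
The logic of the chance distribution can be sketched in base R with a numeric stand-in for the images. Here `score_composite_pair()` is a HYPOTHETICAL placeholder for the whole pipeline of building two composite images and collecting ratings of them; in a real study each random split would be made into an actual pair of composites and rated. The 20 faces, the split into two groups of 10, and the two-sided comparison are assumptions for illustration.

```r
set.seed(20)  # arbitrary seed

# HYPOTHETICAL stand-in for "make two composites and have them rated":
# returns the difference in mean morphology between the two groups.
score_composite_pair <- function(group_a, group_b, morphology) {
  mean(morphology[group_a]) - mean(morphology[group_b])
}

morphology  <- rnorm(20)  # one value per face
target_high <- 1:10       # faces in the target "high" composite
target_low  <- 11:20      # faces in the target "low" composite
observed <- score_composite_pair(target_high, target_low, morphology)

# Chance distribution: randomly reassign the same 20 faces to two groups.
n_random_pairs <- 1000
chance <- replicate(n_random_pairs, {
  shuffled <- sample(20)
  score_composite_pair(shuffled[1:10], shuffled[11:20], morphology)
})

# Proportion of random pairs at least as extreme as the target pair.
p_equivalent <- mean(abs(chance) >= abs(observed))
p_equivalent
```

For a directional hypothesis, the last comparison would instead be `mean(chance >= observed)`.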

Figure 10: Five random pairs of composites from a sample of 20 faces (10 in each composite). Can you spot any differences?

Open Resources

Face images are from the open-source, CC-BY licensed image set, the Face Research Lab London Set (DeBruine & Jones, 2017b). All software is available open source. The code to reproduce this paper can be found at https://github.com/debruine/composites.

We used R (Version 4.3.1; R Core Team, 2023) and the R-packages broom (Version 1.0.5; Robinson et al., 2023), dplyr (Version 1.1.2; Wickham, François, et al., 2023), faux (Version 1.2.1; DeBruine, 2023), ggplot2 (Version 3.4.2; Wickham, 2016), glue (Version 1.6.2; Hester & Bryan, 2022), kableExtra (Version 1.3.4; Zhu, 2021), lme4 (Version 1.1.34; Bates et al., 2015), lmerTest (Version 3.1.3; Kuznetsova et al., 2017), Matrix (Version 1.6.0; Bates et al., 2023), papaja (Version 0.1.1; Aust & Barth, 2022), purrr (Version 1.0.1; Wickham & Henry, 2023), pwr (Version 1.3.0; Champely, 2020), tidyr (Version 1.3.0; Wickham, Vaughan, et al., 2023), tinylabels (Version 0.2.3; Barth, 2022), webmorphR (Version 0.1.1; DeBruine, 2022; DeBruine & Jones, 2017a), and webmorphR.stim (Version 0.0.0.9002; DeBruine & Jones, 2017a) to produce this manuscript.

References

Alper, S., Bayrak, F., & Yilmaz, O. (2021). All the dark triad and some of the big five traits are visible in the face. Personality and Individual Differences, 168, 110350. https://doi.org/10.1016/j.paid.2020.110350
Aust, F., & Barth, M. (2022). papaja: Prepare reproducible APA journal articles with R Markdown. https://github.com/crsh/papaja
Barr, D. J. (2007). Generalizing over encounters. In The Oxford handbook of psycholinguistics. Oxford University Press, USA.
Barth, M. (2022). tinylabels: Lightweight variable labels. https://cran.r-project.org/package=tinylabels
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01
Bates, D., Maechler, M., & Jagan, M. (2023). Matrix: Sparse and dense matrix classes and methods. https://CRAN.R-project.org/package=Matrix
Benson, P. J., & Perrett, D. I. (1991a). Perception and recognition of photographic quality facial caricatures: Implications for the recognition of natural images. European Journal of Cognitive Psychology, 3(1), 105–135.
Benson, P. J., & Perrett, D. I. (1991b). Synthesising continuous-tone caricatures. Image and Vision Computing, 9(2), 123–129.
Benson, P. J., & Perrett, D. I. (1993). Extracting prototypical facial images from exemplars. Perception, 22(3), 257–262.
Boothroyd, L. G., Jones, B. C., Burt, D. M., DeBruine, L. M., & Perrett, D. I. (2008). Facial correlates of sociosexuality. Evolution and Human Behavior, 29, 211–218. https://doi.org/10.1016/j.evolhumbehav.2007.12.009
Champely, S. (2020). pwr: Basic functions for power analysis. https://CRAN.R-project.org/package=pwr
DeBruine, L. M. (2018). Webmorph: Beta release 2 (Version v0.0.0.9001). Zenodo. https://doi.org/10.5281/zenodo.1162670
DeBruine, L. M. (2022). webmorphR: Reproducible stimuli. https://CRAN.R-project.org/package=webmorphR
DeBruine, L. M. (2023). faux: Simulation for factorial designs. Zenodo. https://doi.org/10.5281/zenodo.2669586
DeBruine, L. M., & Barr, D. J. (2021). Understanding mixed-effects models through data simulation. Advances in Methods and Practices in Psychological Science, 4(1), 2515245920965119.
DeBruine, L. M., & Jones, B. C. (2017a). Face Research Lab London Set. figshare. https://doi.org/10.6084/m9.figshare.5047666.v5
DeBruine, L. M., & Jones, B. C. (2017b). Face Research Lab London Set. figshare. https://doi.org/10.6084/m9.figshare.5047666.v5
Feinberg, D. R., Jones, B. C., Little, A. C., Burt, D. M., & Perrett, D. I. (2005). Manipulation of fundamental and formant frequencies influence the attractiveness of human male voices. Animal Behaviour, 69, 561–568. https://doi.org/10.1016/j.anbehav.2004.06.012
Hester, J., & Bryan, J. (2022). glue: Interpreted string literals. https://CRAN.R-project.org/package=glue
Holtzman, N. S. (2011). Facing a psychopath: Detecting the dark triad from emotionally-neutral faces, using prototypes from the Personality Faceaurus. Journal of Research in Personality, 45(6), 648–654.
Jones, B. C., DeBruine, L. M., Flake, J. K., Liuzza, M. T., Antfolk, J., Arinze, N. C., Ndukaihe, I. L. G., Bloxsom, N. G., Lewis, S. C., Foroni, F., et al. (2021). To which world regions does the valence–dominance model of social perception apply? Nature Human Behaviour, 5(1), 159–169.
Jones, B. C., Main, J. C., Little, A. C., & DeBruine, L. M. (2011). Further evidence that facial cues of dominance modulate gaze-cuing in human observers. Swiss Journal of Psychology, 70, 193–197.
Kuznetsova, A., Brockhoff, P. B., & Christensen, R. H. B. (2017). lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software, 82(13), 1–26. https://doi.org/10.18637/jss.v082.i13
Little, A. C., Jones, B. C., DeBruine, L. M., & Dunbar, R. I. M. (2013). Accuracy in discrimination of self-reported cooperators using static facial information. Personality and Individual Differences, 54, 507–512. https://doi.org/10.1016/j.paid.2012.10.018
R Core Team. (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Robinson, D., Hayes, A., & Couch, S. (2023). broom: Convert statistical objects into tidy tibbles. https://CRAN.R-project.org/package=broom
Tiddeman, B. P., Burt, D. M., & Perrett, D. I. (2001). Prototyping and transforming facial textures for perception research. IEEE Computer Graphics and Applications, 21(5), 42–50.
Todorov, A., Said, C. P., Engell, A. D., & Oosterhof, N. N. (2008). Understanding evaluation of faces on social dimensions. Trends in Cognitive Sciences, 12(12), 455–460.
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A grammar of data manipulation. https://CRAN.R-project.org/package=dplyr
Wickham, H., & Henry, L. (2023). purrr: Functional programming tools. https://CRAN.R-project.org/package=purrr
Wickham, H., Vaughan, D., & Girlich, M. (2023). tidyr: Tidy messy data. https://CRAN.R-project.org/package=tidyr
Zebrowitz, L. A., & Montepare, J. M. (2008). Social psychological face perception: Why appearance matters. Social and Personality Psychology Compass, 2(3), 1497–1517.
Zhu, H. (2021). kableExtra: Construct complex table with ’kable’ and pipe syntax. https://CRAN.R-project.org/package=kableExtra