Replicability and Generalisability in Face Research

Lisa DeBruine


In this talk, I will discuss several initiatives to increase the replicability and generalisability of face research, with a special focus on big team science efforts such as the Psychological Science Accelerator and ManyFaces. I will also argue for reproducible stimulus construction and introduce webmorphR, an R package for reproducibly scripting face stimulus creation. Finally, I will explain how a common methodology in face research, the composite method, produces very high false positive rates, and describe alternatives, including mixed effects models for analysing ratings of individual faces.

Psychological Science Accelerator

Jones, B.C., DeBruine, L.M., Flake, J.K. et al. (2021). To which world regions does the valence–dominance model of social perception apply? Nature Human Behaviour 5, 159–169.

Which face looks more trustworthy?

Which face looks more responsible?

Which face looks more dominant?

Which face looks more intelligent?

Which face looks more caring?

Face Ratings

  1. Attractive
  2. Weird
  3. Mean
  4. Trustworthy
  5. Aggressive
  6. Caring
  7. Emotionally stable
  8. Unhappy
  9. Responsible
  10. Sociable
  11. Dominant
  12. Confident
  13. Intelligent

Todorov et al. (2008)

Valence-Dominance Model

How sociable (i.e., friendly or agreeable in company; companionable) is this person?
not at all

Hoe sociabel (d.w.z. vriendelijk of prettig in de omgang, gezellig) is deze persoon?
helemaal niet
helemaal erg

Study Stats

  • >3M data points
  • 12,660 participants
  • 11,570 post-exclusion
  • 126 labs
  • 44 countries
  • 25 languages
  • 243 authors


Ben Jones

Lisa DeBruine

Jess Flake

Patrick Forscher

Nicholas Coles

Chris Chartier


PCA Loadings

Original Data

Western Europe

PCA Loadings

Principal Components Analysis shows little regional variability

EFA Loadings

Exploratory Factor Analysis shows more regional variability
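
The contrast between the two analyses can be illustrated with a small base R sketch. Everything here is an assumption for illustration: the ratings are simulated from two hypothetical latent dimensions with made-up weights, and `factanal()` with a promax rotation stands in for whatever EFA settings the study used. It shows only how the two kinds of loadings are obtained, not the study's results.

```r
set.seed(1)  # arbitrary seed, for a reproducible illustration only

traits <- c("attractive", "weird", "mean", "trustworthy", "aggressive",
            "caring", "emotionally_stable", "unhappy", "responsible",
            "sociable", "dominant", "confident", "intelligent")

# Hypothetical latent scores for 120 simulated faces
n_faces   <- 120
valence   <- rnorm(n_faces)
dominance <- rnorm(n_faces)

# Hypothetical weights of each trait on the two latent dimensions
w_val <- c(0.8, -0.7, -0.7, 0.9, -0.6, 0.9, 0.8, -0.7, 0.8, 0.8, -0.2, 0.6, 0.7)
w_dom <- c(0.1,  0.0,  0.4, -0.1, 0.7, -0.2, 0.1,  0.0, 0.1, 0.1,  0.9, 0.7, 0.2)

# Simulated mean rating of each face on each trait (latent signal plus noise)
ratings <- sapply(seq_along(traits), function(i) {
  w_val[i] * valence + w_dom[i] * dominance + rnorm(n_faces, sd = 0.5)
})
colnames(ratings) <- traits

# PCA: loadings are eigenvectors scaled by component standard deviations
pca <- prcomp(ratings, scale. = TRUE)
pca_loadings <- pca$rotation[, 1:2] %*% diag(pca$sdev[1:2])

# EFA: maximum-likelihood factor analysis with an oblique rotation
efa <- factanal(ratings, factors = 2, rotation = "promax")
efa_loadings <- unclass(loadings(efa))

round(cbind(pca_loadings, efa_loadings), 2)
```

Running the same two steps separately on each region's ratings would give one loading matrix per region and method, which is what the figures compare.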

Secondary Data Challenge

  • Examining the “attractiveness halo effect” - Carlotta Batres, Victor Shiramizu (Current Psychology)
  • Region- and Language-Level ICCs for Judgments of Faces - Neil Hester and Eric Hehman (Psychological Science)
  • Variance & Homogeneity of Facial Trait Space Across World Regions - Sally Xie and Eric Hehman (Psychological Science)
  • The Facial Width-to-Height Ratio (fWHR) and Perceived Dominance and Trustworthiness: Moderating Role of Social Identity Cues (Gender and Race) and Ecological Factor (Pathogen Prevalence) - Subramanya Prasad Chandrashekar
  • Is facial width-to-height ratio reliably associated with social inferences? A large cross-national examination - Patrick Durkee and Jessica Ayers
  • Population diversity is associated with trustworthiness impressions from faces - Jared Martin, Adrienne Wood, and DongWon Oh (Psychological Science)
  • Do regional gender and racial biases predict gender and racial biases in social face judgments? - DongWon Oh and Alexander Todorov
  • Hierarchical Modelling of Facial Perceptions: A Secondary Analysis of Aggressiveness Ratings - Mark Adkins, Nataly Beribisky, Stefan Bonfield, and Linda Farmus

Blog Post


ManyFaces is a recently formed big team science group for face perception and face recognition research.

Stimulus Meta-Database

Stimulus Collection

Face image sets tend to suffer from one or more of:

  1. a lack of age and ethnic diversity
  2. insufficient diversity of poses or expressions
  3. a lack of standardisation (e.g., different lighting, backgrounds, camera-to-head distance, and other photographic properties) that makes it impossible to combine image sets
  4. restricted ability to share
  5. unethical procurement

Protocol Development

Kit (~£800 total)

📷 camera: Canon EOS 250D Digital SLR Camera with 18-55mm IS STM Lens (£649)

💾 memory card: SanDisk 32GB SDHC Card (£9)

🌈 color checker: Calibrite ColorChecker Classic ~A4 (£66)

💡 stand/light: Fovitec Bi-Colour LED Ring Light Kit (£71)

Reproducible Stimuli

DeBruine, L. M., Holzleitner, I. J., Tiddeman, B., & Jones, B. C. (2022). Reproducible Methods for Face Research. PsyArXiv.

Vague Methods

Each of the images was rendered in gray-scale and morphed to a common shape using an in-house program based on bi-linear interpolation (see e.g., Gonzalez & Woods, 2002). Key points in the morphing grid were set manually, using a graphics program to align a standard grid to a set of facial points (eye corners, face outline, etc.). Images were then subject to automatic histogram equalization. (Burton et al. 2005, 263)


These pictures were edited using Adobe Photoshop 6.0 to remove external features (hair, ears) and create a uniform grey background. (Sforza et al. 2010, 150)

The averaged composites and blends were sharpened in Adobe Photoshop to reduce any blurring introduced by blending. (Rhodes et al. 2001, 615)

Scriptable Methods

The average pixel intensity of each image (ranging from 0 to 255) was set to 128 with a standard deviation of 40 using the SHINE toolbox (function lumMatch) (Willenbockel et al., 2010) in MATLAB (version, R2013a). (Visconti di Oleggio Castello et al. 2014, 2)
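
A description like the one above is scriptable because the whole operation reduces to a few lines of code. As a rough sketch in R (not the SHINE toolbox's `lumMatch` implementation; `match_luminance()` and the random test image are hypothetical), matching mean and standard deviation is a linear rescaling of pixel intensities:

```r
# Rescale pixel intensities to a target mean and standard deviation.
# Hypothetical helper for illustration; not part of SHINE or webmorphR.
match_luminance <- function(img, target_mean = 128, target_sd = 40) {
  z <- (img - mean(img)) / sd(img)  # standardise intensities
  z * target_sd + target_mean       # rescale to the target distribution
}

set.seed(1)  # arbitrary seed
# A made-up 64x64 greyscale "image" with intensities between 0 and 255
img <- matrix(runif(64 * 64, min = 0, max = 255), nrow = 64)

matched <- match_luminance(img)
c(mean = mean(matched), sd = sd(matched))
```

Real images may produce values outside 0 to 255 after rescaling. These would need clipping, which shifts the mean and SD slightly.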

We used the GraphicConverterTM application to crop the images around the cat face and make them all 1024x1024 pixels. One of the challenges of image matching is to do this process automatically. (Paluszek and Thomas 2019, 214)

Commercial Morphing

The faces were carefully marked with 112 nodes in FantaMorph™, 4th version: 28 nodes (face outline), 16 (nose), 5 (each ear), 20 (lips), 11 (each eye), and 8 (each eyebrow). To create the prototypes, I used FantaMorph Face Mixer, which averages node locations across faces. Prototypes are available online, in the Personality Faceaurus. (Holtzman 2011a, 650)


library(webmorphR)

orig <- demo_stim() # load demo images
mirrored <- mirror(orig)
cropped  <- crop(orig, width = 0.75, height = 0.75)
resized  <- resize(orig, 0.75)
rotated  <- rotate(orig, degrees = c(90, 180))
padded   <- pad(orig, 30, fill = c("hotpink", "dodgerblue"))
grey     <- greyscale(orig)



demo_stim() |> mask(fill = "black")

Custom Mask

demo_stim() |>
  mask(mask = c("eyes", "mouth"),
       fill = "#00000099",
       reverse = TRUE)

“Standard” Oval Mask

demo_stim() |> 
  greyscale() |>
  subset_tem(features("face")) |> # ignore hair, neck and ears
  crop_tem(50) |>                 # crop to 50px around template
  mask_oval(fill = "grey40")


faces <- load_stim_neutral(22:26) 
aligned <- faces |> align(fill = "dodgerblue")

c(faces, aligned) |> plot(nrow = 2)

Alignment with Patch Fill

faces |> align(fill = patch(faces))


neu_orig <- load_stim_neutral() |>
  add_info(webmorphR.stim::london_info) |>
  subset(face_gender == "female") |> 
  subset(face_eth == "black") |> subset(1:5) 

smi_orig <- load_stim_smiling() |>
  add_info(webmorphR.stim::london_info) |>
  subset(face_gender == "female") |> 
  subset(face_eth == "black") |> subset(1:5)

all <- c(neu_orig, smi_orig) |>
  auto_delin("dlib70", replace = TRUE)

aligned <- all |>
  align(procrustes = TRUE, fill = patch(all)) |>
  crop(.6, .8, y_off = 0.05)

neu_avg <- subset(aligned, 1:5) |> avg(texture = FALSE)
smi_avg <- subset(aligned, 6:10) |> avg(texture = FALSE)



steps <- continuum(
  from_img = neu_avg, 
  to_img = smi_avg, 
  from = -0.5, 
  to = 1.5, 
  by = 0.25
)

Word Stimuli

# make a vector of the words and colours they should print in
colours <- c(red = "red3", 
             green = "darkgreen", 
             blue = "dodgerblue3")

# make vector of labels (each word in each colour)
labels <- names(colours) |> rep(each = 3)

# make blank 800x200px images and add labels
stroop <- blank(3*3, 800, 200) |>
  label(labels,
        size = 100, 
        color = colours, 
        weight = 700,
        gravity = "center")

Face Composites

DeBruine, L. M. (2023). The Composite Method Produces High False Positive Rates. PsyArXiv.


Raters judged the composite of people who self-reported a high probability of cooperating in a prisoner's dilemma to be the more likely cooperator about 62% of the time (Little et al. 2013).

Women’s Height

n <- 10
mean <- 162
sd <- 7

# for reproducible simulation

odd  <- rnorm(n, mean, sd)
even <- rnorm(n, mean, sd)

A t-test shows no significant difference (\(t_{13.42}\) = 1.23, \(p\) = .121, \(d\) = 0.55), which is unsurprising. We simulated the data from the same distribution, so we know for sure there is no real difference here.
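
The comparison above can be run as a Welch t-test in base R. This is a minimal sketch under stated assumptions. The seed is arbitrary, because the seed behind the reported values is not shown, so the statistics will differ from those quoted. The one-tailed alternative and the pooled-SD Cohen's d are guesses at how the reported values were computed. The variables are named `mean_height` and `sd_height` here to avoid masking base R's `mean()` and `sd()`.

```r
set.seed(42)  # assumed seed; not the one used for the reported values

n           <- 10
mean_height <- 162
sd_height   <- 7

# Two groups drawn from the same population
odd  <- rnorm(n, mean_height, sd_height)
even <- rnorm(n, mean_height, sd_height)

# Welch t-test (unequal variances), one-tailed: odd > even
height_test <- t.test(odd, even, alternative = "greater")

# Cohen's d using the pooled standard deviation
pooled_sd <- sqrt((var(odd) + var(even)) / 2)
d <- (mean(odd) - mean(even)) / pooled_sd

height_test
d
```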

Composite Height

Odd Composite: 165.8 cm 🧍‍♀️
Even Composite: 160.9 cm 🧍‍♀️

Rating Height

rater_n <- 50 # number of raters
error_sd <- 10 # rater error

# for reproducible simulation

# add the error to the composite mean heights
odd_est  <- mean(odd) + 
  rnorm(rater_n, 0, error_sd)
even_est <- mean(even) + 
  rnorm(rater_n, 0, error_sd)

Now the women with odd birthdays are significantly taller than the women with even birthdays (\(t_{49}\) = 2.61, \(p\) = .006, \(d\) = 0.53)!
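
A sketch of the test behind this result, assuming (from the 49 degrees of freedom with 50 raters) that each rater estimates both composites and the estimates are compared with a paired, one-tailed t-test. The seed is an arbitrary assumption, so the exact values will not match those reported.

```r
set.seed(42)  # assumed seed; not the one used for the reported values

# Two groups of 10 heights from the same population, as in the example above
odd  <- rnorm(10, 162, 7)
even <- rnorm(10, 162, 7)

rater_n  <- 50  # number of raters
error_sd <- 10  # rater error

# Each rater estimates the height of each composite with some error
odd_est  <- mean(odd)  + rnorm(rater_n, 0, error_sd)
even_est <- mean(even) + rnorm(rater_n, 0, error_sd)

# Paired, one-tailed t-test across raters
rating_test <- t.test(odd_est, even_est, paired = TRUE,
                      alternative = "greater")
rating_test
```

The ratings test only whether these two particular composites differ. Any chance difference between the composite means is treated as a real effect, and adding raters makes it easier to detect.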

Is this a fluke?

Individual versus composite method. The individual method shows the expected uniform distribution of p-values, while the composite method has an inflated false positive rate.
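
To check whether the result is a fluke, the whole procedure can be repeated many times under the null hypothesis. The sketch below is one way to do that in base R, not necessarily the simulation behind the figure. The seed, the 1000 repetitions, the two-tailed tests and the function name `sim_once()` are all assumptions.

```r
set.seed(2023)  # arbitrary seed

# Simulate one study with no true group difference and return the p-value
# from each analysis method
sim_once <- function(n = 10, mean_height = 162, sd_height = 7,
                     rater_n = 50, error_sd = 10) {
  odd  <- rnorm(n, mean_height, sd_height)
  even <- rnorm(n, mean_height, sd_height)

  # Individual method: compare the individual heights directly
  p_individual <- t.test(odd, even)$p.value

  # Composite method: raters estimate each composite's (mean) height
  odd_est  <- mean(odd)  + rnorm(rater_n, 0, error_sd)
  even_est <- mean(even) + rnorm(rater_n, 0, error_sd)
  p_composite <- t.test(odd_est, even_est, paired = TRUE)$p.value

  c(individual = p_individual, composite = p_composite)
}

# 2 x 1000 matrix of p-values, one row per method
p_values <- replicate(1000, sim_once())

# Proportion of significant results when there is no real difference
false_positive_rate <- rowMeans(p_values < 0.05)
false_positive_rate
```

With these settings, the individual method should stay near the nominal 5% while the composite method is well above it. Passing a larger `rater_n` to `sim_once()` tends to push the composite rate higher still.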

Dark Triad

Holtzman (2011a), replicated by Alper, Bayrak, and Yilmaz (2021)

Simulate Composite Ratings

Following Holtzman (2011a), we simulated 100 sets of 6 “image pairs” with no actual difference in appearance, and 105 raters giving -5 to +5 ratings for which face in each pair looks more Machiavellian, narcissistic, or psychopathic. By chance alone, some of the values will be significant in the predicted direction.

More Raters?

Composite Size

With only 10 stimuli per composite (like the Faceaurus composites), the median unsigned difference between composites from populations with no real difference is 0.31 SD.
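
An analytic approximation gives a figure of roughly this size. It assumes a normally distributed trait and independent sampling, which may not match the simulation behind the value above. The difference between two means of n = 10 has a standard deviation of sqrt(2/10) SD, and the median of the absolute value of a normal variable is its standard deviation times the 75th percentile of the standard normal.

```r
composite_n <- 10  # stimuli per composite

# SD of the difference between two composite means, in trait SD units
sd_of_difference <- sqrt(2 / composite_n)

# Median of |X| for X ~ N(0, s) is s * qnorm(0.75)
median_abs_difference <- qnorm(0.75) * sd_of_difference
median_abs_difference

# How the expected chance difference shrinks with larger composites
sizes <- c(10, 20, 50, 100)
data.frame(composite_n = sizes,
           median_abs_difference = qnorm(0.75) * sqrt(2 / sizes))
```

This comes out at about 0.30 SD for 10 stimuli per composite, close to the simulated figure quoted above.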

Mixed Effects Models

DeBruine LM, Barr DJ. (2021). Understanding Mixed-Effects Models Through Data Simulation. Advances in Methods and Practices in Psychological Science. 4(1).
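
As a rough sketch of the alternative, ratings of individual faces can be analysed with crossed random effects for faces and raters, so that face-to-face variation counts as sampling variability rather than being averaged away. The data here are simulated with hypothetical parameters (20 faces, 50 raters, made-up variance components, no true group difference), and the model assumes the lme4 package is installed.

```r
set.seed(7)  # arbitrary seed

n_faces  <- 20
n_raters <- 50

# Each face belongs to one group and has its own true deviation
faces <- data.frame(
  face_id     = factor(seq_len(n_faces)),
  group       = rep(c("odd", "even"), each = n_faces / 2),
  face_effect = rnorm(n_faces, 0, 7)
)

# Each rater has a general tendency to give high or low estimates
raters <- data.frame(
  rater_id     = factor(seq_len(n_raters)),
  rater_effect = rnorm(n_raters, 0, 5)
)

# Fully crossed design: every rater rates every individual face
dat <- merge(faces, raters, by = NULL)
dat$rating <- 162 + dat$face_effect + dat$rater_effect +
  rnorm(nrow(dat), 0, 10)

# Fit the mixed effects model if lme4 is available
if (requireNamespace("lme4", quietly = TRUE)) {
  fit <- lme4::lmer(rating ~ group + (1 | face_id) + (1 | rater_id),
                    data = dat)
  print(summary(fit)$coefficients)
}
```

Because `face_id` is a random effect, the group estimate is tested against variability among faces, which the composite method cannot take into account.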

Random Composites

Five random pairs of composites from a sample of 20 faces (10 in each composite). Can you spot any differences?

Thank You!



Alper, Sinan, Fatih Bayrak, and Onurcan Yilmaz. 2021. “All the Dark Triad and Some of the Big Five Traits Are Visible in the Face.” Personality and Individual Differences 168: 110350.
Burton, A Mike, Rob Jenkins, Peter JB Hancock, and David White. 2005. “Robust Representations for Face Recognition: The Power of Averages.” Cognitive Psychology 51 (3): 256–84.
DeBruine, Lisa M. 2023. “The Composite Method Produces High False Positive Rates.” PsyArXiv.
DeBruine, Lisa M., and Dale J. Barr. 2021. “Understanding Mixed-Effects Models Through Data Simulation.” Advances in Methods and Practices in Psychological Science 4 (1): 2515245920965119.
DeBruine, Lisa M, Iris J Holzleitner, Bernard Tiddeman, and Benedict C Jones. 2022. “Reproducible Methods for Face Research.” PsyArXiv.
Gonzalez, Rafael C, Richard E Woods, et al. 2002. “Digital Image Processing.” Prentice Hall Upper Saddle River, NJ.
Holtzman, Nicholas S. 2011a. “Facing a Psychopath: Detecting the Dark Triad from Emotionally-Neutral Faces, Using Prototypes from the Personality Faceaurus.” Journal of Research in Personality 45 (6): 648–54.
Little, Anthony C., Benedict C. Jones, Lisa M. DeBruine, and Robin I. M. Dunbar. 2013. “Accuracy in Discrimination of Self-Reported Cooperators Using Static Facial Information.” Personality and Individual Differences 54: 507–12.
Paluszek, Michael, and Stephanie Thomas. 2019. “Pattern Recognition with Deep Learning.” In MATLAB Machine Learning Recipes, 209–30. Springer.
Rhodes, Gillian, Sakiko Yoshikawa, Alison Clark, Kieran Lee, Ryan McKay, and Shigeru Akamatsu. 2001. “Attractiveness of Facial Averageness and Symmetry in Non-Western Cultures: In Search of Biologically Based Standards of Beauty.” Perception 30 (5): 611–25.
Sforza, Anna, Ilaria Bufalari, Patrick Haggard, and Salvatore M Aglioti. 2010. “My Face in Yours: Visuo-Tactile Facial Stimulation Influences Sense of Identity.” Social Neuroscience 5 (2): 148–62.
Todorov, Alexander, Chris P Said, Andrew D Engell, and Nikolaas N Oosterhof. 2008. “Understanding Evaluation of Faces on Social Dimensions.” Trends in Cognitive Sciences 12 (12): 455–60.
Visconti di Oleggio Castello, Matteo, J Swaroop Guntupalli, Hua Yang, and M Ida Gobbini. 2014. “Facilitated Detection of Social Cues Conveyed by Familiar Faces.” Frontiers in Human Neuroscience 8: 678.