Replicability and Generalisability in Face Research

debruine.github.io/talks/rep-gen-faces/

Lisa DeBruine
tech.lgbt/@debruine

Abstract

In this talk, I will discuss several initiatives to increase the replicability and generalisability of research on faces, with a special focus on big team science efforts, such as the Psychological Science Accelerator and ManyFaces. I will also make an argument for reproducible stimulus construction and introduce webmorphR, an R package for reproducibly scripting face stimulus creation. Additionally, I will explain how a common methodology in face research, the composite method, produces very high false positive rates, and describe alternatives, including the use of mixed effects models for analysing individual face ratings.

Psychological Science Accelerator

Jones, B.C., DeBruine, L.M., Flake, J.K. et al. (2021). To which world regions does the valence–dominance model of social perception apply? Nature Human Behaviour 5, 159–169. https://doi.org/10.1038/s41562-020-01007-2

Which face looks more trustworthy?

Which face looks more trustworthy?

Which face looks more responsible?

Which face looks more responsible?

Which face looks more dominant?

Which face looks more dominant?

Which face looks more intelligent?

Which face looks more intelligent?

Which face looks more caring?

Which face looks more caring?

Face Ratings

  1. Attractive
  2. Weird
  3. Mean
  4. Trustworthy
  5. Aggressive
  6. Caring
  7. Emotionally stable
  8. Unhappy
  9. Responsible
  10. Sociable
  11. Dominant
  12. Confident
  13. Intelligent

Todorov et al. (2008)

Valence-Dominance Model

How sociable (i.e., friendly or agreeable in company; companionable) is this person?
not at all
very

Hoe sociabel (d.w.z. vriendelijk of prettig in de omgang, gezellig) is deze persoon?
helemaal niet
helemaal erg

(The same question and scale anchors in Dutch, one of the study's 25 languages.)

Study Stats

  • >3M data points
  • 12,660 participants
  • 11,570 post-exclusion
  • 126 labs
  • 44 countries
  • 25 languages
  • 243 authors

Team

Ben Jones

Lisa DeBruine

Jess Flake

Patrick Forscher

Nicholas Coles

Chris Chartier

CRediT

PCA Loadings

Original Data

Western Europe

PCA Loadings

Principal Components Analysis shows little regional variability

EFA Loadings

Exploratory Factor Analysis shows more regional variability

Secondary Data Challenge

  • Examining the “attractiveness halo effect” - Carlotta Batres, Victor Shiramizu (Current Psychology)
  • Region- and Language-Level ICCs for Judgments of Faces - Neil Hester and Eric Hehman (Psychological Science)
  • Variance & Homogeneity of Facial Trait Space Across World Regions - Sally Xie and Eric Hehman (Psychological Science)
  • The Facial Width-to-Height Ratio (fWHR) and Perceived Dominance and Trustworthiness: Moderating Role of Social Identity Cues (Gender and Race) and Ecological Factor (Pathogen Prevalence) - Subramanya Prasad Chandrashekar
  • Is facial width-to-height ratio reliably associated with social inferences? A large cross-national examination - Patrick Durkee and Jessica Ayers
  • Population diversity is associated with trustworthiness impressions from faces - Jared Martin, Adrienne Wood, and DongWon Oh (Psychological Science)
  • Do regional gender and racial biases predict gender and racial biases in social face judgments? - DongWon Oh and Alexander Todorov
  • Hierarchical Modelling of Facial Perceptions: A Secondary Analysis of Aggressiveness Ratings - Mark Adkins, Nataly Beribisky, Stefan Bonfield, and Linda Farmus

Blog Post

ManyFaces

https://manyfaces.team

ManyFaces is a recently formed big team science group for face perception and face recognition research.

Broadly, the aim of ManyFaces is to improve, diversify, and crowdsource key aspects of face research, including perception and recognition. This involves, for example, the collection and use of face stimuli: sharing existing stimulus sets, standardising stimulus collection procedures, and organising stimulus collection across multiple labs to obtain larger and more diverse face stimulus sets. ManyFaces also aims to crowdsource data collection across our members’ labs to test key research questions in face perception and recognition, enabling larger-scale designs, more diverse participant samples, and more generalisable findings. Finally, we aim to organise training workshops for key methods (e.g., morphing) and analyses (e.g., mixed effects models) used in face research.

Stimulus Meta-Database

https://osf.io/mbqt3/

The stimulus meta-database working group has compiled a guide to face stimulus meta-databases and resource lists. Various researchers have created lists or meta-databases documenting the broad variety of face stimulus sets that are available for research use. However, these lists vary in how comprehensive they are and in the type of information they provide about each stimulus set. Our guide therefore provides an overview of the most useful of these lists, noting key information such as the kinds of stimuli included in each list, the information provided about each stimulus set, the user friendliness of the list, and the degree of overlap among lists. This guide should aid researchers in finding the most appropriate stimuli for their research and is now publicly available on the Open Science Framework: https://osf.io/mbqt3/.

This working group is also currently surveying ManyFaces members about any face stimulus sets they have and are willing to share directly with other researchers, with the aim of compiling a guide to stimulus sets that cannot be found via existing lists and databases.

Stimulus Collection

Face image sets tend to suffer from one or more of:

  1. a lack of age and ethnic diversity
  2. insufficient diversity of poses or expressions
  3. a lack of standardisation (e.g., different lighting, backgrounds, camera-to-head distance, and other photographic properties) that makes it impossible to combine image sets
  4. restricted ability to share
  5. unethical procurement

Pilot image collection → Protocol refinement → Image collection → Image processing → Perception tests

Protocol Development

github.com/ManyFacesTeam/protocol-dev

Kit (~£800 total)

📷 camera: Canon EOS 250D Digital SLR Camera with 18-55mm IS STM Lens (£649)

💾 memory card: SanDisk 32GB SDHC Card (£9)

🌈 color checker: Calibrite ColorChecker Classic ~A4 (£66)

💡 stand/light: Fovitec Bi-Colour LED Ring Light Kit (£71)

Reproducible Stimuli

DeBruine, L. M., Holzleitner, I. J., Tiddeman, B., & Jones, B. C. (2022). Reproducible Methods for Face Research. PsyArXiv. https://doi.org/10.31234/osf.io/j2754

Vague Methods

Each of the images was rendered in gray-scale and morphed to a common shape using an in-house program based on bi-linear interpolation (see e.g., Gonzalez & Woods, 2002). Key points in the morphing grid were set manually, using a graphics program to align a standard grid to a set of facial points (eye corners, face outline, etc.). Images were then subject to automatic histogram equalization. (Burton et al. 2005, 263)

The reference to Gonzalez, Woods, et al. (2002) is a 190-page textbook. It mentions bilinear interpolation on pages 64–66 in the context of calculating pixel color when resizing images and it’s unclear how this could be used to morph shape.

Photoshop

These pictures were edited using Adobe Photoshop 6.0 to remove external features (hair, ears) and create a uniform grey background. (Sforza et al. 2010, 150)

The averaged composites and blends were sharpened in Adobe Photoshop to reduce any blurring introduced by blending. (Rhodes et al. 2001, 615)

Scriptable Methods

The average pixel intensity of each image (ranging from 0 to 255) was set to 128 with a standard deviation of 40 using the SHINE toolbox (function lumMatch) (Willenbockel et al., 2010) in MATLAB (version 8.1.0.604, R2013a). (Visconti di Oleggio Castello et al. 2014, 2)

We used the GraphicConverter™ application to crop the images around the cat face and make them all 1024x1024 pixels. One of the challenges of image matching is to do this process automatically. (Paluszek and Thomas 2019, 214)

Commercial morphing

The faces were carefully marked with 112 nodes in FantaMorph™, 4th version: 28 nodes (face outline), 16 (nose), 5 (each ear), 20 (lips), 11 (each eye), and 8 (each eyebrow). To create the prototypes, I used FantaMorph Face Mixer, which averages node locations across faces. Prototypes are available online, in the Personality Faceaurus [http://www.nickholtzman.com/faceaurus.htm]. (Holtzman 2011a, 650)

WebMorphR

https://debruine.github.io/webmorphR/

orig <- demo_stim() # load demo images
mirrored <- mirror(orig)
cropped  <- crop(orig, width = 0.75, height = 0.75)
resized  <- resize(orig, 0.75)
rotated  <- rotate(orig, degrees = c(90, 180))
padded   <- pad(orig, 30, fill = c("hotpink", "dodgerblue"))
grey     <- greyscale(orig)

Templates

Masking

demo_stim() |> mask(fill = "black")

Custom Mask

demo_stim() |>  mask(mask = c("eyes", "mouth"), 
                     fill = "#00000099", 
                     reverse = TRUE)

“Standard” Oval Mask

demo_stim() |> 
  greyscale() |>
  subset_tem(features("face")) |> # ignore hair, neck and ears
  crop_tem(50) |>                 # crop to 50px around template
  mask_oval(fill = "grey40")

Alignment

faces <- load_stim_neutral(22:26) 
aligned <- faces |> align(fill = "dodgerblue")

c(faces, aligned) |> plot(nrow = 2)

Images are aligned by default to the average x- and y-coordinates of the two alignment points, but you can specify the coordinates and output width and height manually or from a reference image. You can also specify 1-point alignment, which does not resize or rotate the images. Procrustes alignment is available on platforms with OpenGL.

Alignment with Patch Fill

faces |> align(fill = patch(faces))

Composites

neu_orig <- load_stim_neutral() |>
  add_info(webmorphR.stim::london_info) |>
  subset(face_gender == "female") |> 
  subset(face_eth == "black") |> subset(1:5) 

smi_orig <- load_stim_smiling() |>
  add_info(webmorphR.stim::london_info) |>
  subset(face_gender == "female") |> 
  subset(face_eth == "black") |> subset(1:5)

all <- c(neu_orig, smi_orig) |>
  auto_delin("dlib70", replace = TRUE)

aligned <- all |>
  align(procrustes = TRUE, fill = patch(all)) |>
  crop(.6, .8, y_off = 0.05)

neu_avg <- subset(aligned, 1:5) |> avg(texture = FALSE)
smi_avg <- subset(aligned, 6:10) |> avg(texture = FALSE)

Composites

Continuum

steps <- continuum(
  from_img = neu_avg, 
  to_img = smi_avg, 
  from = -0.5, 
  to = 1.5, 
  by = 0.25
)

Word Stimuli

# make a vector of the words and colours they should print in
colours <- c(red = "red3", 
             green = "darkgreen", 
             blue = "dodgerblue3")

# make vector of labels (each word in each colour)
labels <- names(colours) |> rep(each = 3)

# make blank 800x200px images and add labels
stroop <- blank(3*3, 800, 200) |>
  label(labels, 
        size = 100, 
        color = colours, 
        weight = 700,
        gravity = "center")

Face Composites

DeBruine, L. M. (2023). The Composite Method Produces High False Positive Rates. PsyArXiv. https://doi.org/10.31234/osf.io/htrg9

Composites

Raters chose the composite of people who self-reported a high probability of cooperating in a prisoner’s dilemma as the more likely to cooperate about 62% of the time (Little et al. 2013)

Women’s Height

n <- 10
mean <- 162
sd <- 7

# for reproducible simulation
set.seed(42) 

odd  <- rnorm(n, mean, sd)
even <- rnorm(n, mean, sd)

A t-test shows no significant difference (\(t_{13.42}\) = 1.23, \(p\) = .121, \(d\) = 0.55), which is unsurprising. We simulated the data from the same distribution, so we know for sure there is no real difference here.
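The test behind those numbers isn’t shown above; a minimal sketch (repeating the simulation so it runs standalone, and assuming a one-tailed Welch test to match the directional prediction):

```r
# re-run the height simulation
set.seed(42)
odd  <- rnorm(10, 162, 7)
even <- rnorm(10, 162, 7)

# one-tailed Welch t-test: are odd-birthday women taller?
t.test(odd, even, alternative = "greater")
```

As expected for two samples from the same distribution, the p-value is non-significant.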

Composite Height

Odd Composite: 🧍‍♀️ 165.8 cm
Even Composite: 🧍‍♀️ 160.9 cm

Now we’re going to average the height of the women with odd and even birthdays. So if we create a full-body composite of women born on odd days, she would be 165.8 cm tall, and a composite of women born on even days would be 160.9 cm tall.

We know this difference is entirely due to chance, but if we ask raters to look at these two composites, side-by-side, and judge which one looks taller, what do you imagine would happen? It’s likely all of them would judge the odd-birthday composite as taller. You only need 5 raters for statistical significance (with alpha = 0.05) on an exact binomial test.
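The five-rater claim can be checked with base R’s exact binomial test: if all five raters pick the same composite, the one-tailed p-value is 0.5^5.

```r
# 5 of 5 raters choose the odd-birthday composite as taller
binom.test(5, 5, p = 0.5, alternative = "greater")$p.value
#> 0.03125
```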

Rating Height

rater_n <- 50 # number of raters
error_sd <- 10 # rater error

# for reproducible simulation
set.seed(1) 

# add the error to the composite mean heights
odd_est  <- mean(odd) + 
  rnorm(rater_n, 0, error_sd)
even_est <- mean(even) + 
  rnorm(rater_n, 0, error_sd)

But let’s say that raters have to judge the composites independently, and they are pretty bad at height estimation, so their estimates for each composite have error with a standard deviation of 10 cm. We can simulate such ratings from 50 raters and then compare the estimates for the odd-birthday composite with the estimates for the even-birthday composite.

Now the women with odd birthdays are significantly taller than the women with even birthdays (\(t_{49}\) = 2.61, \(p\) = .006, \(d\) = 0.53)!

What changed? Essentially, we’re no longer testing whether women born on odd days are taller than those born on even days, but whether raters can perceive the chance difference in height between the pair of composites. As long as there is any difference between the composites that exceeds the perceptual threshold for detection, we can find a significant result with enough raters.

The effect has a 50% chance of being in the predicted direction, and whatever result we find with this pair is likely to be highly replicable in a new set of raters rating the same pair.

Is this a fluke?

Maybe this is just a fluke of the original sample? We can repeat the procedure 10,000 times and check the p-values of the individual analysis versus the composite method. The individual method has the expected uniform distribution of p-values, as there is no difference between the two groups: the proportion of false positives is 4.97%, close to the alpha criterion of 0.05. However, the composite method produced a false positive rate of 18.4% with a directional hypothesis, and 28.1% with a non-directional hypothesis. And as we’ll see later, you can increase the false positive rate to near 50% for directional hypotheses and 100% for non-directional hypotheses by increasing the number of raters.

Individual versus composite method. The individual method shows the expected uniform distribution of p-values, while the composite method has an inflated false positive rate.
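A condensed version of that simulation can be sketched as follows (parameter values assumed from the height example; 1,000 replicates rather than 10,000 to keep the runtime short, and two-sided tests, so the composite rate corresponds to the non-directional case):

```r
set.seed(2023)  # arbitrary seed for reproducibility

pvals <- replicate(1000, {
  # two samples from the SAME distribution, so any difference is chance
  odd  <- rnorm(10, 162, 7)
  even <- rnorm(10, 162, 7)

  # individual method: compare the two groups of women directly
  p_ind <- t.test(odd, even)$p.value

  # composite method: 50 raters estimate each composite's mean with error
  odd_est  <- mean(odd)  + rnorm(50, 0, 10)
  even_est <- mean(even) + rnorm(50, 0, 10)
  p_comp <- t.test(odd_est, even_est)$p.value

  c(ind = p_ind, comp = p_comp)
})

mean(pvals["ind", ] < .05)   # close to the nominal .05
mean(pvals["comp", ] < .05)  # substantially inflated
```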

Dark Triad

A recent paper by Alper, Bayrak, and Yilmaz (2021) used faces from the Faceaurus database (Holtzman 2011b) to test whether dark triad personality traits (Machiavellianism, narcissism, and psychopathy) are visible in the face. “Holtzman (2011) standardized the assessment scores, computed average scores of self- and peer-reports, and ranked the face images based on the resulting scores. Then, prototypes for each of the personality dimensions were created by digitally combining 10 faces with the highest, and 10 faces with the lowest scores on the personality trait in question (Holtzman, 2011).” This was done separately for male and female faces.

Holtzman (2011a), replicated by Alper, Bayrak, and Yilmaz (2021)

Simulate Composite Ratings

Following Holtzman (2011a), we simulated 100 sets of 6 “image pairs” with no actual difference in appearance, and 105 raters giving -5 to +5 ratings for which face in each pair looks more Machiavellian, narcissistic, or psychopathic. By chance alone, some of the values will be significant in the predicted direction.

More Raters?

A naive solution to this problem is to increase the number of raters, which should produce more accurate results, right? Actually, this makes the problem even worse. As you increase the number of raters, the power to detect even small (chance) differences in composites rises (Figure 8). Consequently, you can virtually guarantee significant results, even for tiny differences or traits that people are very bad at estimating.
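One way to see this is a power calculation: hold a (chance) composite difference fixed and increase the number of raters. A sketch using base R’s power.t.test, assuming a 3 cm chance difference between composites and a 10 cm rater error SD, as in the height example:

```r
# power to detect a fixed chance difference as rater numbers grow
raters <- c(25, 50, 100, 200, 400)
power <- sapply(raters, function(n) {
  power.t.test(n = n, delta = 3, sd = 10, sig.level = .05)$power
})
round(power, 2)  # climbs steadily towards 1
```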

Composite Size

With only 10 stimuli per composite (like the Faceaurus composites), the median unsigned difference between composites from populations with no real difference is 0.31 SD.

How likely is it that there will be chance differences in the composites big enough to be a problem? More likely than you probably think, especially when there are a small number of stimuli in each composite. The smaller the number of stimuli that go into each composite, the larger the median (unsigned) size of this difference (Figure 9). With only 10 stimuli per composite (like the Faceaurus composites), the median unsigned effect size of the difference between composites from populations with no real difference is 0.31 (in units of SD of the original trait distribution). If our raters are accurate enough at perceiving this difference, or we run a very large number of raters, we are virtually guaranteed to find significant results every time. There is a 50% chance that these results will be in the predicted direction, and this direction will be replicable across different samples of raters for the same image set.
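The 0.31 figure can be sanity-checked with a quick simulation under normality: draw two samples of 10 from a standard normal and take the absolute difference of their means (a back-of-envelope check, not the full composite procedure):

```r
set.seed(42)
diffs <- replicate(1e4, abs(mean(rnorm(10)) - mean(rnorm(10))))
median(diffs)  # ~0.30 SD, in line with the 0.31 reported above
```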

Mixed Effects Models

DeBruine LM, Barr DJ. (2021). Understanding Mixed-Effects Models Through Data Simulation. Advances in Methods and Practices in Psychological Science. 4(1). https://doi.org/10.1177/2515245920965119

Random Composites

Five random pairs of composites from a sample of 20 faces (10 in each composite). Can you spot any differences?

Thank You!

debruine.github.io/talks/rep-gen-faces/

code

tech.lgbt/@debruine

References

Alper, Sinan, Fatih Bayrak, and Onurcan Yilmaz. 2021. “All the Dark Triad and Some of the Big Five Traits Are Visible in the Face.” Personality and Individual Differences 168: 110350. https://doi.org/10.1016/j.paid.2020.110350.
Burton, A Mike, Rob Jenkins, Peter JB Hancock, and David White. 2005. “Robust Representations for Face Recognition: The Power of Averages.” Cognitive Psychology 51 (3): 256–84.
DeBruine, Lisa M. 2023. “The Composite Method Produces High False Positive Rates.” PsyArXiv. https://doi.org/10.31234/osf.io/htrg9.
DeBruine, Lisa M., and Dale J. Barr. 2021. “Understanding Mixed-Effects Models Through Data Simulation.” Advances in Methods and Practices in Psychological Science 4 (1): 2515245920965119.
DeBruine, Lisa M, Iris J Holzleitner, Bernard Tiddeman, and Benedict C Jones. 2022. “Reproducible Methods for Face Research.” PsyArXiv. https://doi.org/10.31234/osf.io/j2754.
Gonzalez, Rafael C, Richard E Woods, et al. 2002. “Digital Image Processing.” Prentice Hall Upper Saddle River, NJ. https://www.pearson.com/us/higher-education/product/Gonzalez-Digital-Image-Processing-2nd-Edition/9780201180756.html.
Holtzman, Nicholas S. 2011a. “Facing a Psychopath: Detecting the Dark Triad from Emotionally-Neutral Faces, Using Prototypes from the Personality Faceaurus.” Journal of Research in Personality 45 (6): 648–54.
———. 2011b. “Facing a Psychopath: Detecting the Dark Triad from Emotionally-Neutral Faces, Using Prototypes from the Personality Faceaurus.” Journal of Research in Personality 45 (6): 648–54.
Little, Anthony C., Benedict C. Jones, Lisa M. DeBruine, and Robin I. M. Dunbar. 2013. “Accuracy in Discrimination of Self-Reported Cooperators Using Static Facial Information.” Personality and Individual Differences 54: 507–12. https://doi.org/10.1016/j.paid.2012.10.018.
Paluszek, Michael, and Stephanie Thomas. 2019. “Pattern Recognition with Deep Learning.” In MATLAB Machine Learning Recipes, 209–30. Springer.
Rhodes, Gillian, Sakiko Yoshikawa, Alison Clark, Kieran Lee, Ryan McKay, and Shigeru Akamatsu. 2001. “Attractiveness of Facial Averageness and Symmetry in Non-Western Cultures: In Search of Biologically Based Standards of Beauty.” Perception 30 (5): 611–25. https://doi.org/10.1068/p3123.
Sforza, Anna, Ilaria Bufalari, Patrick Haggard, and Salvatore M Aglioti. 2010. “My Face in Yours: Visuo-Tactile Facial Stimulation Influences Sense of Identity.” Social Neuroscience 5 (2): 148–62.
Todorov, Alexander, Chris P Said, Andrew D Engell, and Nikolaas N Oosterhof. 2008. “Understanding Evaluation of Faces on Social Dimensions.” Trends in Cognitive Sciences 12 (12): 455–60.
Visconti di Oleggio Castello, Matteo, J Swaroop Guntupalli, Hua Yang, and M Ida Gobbini. 2014. “Facilitated Detection of Social Cues Conveyed by Familiar Faces.” Frontiers in Human Neuroscience 8: 678.