Replicability and Generalisability in Face Research

debruine.github.io/talks/rep-gen-faces/

Lisa DeBruine
tech.lgbt/@debruine

Abstract

In this talk, I will discuss several initiatives to increase the replicability and generalisability of research on faces, with a special focus on big team science efforts, such as the Psychological Science Accelerator and ManyFaces. I will also make an argument for reproducible stimulus construction and introduce webmorphR, an R package for reproducibly scripting face stimulus creation. Additionally, I will explain how a common methodology in face research, the composite method, produces very high false positive rates, and describe alternatives, including the use of mixed effects models for analysing individual face ratings.

Psychological Science Accelerator

Jones, B.C., DeBruine, L.M., Flake, J.K. et al. (2021). To which world regions does the valence–dominance model of social perception apply? Nature Human Behaviour 5, 159–169. https://doi.org/10.1038/s41562-020-01007-2

Which face looks more trustworthy?

Which face looks more trustworthy?

Which face looks more responsible?

Which face looks more responsible?

Which face looks more dominant?

Which face looks more dominant?

Which face looks more intelligent?

Which face looks more intelligent?

Which face looks more caring?

Which face looks more caring?

Face Ratings

  1. Attractive
  2. Weird
  3. Mean
  4. Trustworthy
  5. Aggressive
  6. Caring
  7. Emotionally stable
  8. Unhappy
  9. Responsible
  10. Sociable
  11. Dominant
  12. Confident
  13. Intelligent

Todorov et al. (2008)

Valence-Dominance Model

How sociable (i.e., friendly or agreeable in company; companionable) is this person?
not at all
very

Hoe sociabel (d.w.z. vriendelijk of prettig in de omgang, gezellig) is deze persoon?
helemaal niet
helemaal erg

(The same question and scale anchors in Dutch, one of the study's 25 languages.)

Study Stats

  • >3M data points
  • 12,660 participants
  • 11,570 post-exclusion
  • 126 labs
  • 44 countries
  • 25 languages
  • 243 authors

Team

Ben Jones

Lisa DeBruine

Jess Flake

Patrick Forscher

Nicholas Coles

Chris Chartier

CRediT

PCA Loadings

Original Data

Western Europe

PCA Loadings

Principal Components Analysis shows little regional variability

EFA Loadings

Exploratory Factor Analysis shows more regional variability

Secondary Data Challenge

  • Examining the “attractiveness halo effect” - Carlotta Batres, Victor Shiramizu (Current Psychology)
  • Region- and Language-Level ICCs for Judgments of Faces - Neil Hester and Eric Hehman (Psychological Science)
  • Variance & Homogeneity of Facial Trait Space Across World Regions - Sally Xie and Eric Hehman (Psychological Science)
  • The Facial Width-to-Height Ratio (fWHR) and Perceived Dominance and Trustworthiness: Moderating Role of Social Identity Cues (Gender and Race) and Ecological Factor (Pathogen Prevalence) - Subramanya Prasad Chandrashekar
  • Is facial width-to-height ratio reliably associated with social inferences? A large cross-national examination - Patrick Durkee and Jessica Ayers
  • Population diversity is associated with trustworthiness impressions from faces - Jared Martin, Adrienne Wood, and DongWon Oh (Psychological Science)
  • Do regional gender and racial biases predict gender and racial biases in social face judgments? - DongWon Oh and Alexander Todorov
  • Hierarchical Modelling of Facial Perceptions: A Secondary Analysis of Aggressiveness Ratings - Mark Adkins, Nataly Beribisky, Stefan Bonfield, and Linda Farmus

Blog Post

ManyFaces

https://manyfaces.team

ManyFaces is a recently formed big team science group for face perception and face recognition research.

Broadly, the aim of ManyFaces is to improve, diversify, and crowdsource key aspects of face research, including perception and recognition. This involves, for example, the collection and use of face stimuli: sharing existing stimulus sets, standardising stimulus collection procedures, and organising stimulus collection across multiple labs to obtain larger and more diverse face stimulus sets. ManyFaces also aims to crowdsource data collection across our members’ labs to test key research questions in face perception and recognition, enabling larger-scale designs, more diverse participant samples, and more generalisable findings. Finally, we aim to organise training workshops for key methods (e.g., morphing) and analyses (e.g., mixed effects models) used in face research.

Stimulus Meta-Database

https://osf.io/mbqt3/

The stimulus meta-database working group has compiled a guide to face stimulus meta-databases and resource lists. Various researchers have created lists or meta-databases documenting the broad variety of face stimulus sets that are available for research use. However, these lists vary in how comprehensive they are and in the type of information they provide about each stimulus set. Our guide therefore provides an overview of the most useful of these lists, noting key information such as the kinds of stimuli included in each list, the information provided about each stimulus set, the user friendliness of the list, and the degree of overlap among lists. This guide should aid researchers in finding the most appropriate stimuli for their research and is now publicly available on the Open Science Framework: https://osf.io/mbqt3/.

This working group is also currently surveying ManyFaces members about any face stimulus sets they have and are willing to share directly with other researchers, with the aim of compiling a guide to stimulus sets that cannot be found via existing lists and databases.

Stimulus Collection

Face image sets tend to suffer from one or more of:

  1. a lack of age and ethnic diversity
  2. insufficient diversity of poses or expressions
  3. a lack of standardisation (e.g., different lighting, backgrounds, camera-to-head distance, and other photographic properties) that makes it impossible to combine image sets
  4. restricted ability to share
  5. unethical procurement

Pilot image collection → Protocol refinement → Image collection → Image processing → Perception tests

Protocol Development

github.com/ManyFacesTeam/protocol-dev

Kit (~£800 total)

📷 camera: Canon EOS 250D Digital SLR Camera with 18-55mm IS STM Lens (£649)

💾 memory card: SanDisk 32GB SDHC Card (£9)

🌈 color checker: Calibrite ColorChecker Classic ~A4 (£66)

💡 stand/light: Fovitec Bi-Colour LED Ring Light Kit (£71)

Reproducible Stimuli

DeBruine, L. M., Holzleitner, I. J., Tiddeman, B., & Jones, B. C. (2022). Reproducible Methods for Face Research. PsyArXiv. https://doi.org/10.31234/osf.io/j2754

Vague Methods

Each of the images was rendered in gray-scale and morphed to a common shape using an in-house program based on bi-linear interpolation (see e.g., Gonzalez & Woods, 2002). Key points in the morphing grid were set manually, using a graphics program to align a standard grid to a set of facial points (eye corners, face outline, etc.). Images were then subject to automatic histogram equalization. (Burton et al. 2005, 263)

The reference to Gonzalez, Woods, et al. (2002) is a 190-page textbook. It mentions bilinear interpolation on pages 64–66 in the context of calculating pixel color when resizing images and it’s unclear how this could be used to morph shape.

Photoshop

These pictures were edited using Adobe Photoshop 6.0 to remove external features (hair, ears) and create a uniform grey background. (Sforza et al. 2010, 150)

The averaged composites and blends were sharpened in Adobe Photoshop to reduce any blurring introduced by blending. (Rhodes et al. 2001, 615)

Scriptable Methods

The average pixel intensity of each image (ranging from 0 to 255) was set to 128 with a standard deviation of 40 using the SHINE toolbox (function lumMatch) (Willenbockel et al., 2010) in MATLAB (version 8.1.0.604, R2013a). (Visconti di Oleggio Castello et al. 2014, 2)

We used the GraphicConverter™ application to crop the images around the cat face and make them all 1024x1024 pixels. One of the challenges of image matching is to do this process automatically. (Paluszek and Thomas 2019, 214)

Commercial morphing

The faces were carefully marked with 112 nodes in FantaMorph™, 4th version: 28 nodes (face outline), 16 (nose), 5 (each ear), 20 (lips), 11 (each eye), and 8 (each eyebrow). To create the prototypes, I used FantaMorph Face Mixer, which averages node locations across faces. Prototypes are available online, in the Personality Faceaurus [http://www.nickholtzman.com/faceaurus.htm]. (Holtzman 2011a, 650)

WebMorphR

https://debruine.github.io/webmorphR/

orig <- demo_stim() # load demo images
mirrored <- mirror(orig)
cropped  <- crop(orig, width = 0.75, height = 0.75)
resized  <- resize(orig, 0.75)
rotated  <- rotate(orig, degrees = c(90, 180))
padded   <- pad(orig, 30, fill = c("hotpink", "dodgerblue"))
grey     <- greyscale(orig)

Templates

Masking

demo_stim() |> mask(fill = "black")

Custom Mask

demo_stim() |>  mask(mask = c("eyes", "mouth"), 
                     fill = "#00000099", 
                     reverse = TRUE)

“Standard” Oval Mask

demo_stim() |> 
  greyscale() |>
  subset_tem(features("face")) |> # ignore hair, neck and ears
  crop_tem(50) |>                 # crop to 50px around template
  mask_oval(fill = "grey40")

Alignment

faces <- load_stim_neutral(22:26) 
aligned <- faces |> align(fill = "dodgerblue")

c(faces, aligned) |> plot(nrow = 2)

Images are aligned by default to the average x- and y-coordinates of the two alignment points, but you can specify the coordinates and output width and height manually or from a reference image. You can also specify 1-point alignment, which does not resize or rotate the images. Procrustes alignment is available on platforms with OpenGL.

Alignment with Patch Fill

faces |> align(fill = patch(faces))

Composites

neu_orig <- load_stim_neutral() |>
  add_info(webmorphR.stim::london_info) |>
  subset(face_gender == "female") |> 
  subset(face_eth == "black") |> subset(1:5) 

smi_orig <- load_stim_smiling() |>
  add_info(webmorphR.stim::london_info) |>
  subset(face_gender == "female") |> 
  subset(face_eth == "black") |> subset(1:5)

all <- c(neu_orig, smi_orig) |>
  auto_delin("dlib70", replace = TRUE)

aligned <- all |>
  align(procrustes = TRUE, fill = patch(all)) |>
  crop(.6, .8, y_off = 0.05)

neu_avg <- subset(aligned, 1:5) |> avg(texture = FALSE)
smi_avg <- subset(aligned, 6:10) |> avg(texture = FALSE)

Composites

Continuum

steps <- continuum(
  from_img = neu_avg, 
  to_img = smi_avg, 
  from = -0.5, 
  to = 1.5, 
  by = 0.25
)

Word Stimuli

# make a vector of the words and colours they should print in
colours <- c(red = "red3", 
             green = "darkgreen", 
             blue = "dodgerblue3")

# make vector of labels (each word in each colour)
labels <- names(colours) |> rep(each = 3)

# make blank 800x200px images and add labels
stroop <- blank(3*3, 800, 200) |>
  label(labels, 
        size = 100, 
        color = colours, 
        weight = 700,
        gravity = "center")

Face Composites

DeBruine, L. M. (2023). The Composite Method Produces High False Positive Rates. PsyArXiv. https://doi.org/10.31234/osf.io/htrg9

Composites

Raters chose the composite of people who self-reported a high probability of cooperating in a prisoner’s dilemma as the more likely to cooperate about 62% of the time (Little et al. 2013)

Women’s Height

n <- 10
mean <- 162
sd <- 7

# for reproducible simulation
set.seed(42) 

odd  <- rnorm(n, mean, sd)
even <- rnorm(n, mean, sd)

A t-test shows no significant difference (\(t_{13.42}\) = 1.23, \(p\) = .121, \(d\) = 0.55), which is unsurprising. We simulated the data from the same distribution, so we know for sure there is no real difference here.
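The test behind those numbers isn’t shown above; a minimal sketch (repeating the simulation so it runs standalone, and assuming a one-tailed Welch test to match the directional prediction):

```r
# re-run the height simulation
set.seed(42)
odd  <- rnorm(10, 162, 7)
even <- rnorm(10, 162, 7)

# one-tailed Welch t-test: are odd-birthday women taller?
t.test(odd, even, alternative = "greater")
```

As expected for two samples from the same distribution, the p-value is non-significant.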

Composite Height

Odd Composite: 🧍‍♀️ 165.8 cm
Even Composite: 🧍‍♀️ 160.9 cm

Now we’re going to average the height of the women with odd and even birthdays. So if we create a full-body composite of women born on odd days, she would be 165.8 cm tall, and a composite of women born on even days would be 160.9 cm tall.

We know this difference is entirely due to chance, but if we ask raters to look at these two composites, side-by-side, and judge which one looks taller, what do you imagine would happen? It’s likely all of them would judge the odd-birthday composite as taller. You only need 5 raters for statistical significance (with alpha = 0.05) on an exact binomial test.
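The five-rater claim can be checked with base R’s exact binomial test: if all five raters pick the same composite, the one-tailed p-value is 0.5^5.

```r
# 5 of 5 raters choose the odd-birthday composite as taller
binom.test(5, 5, p = 0.5, alternative = "greater")$p.value
#> 0.03125
```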

Rating Height

rater_n <- 50 # number of raters
error_sd <- 10 # rater error

# for reproducible simulation
set.seed(1) 

# add the error to the composite mean heights
odd_est  <- mean(odd) + 
  rnorm(rater_n, 0, error_sd)
even_est <- mean(even) + 
  rnorm(rater_n, 0, error_sd)

But let’s say that raters have to judge the composites independently, and they are pretty bad at height estimation, so their estimates for each composite have error with a standard deviation of 10 cm. We can simulate such ratings from 50 raters and then compare the estimates for the odd-birthday composite with the estimates for the even-birthday composite.

Now the women with odd birthdays are significantly taller than the women with even birthdays (\(t_{49}\) = 2.61, \(p\) = .006, \(d\) = 0.53)!

What changed? Essentially, we’re no longer testing whether women born on odd days are taller than those born on even days, but whether raters can perceive the chance difference in height between the pair of composites. As long as there is any difference between the composites that exceeds the perceptual threshold for detection, we can find a significant result with enough raters.

The effect has a 50% chance of being in the predicted direction, and whatever result we find with this pair is likely to be highly replicable in a new set of raters rating the same pair.

Is this a fluke?

Maybe this is just a fluke of the original sample? We can repeat the procedure 10,000 times and check the p-values of the individual analysis versus the composite method. The individual method has the expected uniform distribution of p-values, as there is no difference between the two groups: the proportion of false positives is 4.97%, close to the alpha criterion of 0.05. However, the composite method produced a false positive rate of 18.4% with a directional hypothesis, and 28.1% with a non-directional hypothesis. And as we’ll see later, you can increase the false positive rate to near 50% for directional hypotheses and 100% for non-directional hypotheses by increasing the number of raters.

Individual versus composite method. The individual method shows the expected uniform distribution of p-values, while the composite method has an inflated false positive rate.
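A condensed version of that simulation can be sketched as follows (parameter values assumed from the height example; 1,000 replicates rather than 10,000 to keep the runtime short, and two-sided tests, so the composite rate corresponds to the non-directional case):

```r
set.seed(2023)  # arbitrary seed for reproducibility

pvals <- replicate(1000, {
  # two samples from the SAME distribution, so any difference is chance
  odd  <- rnorm(10, 162, 7)
  even <- rnorm(10, 162, 7)

  # individual method: compare the two groups of women directly
  p_ind <- t.test(odd, even)$p.value

  # composite method: 50 raters estimate each composite's mean with error
  odd_est  <- mean(odd)  + rnorm(50, 0, 10)
  even_est <- mean(even) + rnorm(50, 0, 10)
  p_comp <- t.test(odd_est, even_est)$p.value

  c(ind = p_ind, comp = p_comp)
})

mean(pvals["ind", ] < .05)   # close to the nominal .05
mean(pvals["comp", ] < .05)  # substantially inflated
```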

Dark Triad

A recent paper by Alper, Bayrak, and Yilmaz (2021) used faces from the Faceaurus database (Holtzman 2011b) to test whether dark triad personality traits (Machiavellianism, narcissism, and psychopathy) are visible in the face. “Holtzman (2011) standardized the assessment scores, computed average scores of self- and peer-reports, and ranked the face images based on the resulting scores. Then, prototypes for each of the personality dimensions were created by digitally combining 10 faces with the highest, and 10 faces with the lowest scores on the personality trait in question (Holtzman, 2011).” This was done separately for male and female faces.

Holtzman (2011a), replicated by Alper, Bayrak, and Yilmaz (2021)

Simulate Composite Ratings

Following Holtzman (2011a), we simulated 100 sets of 6 “image pairs” with no actual difference in appearance, and 105 raters giving -5 to +5 ratings for which face in each pair looks more Machiavellian, narcissistic, or psychopathic. By chance alone, some of the values will be significant in the predicted direction.

More Raters?

A naive solution to this problem is to increase the number of raters, which should produce more accurate results, right? Actually, this makes the problem even worse. As you increase the number of raters, the power to detect even small (chance) differences in composites rises (Figure 8). Consequently, you can virtually guarantee significant results, even for tiny differences or traits that people are very bad at estimating.
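One way to see this is a power calculation: hold a (chance) composite difference fixed and increase the number of raters. A sketch using base R’s power.t.test, assuming a 3 cm chance difference between composites and a 10 cm rater error SD, as in the height example:

```r
# power to detect a fixed chance difference as rater numbers grow
raters <- c(25, 50, 100, 200, 400)
power <- sapply(raters, function(n) {
  power.t.test(n = n, delta = 3, sd = 10, sig.level = .05)$power
})
round(power, 2)  # climbs steadily towards 1
```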

Composite Size

With only 10 stimuli per composite (like the Faceaurus composites), the median unsigned difference between composites from populations with no real difference is 0.31 SD.

How likely is it that there will be chance differences in the composites big enough to be a problem? More likely than you probably think, especially when there are a small number of stimuli in each composite. The smaller the number of stimuli that go into each composite, the larger the median (unsigned) size of this difference (Figure 9). With only 10 stimuli per composite (like the Faceaurus composites), the median unsigned effect size of the difference between composites from populations with no real difference is 0.31 (in units of SD of the original trait distribution). If our raters are accurate enough at perceiving this difference, or we run a very large number of raters, we are virtually guaranteed to find significant results every time. There is a 50% chance that these results will be in the predicted direction, and this direction will be replicable across different samples of raters for the same image set.
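The 0.31 figure can be sanity-checked with a quick simulation under normality: draw two samples of 10 from a standard normal and take the absolute difference of their means (a back-of-envelope check, not the full composite procedure):

```r
set.seed(42)
diffs <- replicate(1e4, abs(mean(rnorm(10)) - mean(rnorm(10))))
median(diffs)  # ~0.30 SD, in line with the 0.31 reported above
```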

Mixed Effects Models

DeBruine LM, Barr DJ. (2021). Understanding Mixed-Effects Models Through Data Simulation. Advances in Methods and Practices in Psychological Science. 4(1). https://doi.org/10.1177/2515245920965119

Random Composites

Five random pairs of composites from a sample of 20 faces (10 in each composite). Can you spot any differences?

Thank You!

debruine.github.io/talks/rep-gen-faces/

code

tech.lgbt/@debruine

References

Alper, Sinan, Fatih Bayrak, and Onurcan Yilmaz. 2021. “All the Dark Triad and Some of the Big Five Traits Are Visible in the Face.” Personality and Individual Differences 168: 110350. https://doi.org/10.1016/j.paid.2020.110350.
Burton, A Mike, Rob Jenkins, Peter JB Hancock, and David White. 2005. “Robust Representations for Face Recognition: The Power of Averages.” Cognitive Psychology 51 (3): 256–84.
DeBruine, Lisa M. 2023. “The Composite Method Produces High False Positive Rates.” PsyArXiv. https://doi.org/10.31234/osf.io/htrg9.
DeBruine, Lisa M., and Dale J. Barr. 2021. “Understanding Mixed-Effects Models Through Data Simulation.” Advances in Methods and Practices in Psychological Science 4 (1): 2515245920965119.
DeBruine, Lisa M, Iris J Holzleitner, Bernard Tiddeman, and Benedict C Jones. 2022. “Reproducible Methods for Face Research.” PsyArXiv. https://doi.org/10.31234/osf.io/j2754.
Gonzalez, Rafael C, Richard E Woods, et al. 2002. “Digital Image Processing.” Prentice Hall Upper Saddle River, NJ. https://www.pearson.com/us/higher-education/product/Gonzalez-Digital-Image-Processing-2nd-Edition/9780201180756.html.
Holtzman, Nicholas S. 2011a. “Facing a Psychopath: Detecting the Dark Triad from Emotionally-Neutral Faces, Using Prototypes from the Personality Faceaurus.” Journal of Research in Personality 45 (6): 648–54.
———. 2011b. “Facing a Psychopath: Detecting the Dark Triad from Emotionally-Neutral Faces, Using Prototypes from the Personality Faceaurus.” Journal of Research in Personality 45 (6): 648–54.
Little, Anthony C., Benedict C. Jones, Lisa M. DeBruine, and Robin I. M. Dunbar. 2013. “Accuracy in Discrimination of Self-Reported Cooperators Using Static Facial Information.” Personality and Individual Differences 54: 507–12. https://doi.org/10.1016/j.paid.2012.10.018.
Paluszek, Michael, and Stephanie Thomas. 2019. “Pattern Recognition with Deep Learning.” In MATLAB Machine Learning Recipes, 209–30. Springer.
Rhodes, Gillian, Sakiko Yoshikawa, Alison Clark, Kieran Lee, Ryan McKay, and Shigeru Akamatsu. 2001. “Attractiveness of Facial Averageness and Symmetry in Non-Western Cultures: In Search of Biologically Based Standards of Beauty.” Perception 30 (5): 611–25. https://doi.org/10.1068/p3123.
Sforza, Anna, Ilaria Bufalari, Patrick Haggard, and Salvatore M Aglioti. 2010. “My Face in Yours: Visuo-Tactile Facial Stimulation Influences Sense of Identity.” Social Neuroscience 5 (2): 148–62.
Todorov, Alexander, Chris P Said, Andrew D Engell, and Nikolaas N Oosterhof. 2008. “Understanding Evaluation of Faces on Social Dimensions.” Trends in Cognitive Sciences 12 (12): 455–60.
Visconti di Oleggio Castello, Matteo, J Swaroop Guntupalli, Hua Yang, and M Ida Gobbini. 2014. “Facilitated Detection of Social Cues Conveyed by Familiar Faces.” Frontiers in Human Neuroscience 8: 678.