Fake It Until You Make It

How and why to simulate research data

Lisa DeBruine

Abstract

debruine.github.io/talks/EMPSEB-fake-it-2023/

Being able to simulate data allows you to prep analysis scripts for pre-registration, calculate power and sensitivity for analyses that don’t have empirical methods, create reproducible examples when your data are too big or confidential to share, enhance your understanding of statistical concepts, and create demo data for teaching and tutorials. This workshop will cover the basics of simulation using the R package {faux}. We will simulate data with factorial designs by specifying the within- and between-subjects factor structure, each cell mean and standard deviation, and correlations between cells where appropriate. These simulated data sets can be used to prepare the analysis code for pre-registrations or registered reports. We will also create data sets for simulation-based power analyses.

Why Simulate Data?

Pre-Registration

Prep analysis scripts for pre-registration

Power

Calculate power and sensitivity for analyses that don’t have empirical methods

Reproducible Examples

Create reproducible examples when your data are too big or confidential to share

Enhance Understanding

Enhance your understanding of statistical concepts

Teaching Data

Create demo data for teaching and tutorials

Faux

rstudio-connect.psy.gla.ac.uk/faux/

Faux Code

sim_data <- faux::sim_design(
  within = list(version = c(V1 = "Version 1", V2 = "Version 2"), 
                condition = c(ctl = "Control", exp = "Experimental")),
  between = list(age_group = c(young = "Age 20-29", old = "Age 70-79")),
  n = 30,
  mu = c(100, 100, 100, 100, 100, 90, 110, 110),
  sd = 20,
  r = 0.5,
  dv = c(score = "Score"),
  id = c(id = "Subject ID"),
  vardesc = list(version = "Task Version", 
                 condition = "Experiment Condition", 
                 age_group = "Age Group"),
  long = TRUE
)
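
One way to confirm that a simulated data set matches the requested design is to summarise each cell. The sketch below is an illustration rather than part of the original slides: it re-creates the design with unlabelled factor levels, and the seed, the `check_data` and `cell_stats` names, and `plot = FALSE` are choices made here. Because the data are sampled, the cell means and SDs will only approximate the requested values.

```r
library(dplyr)

set.seed(8675309)  # arbitrary seed, only so the sketch is repeatable

# Same design as above, without level labels and without the design plot
check_data <- faux::sim_design(
  within = list(version = c("V1", "V2"),
                condition = c("ctl", "exp")),
  between = list(age_group = c("young", "old")),
  n = 30,
  mu = c(100, 100, 100, 100, 100, 90, 110, 110),
  sd = 20,
  r = 0.5,
  dv = "score",
  id = "id",
  long = TRUE,
  plot = FALSE
)

# One row per cell: observed n, mean and SD of the simulated scores
cell_stats <- check_data |>
  group_by(age_group, version, condition) |>
  summarise(n = n(),
            mean = mean(score),
            sd = sd(score),
            .groups = "drop")

cell_stats
```

Each of the 8 cells should contain 30 observations, with means scattered around the `mu` values and SDs around 20.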

Faux Design Parameters

age_group  version  condition  V1_ctl  V1_exp  V2_ctl  V2_exp   n   mu  sd
young      V1       ctl           1.0     0.5     0.5     0.5  30  100  20
young      V1       exp           0.5     1.0     0.5     0.5  30  100  20
young      V2       ctl           0.5     0.5     1.0     0.5  30  100  20
young      V2       exp           0.5     0.5     0.5     1.0  30  100  20
old        V1       ctl           1.0     0.5     0.5     0.5  30  100  20
old        V1       exp           0.5     1.0     0.5     0.5  30   90  20
old        V2       ctl           0.5     0.5     1.0     0.5  30  110  20
old        V2       exp           0.5     0.5     0.5     1.0  30  110  20
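
The correlation columns and the SD column of this table together imply a covariance matrix for the four within-subject cells. The base R sketch below only illustrates that arithmetic for one age group; the variable names are chosen here, and it is not a description of how {faux} works internally.

```r
# Cell labels follow the within-subject columns of the design table
cells <- c("V1_ctl", "V1_exp", "V2_ctl", "V2_exp")
r <- 0.5       # correlation between every pair of within-subject cells
cell_sd <- 20  # common standard deviation of every cell

# Correlation matrix: r everywhere, 1 on the diagonal
cor_mat <- matrix(r, nrow = 4, ncol = 4, dimnames = list(cells, cells))
diag(cor_mat) <- 1

# Covariance = correlation * sd_i * sd_j; with equal SDs this is cor_mat * sd^2
cov_mat <- cor_mat * cell_sd^2
cov_mat
```

With these values each variance is 20^2 = 400 and each covariance is 0.5 * 400 = 200.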

Faux Design Plot

Faux Data Plot

Power Simulation: Replicate Data

sim_data <- faux::sim_design(
  within = list(version = c(V1 = "Version 1", V2 = "Version 2"), 
                condition = c(ctl = "Control", exp = "Experimental")),
  between = list(age_group = c(young = "Age 20-29", old = "Age 70-79")),
  n = 30,
  mu = c(100, 100, 100, 100, 100, 90, 110, 110),
  sd = 20,
  r = 0.5,
  dv = c(score = "Score"),
  id = c(id = "Subject ID"),
  vardesc = list(version = "Task Version", 
                 condition = "Experiment Condition", 
                 age_group = "Age Group"),
  long = TRUE,
  rep = 100
)
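
With `rep` greater than 1, the result is expected to be a nested table with one row per replicate and each simulated data set stored in a `data` list-column. The sketch below checks that structure on a deliberately small design so it runs quickly; the 2 x 2 design, the seed, the `small_sim` name and `plot = FALSE` are assumptions made for this illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# A small 2 (between) x 2 (within) design with 3 replicates
small_sim <- faux::sim_design(
  within = list(condition = c("ctl", "exp")),
  between = list(age_group = c("young", "old")),
  n = 10,
  mu = c(100, 100, 100, 90),
  sd = 20,
  r = 0.5,
  long = TRUE,
  rep = 3,
  plot = FALSE
)

nrow(small_sim)            # one row per replicate
names(small_sim)           # should include the list-column `data`
head(small_sim$data[[1]])  # first simulated data set, in long format
```

Each element of `data` is a complete data set, here with 10 subjects per age group and 2 rows per subject.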

Power Simulation: Analysis Function

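# as_tibble(), rename() and the pipeline verbs used in the power simulation
library(tidyverse)

# setup options to avoid annoying afex message & run faster
afex::set_sum_contrasts()
afex::afex_options(include_aov = FALSE) 

# run the mixed ANOVA on one data set and return one row per term
analysis <- function(data) {
  a <- afex::aov_ez(
    id = "id", 
    dv = "score", 
    between = "age_group",
    within = c("version", "condition"),
    data = data)
  
  # tidy the ANOVA table and give the p-value column a simpler name
  as_tibble(a$anova_table, rownames = "term") |>
    rename(p = `Pr(>F)`)
}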

Power Simulation: Analysis Result

# test on first data set
analysis(sim_data$data[[1]])

term                         num Df  den Df      MSE      F    ges      p
age_group                         1      58  1,209.8   0.02  0.000  0.891
version                           1      58    249.6  20.28  0.047  0.000
age_group:version                 1      58    249.6  12.23  0.029  0.001
condition                         1      58    174.7   5.69  0.009  0.020
age_group:condition               1      58    174.7   6.93  0.012  0.011
version:condition                 1      58    154.8   0.97  0.001  0.328
age_group:version:condition       1      58    154.8   0.60  0.001  0.440

Power Simulation

power <- sim_data |>
  mutate(analysis = purrr::map(data, analysis)) |>
  select(-data) |>
  unnest(analysis) |>
  group_by(term) |>
  summarise(power = mean(p < .05))

term                         power
age_group                     0.11
age_group:condition           0.21
age_group:version             1.00
age_group:version:condition   0.25
condition                     0.32
version                       0.98
version:condition             0.32
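
Each power value is a proportion out of 100 simulated data sets, so it carries Monte Carlo error of its own. The base R sketch below is an addition to the slides: it applies the usual binomial standard error, sqrt(p * (1 - p) / reps), to the estimates above and builds a rough normal-approximation interval. The `power_tbl` name and the 1.96 multiplier are choices made here, and the approximation is poor for estimates near 0 or 1.

```r
reps <- 100  # number of simulated data sets (rep = 100 in the simulation)

# Power estimates copied from the table above
power_tbl <- data.frame(
  term  = c("age_group", "age_group:condition", "age_group:version",
            "age_group:version:condition", "condition", "version",
            "version:condition"),
  power = c(0.11, 0.21, 1.00, 0.25, 0.32, 0.98, 0.32)
)

# Binomial standard error of each estimated proportion
power_tbl$se <- sqrt(power_tbl$power * (1 - power_tbl$power) / reps)

# Rough 95% interval, clipped to the valid range for a proportion
power_tbl$lower <- pmax(0, power_tbl$power - 1.96 * power_tbl$se)
power_tbl$upper <- pmin(1, power_tbl$power + 1.96 * power_tbl$se)

power_tbl
```

Increasing `rep` narrows these intervals, at the cost of a longer-running simulation.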

Further Resources

PsyPag Simulation Summer School

Data Simulation Workshops

Thank You!

debruine.github.io/talks/EMPSEB-fake-it-2023/

Workshop Materials: tinyurl.com/data-sim

Prerequisites: Students will need very basic knowledge of R and familiarity with R Markdown. They should have R and RStudio installed on their laptop, along with the packages {faux}, {afex}, {broom} and {tidyverse} from CRAN.