Replicability of results

in the context of private non-sharable data

https://debruine.github.io/talks/rostock-datasim/

Lisa DeBruine

Abstract

Being able to simulate data allows you to prep analysis scripts for pre-registration, calculate power and sensitivity for analyses that don’t have empirical methods, create reproducible examples when your data are too big or confidential to share, enhance your understanding of statistical concepts, and create demo data for teaching and tutorials. This talk will cover the basics of simulation using the R package {faux} to simulate data with factorial designs by specifying the within- and between-subjects factor structure, each cell mean and standard deviation, and correlations between cells where appropriate. We will also explore the R package {synthpop} to simulate data from existing datasets.

Why Simulate Data?

Pre-Registration

Prep analysis scripts for pre-registration or registered reports

Power

Calculate power and sensitivity for analyses that don’t have empirical methods

Reproducible Examples

Create reproducible examples when your data are too big or confidential to share

Enhance Understanding

Enhance your understanding of statistical or other complex concepts

Teaching Data

Create demo data for teaching and tutorials

Faux

Shiny app or R package

Faux Code

sim_data <- faux::sim_design(
  within = list(version = c(V1 = "Version 1", V2 = "Version 2"), 
                condition = c(ctl = "Control", exp = "Experimental")),
  between = list(age_group = c(young = "Age 20-29", old = "Age 70-79")),
  n = 30,
  mu = c(100, 100, 100, 100, 100, 90, 110, 110),
  sd = 20,
  r = 0.5,
  dv = c(score = "Score"),
  id = c(id = "Subject ID"),
  vardesc = list(version = "Task Version", 
                 condition = "Experiment Condition", 
                 age_group = "Age Group"),
  long = TRUE
)

Faux Design Parameters

age_group	version	condition	V1_ctl	V1_exp	V2_ctl	V2_exp	n	mu	sd
young	V1	ctl	1.0	0.5	0.5	0.5	30	100	20
young	V1	exp	0.5	1.0	0.5	0.5	30	100	20
young	V2	ctl	0.5	0.5	1.0	0.5	30	100	20
young	V2	exp	0.5	0.5	0.5	1.0	30	100	20
old	V1	ctl	1.0	0.5	0.5	0.5	30	100	20
old	V1	exp	0.5	1.0	0.5	0.5	30	90	20
old	V2	ctl	0.5	0.5	1.0	0.5	30	110	20
old	V2	exp	0.5	0.5	0.5	1.0	30	110	20
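
The table above implies, for each age group, a 4 x 4 correlation matrix with 1 on the diagonal and r = 0.5 everywhere else. As a rough illustration of what that design means (this is not a description of how {faux} works internally), the sketch below builds that matrix in base R and draws 30 correlated observations for the "old" group using a Cholesky factor. The object names `cells`, `cor_mat`, `sigma`, and `old_wide` are made up for this example.

```r
set.seed(8675309)  # arbitrary seed so the sketch is repeatable

cells <- c("V1_ctl", "V1_exp", "V2_ctl", "V2_exp")
mu    <- c(100, 90, 110, 110)  # cell means for the "old" group, from the table
sd    <- 20
r     <- 0.5
n     <- 30

# correlation matrix: 1 on the diagonal, r everywhere else
cor_mat <- matrix(r, nrow = 4, ncol = 4, dimnames = list(cells, cells))
diag(cor_mat) <- 1

# covariance = correlation * sd_i * sd_j (all SDs are equal here)
sigma <- cor_mat * sd^2

# draw standard normals, then impose the covariance via the Cholesky factor
z <- matrix(rnorm(n * 4), nrow = n, ncol = 4)
old_wide <- sweep(z %*% chol(sigma), 2, mu, "+")
colnames(old_wide) <- cells

round(colMeans(old_wide), 1)  # sample means: near mu, but not exact
round(cor(old_wide), 2)       # sample correlations: near 0.5, but not exact
```

The sample means and correlations will differ from the specified parameters because of sampling error, which is exactly what a power simulation needs.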

Faux Design Plot

sim_data |> get_design() |> plot()

Faux Data Plot

sim_data |> plot(geoms = c("violin", "pointrangeSE"))

Power Simulation: Replicate Data

sim_data <- faux::sim_design(
  within = list(version = c(V1 = "Version 1", V2 = "Version 2"), 
                condition = c(ctl = "Control", exp = "Experimental")),
  between = list(age_group = c(young = "Age 20-29", old = "Age 70-79")),
  n = 30,
  mu = c(100, 100, 100, 100, 100, 90, 110, 110),
  sd = 20,
  r = 0.5,
  dv = c(score = "Score"),
  id = c(id = "Subject ID"),
  vardesc = list(version = "Task Version", 
                 condition = "Experiment Condition", 
                 age_group = "Age Group"),
  long = TRUE,
  rep = 100
)

Power Simulation: Analysis Function

# set up options to suppress the afex contrasts message and run faster
afex::set_sum_contrasts()
afex::afex_options(include_aov = FALSE) 

analysis_func <- function(data) {
  a <- afex::aov_ez(
    id = "id", 
    dv = "score", 
    between = "age_group",
    within = c("version", "condition"),
    data = data)
  
  # namespace prefixes so the function runs without attaching tibble or dplyr
  tibble::as_tibble(a$anova_table, rownames = "term") |>
    dplyr::rename(p = `Pr(>F)`)
}

Power Simulation: Analysis Result

# test on first data set
analysis_func(sim_data$data[[1]])

term	num Df	den Df	MSE	F	ges	p
age_group	1	58	1,686.9	0.00	0.000	0.991
version	1	58	184.0	12.42	0.017	0.001
age_group:version	1	58	184.0	26.69	0.036	0.000
condition	1	58	181.9	4.35	0.006	0.041
age_group:condition	1	58	181.9	1.90	0.003	0.173
version:condition	1	58	194.3	0.02	0.000	0.883
age_group:version:condition	1	58	194.3	0.27	0.000	0.604

Power Simulation

library(dplyr)  # mutate(), select(), group_by(), summarise()
library(tidyr)  # unnest()

power <- sim_data |>
  mutate(analysis = purrr::map(data, analysis_func)) |>
  select(-data) |>
  unnest(analysis) |>
  group_by(term) |>
  summarise(power = mean(p < .05))

term	power
age_group	0.11
age_group:condition	0.27
age_group:version	0.98
age_group:version:condition	0.30
condition	0.31
version	0.99
version:condition	0.31
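
The power column is the proportion of simulated datasets in which a term's p-value falls below .05. A minimal base-R sketch of the same idea follows, using a paired t-test on two correlated measures in place of the mixed ANOVA above. The effect size, sample size, number of replications, and the helper name `sim_p` are all invented for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# simulate one dataset and return the p-value of a paired t-test
sim_p <- function(n = 30, mu = c(100, 110), sd = 20, r = 0.5) {
  a <- rnorm(n)
  # b has population correlation r with a
  b <- r * a + sqrt(1 - r^2) * rnorm(n)
  x <- mu[1] + sd * a
  y <- mu[2] + sd * b
  t.test(x, y, paired = TRUE)$p.value
}

# repeat the simulation and take the proportion of significant results
p_values <- replicate(1000, sim_p())
power <- mean(p_values < .05)
power
```

Because power is estimated from a finite number of replications, it carries Monte Carlo error: with 100 replications the standard error of a proportion can be as large as 0.05, so more replications are worth the run time when a precise estimate matters.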

SynthPop

synthpop.org.uk

R package and shiny app for generating synthetic versions of sensitive microdata for statistical disclosure control.

Example Use

SYLLS Synthetic Data

Synthetic data are given only to approved researchers who are granted access to the original sensitive data after signing a disclaimer.

Researchers can make substantial cost and time savings on visits to safe havens.

Getting Started

Read in data from the Scholarly Migration Database

file <- "https://raw.githubusercontent.com/MPIDR/Global-flows-and-rates-of-international-migration-of-scholars/master/data_processed/scopus_2024_V1_scholarlymigration_country_enriched.csv"
scholar <- readr::read_csv(file)
scholar_sa <- dplyr::filter(scholar, region == "South Asia")
head(scholar_sa, 1) |> dplyr::glimpse()

Rows: 1
Columns: 16
$ year                             <dbl> 1999
$ countrycode                      <chr> "AFG"
$ padded_population_of_researchers <dbl> 6
$ number_of_inmigrations           <dbl> 1
$ number_of_outmigrations          <dbl> 0
$ netmigration                     <dbl> 1
$ outmigrationrate                 <dbl> 0
$ inmigrationrate                  <dbl> 0.1666667
$ netmigrationrate                 <dbl> 0.1666667
$ iso2code                         <chr> "AF"
$ iso3code                         <chr> "AFG"
$ countryname                      <chr> "Afghanistan"
$ region                           <chr> "South Asia"
$ incomelevel                      <chr> "LIC"
$ gdp_per_capita                   <dbl> 311.8536
$ population                       <dbl> 19262847

Codebook.syn

synthpop::codebook.syn(scholar_sa)$tab

                           variable     class nmiss perctmiss ndistinct details
1                              year   numeric     0         0        23        
2                       countrycode character     0         0         8        
3  padded_population_of_researchers   numeric     0         0       168        
4            number_of_inmigrations   numeric     0         0       125        
5           number_of_outmigrations   numeric     0         0       122        
6                      netmigration   numeric     0         0        90        
7                  outmigrationrate   numeric     0         0       162        
8                   inmigrationrate   numeric     0         0       165        
9                  netmigrationrate   numeric     0         0       167        
10                         iso2code character     0         0         8        
11                         iso3code character     0         0         8        
12                      countryname character     0         0         8        
13                           region character     0         0         1        
14                      incomelevel character     0         0         3        
15                   gdp_per_capita   numeric     0         0       175        
16                       population   numeric     0         0       177

Ready to Synthesise

Remove identifiers
Change any character (text) variables into factors
Include non-NA missing values in cont.na argument to syn()
Remove redundant or derivable variables
Note variables that depend on other variables (set with rules and rvalues of syn(), or manually add after synthesis)
If >12 variables or any factors with >20 levels, create a smaller and simpler data frame
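
Parts of this checklist can be screened automatically. The sketch below is a hypothetical base-R helper (`check_ready` is not a {synthpop} function) that flags character columns, factors with more than 20 levels, and data frames with more than 12 variables, using the thresholds from the list above. The `toy` data frame is invented for the example.

```r
# hypothetical helper: flag columns that need attention before synthesis
check_ready <- function(data, max_vars = 12, max_levels = 20) {
  is_chr <- vapply(data, is.character, logical(1))
  n_levels <- vapply(
    data,
    function(x) if (is.factor(x)) nlevels(x) else 0L,
    integer(1)
  )
  list(
    character_cols = names(data)[is_chr],                 # convert to factors
    big_factors    = names(data)[n_levels > max_levels],  # simplify or drop
    too_many_vars  = ncol(data) > max_vars                # consider a subset
  )
}

# toy data frame invented for this example
toy <- data.frame(
  id      = 1:30,
  country = rep(c("A", "B", "C"), 10),
  code    = factor(sprintf("c%02d", 1:30)),
  score   = rnorm(30),
  stringsAsFactors = FALSE
)

check_ready(toy)
```

Identifiers, redundant variables, and derived variables still need human judgement; a helper like this only catches the mechanical items.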

Clean Dataset

scholar2 <- scholar_sa |>
  # remove columns that are redundant with countryname
  select(-countrycode, -iso2code, -iso3code, -region) |>
  
  # netmigration = number_of_inmigrations - number_of_outmigrations
  select(-netmigration) |>
  
  # rates can be recalculated from the counts and padded_population_of_researchers
  select(-c(outmigrationrate:netmigrationrate)) |>
  
  # text columns as factors, order incomelevel
  mutate(
    countryname = factor(countryname),
    incomelevel = factor(incomelevel, c("LIC", "LMC", "UMC", "HIC", "INX"))
  )

Simulate Data

library(synthpop)  # provides syn(), compare(), and write.syn()

scholar_sa_syn <- syn(scholar2)
summary(scholar_sa_syn)

Synthetic object with one synthesis using methods:
                            year padded_population_of_researchers 
                        "sample"                           "cart" 
          number_of_inmigrations          number_of_outmigrations 
                          "cart"                           "cart" 
                     countryname                      incomelevel 
                          "cart"                           "cart" 
                  gdp_per_capita                       population 
                          "cart"                           "cart" 

      year      padded_population_of_researchers number_of_inmigrations
 Min.   :1998   Min.   :     4                   Min.   :   0.0        
 1st Qu.:2003   1st Qu.:   171                   1st Qu.:  15.0        
 Median :2008   Median :  2262                   Median :  74.0        
 Mean   :2009   Mean   : 31462                   Mean   : 302.5        
 3rd Qu.:2014   3rd Qu.:  9869                   3rd Qu.: 261.0        
 Max.   :2020   Max.   :392111                   Max.   :2562.0        
                                                                       
 number_of_outmigrations      countryname incomelevel gdp_per_capita   
 Min.   :   0.0          Bangladesh :30   LIC: 23     Min.   :  182.2  
 1st Qu.:  11.0          Sri Lanka  :24   LMC:137     1st Qu.:  526.1  
 Median :  79.0          Afghanistan:23   UMC: 17     Median :  978.2  
 Mean   : 459.1          India      :23   HIC:  0     Mean   : 1643.6  
 3rd Qu.: 268.0          Pakistan   :22   INX:  0     3rd Qu.: 1815.6  
 Max.   :5285.0          Nepal      :20               Max.   :11349.9  
                         (Other)    :35                                
   population       
 Min.   :3.021e+05  
 1st Qu.:1.911e+07  
 Median :2.746e+07  
 Mean   :2.150e+08  
 3rd Qu.:1.618e+08  
 Max.   :1.396e+09

Check

compare(scholar_sa_syn, scholar2, stat = "counts")

Export

write.syn(scholar_sa_syn, "scholar_sa_syn", filetype = "csv")

Synthetic data exported as csv file(s).
Information on synthetic data written to
  /Users/debruine/rproj/debruine/talks/rostock-datasim/synthesis_info_scholar_sa_syn.txt

Import and Recalculate

scholar_sa_syn2 <- readr::read_csv(
  file = "scholar_sa_syn.csv",
  col_types = readr::cols(
    incomelevel = readr::col_factor(c("LIC", "LMC", "UMC", "HIC", "INX"))
  )) |>
  mutate(
    netmigration = number_of_inmigrations - number_of_outmigrations,
    outmigrationrate = number_of_outmigrations /
      padded_population_of_researchers,
    inmigrationrate = number_of_inmigrations / 
      padded_population_of_researchers,
    netmigrationrate = inmigrationrate - outmigrationrate
  )

Plot Checks

Further Resources

Thank You!
