Simulate from Existing Data
library(faux)
library(tidyverse)
I added a new function, sim_df(), to the faux package. It generates a new dataframe from an existing one by simulating all numeric columns from normal distributions with the same means, SDs, and correlation structure as the existing data. (Update: faux is now on CRAN!)
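To make the idea concrete, here is a minimal sketch of simulating from a multivariate normal distribution that matches an existing dataset. It is not necessarily how faux does it internally: it assumes MASS::mvrnorm() as the sampler, and sim_numeric_sketch() is a hypothetical helper name used only for illustration.

```r
# Hypothetical helper, not part of faux: simulate n rows (n >= 2) from a
# multivariate normal distribution whose means, SDs, and correlations match
# the numeric columns of `data`.
sim_numeric_sketch <- function(data, n) {
  # keep only the numeric columns
  num <- data[vapply(data, is.numeric, logical(1))]
  mu <- colMeans(num)
  sigma <- cov(num)  # the covariance matrix encodes both SDs and correlations
  # MASS ships with R; mvrnorm() returns an n x p matrix with named columns
  as.data.frame(MASS::mvrnorm(n, mu = mu, Sigma = sigma))
}

set.seed(1)  # arbitrary seed so the sketch is reproducible
sim <- sim_numeric_sketch(cars, 500)
head(sim)
```

Because the draws are random, the simulated means, SDs, and correlation will be close to the originals but not identical.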
For example, here is the relationship between speed and distance in the built-in dataset cars.
cars %>%
  ggplot(aes(speed, dist)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
You can create a new sample with the same parameters and 500 rows with the code sim_df(cars, 500).
sim_df(cars, 500) %>%
  ggplot(aes(speed, dist)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
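A quick way to confirm that the simulated sample resembles the original is to compare the summary statistics directly. This is only a sketch: the seed is arbitrary, and the original columns are selected by name on the assumption that the output of sim_df() may contain extra columns (such as an ID) depending on the faux version.

```r
library(faux)

set.seed(8675309)  # arbitrary seed so the sketch is reproducible
sim_cars <- sim_df(cars, 500)

# Select the original columns by name, in case the output has extra columns
sim_num <- sim_cars[, names(cars)]

# Means and SDs side by side
rbind(original = colMeans(cars), simulated = colMeans(sim_num))
rbind(original = sapply(cars, sd), simulated = sapply(sim_num, sd))

# Correlation between speed and dist in each dataset
c(original  = cor(cars$speed, cars$dist),
  simulated = cor(sim_num$speed, sim_num$dist))
```

The simulated values should land near the originals, with some sampling error.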
You can also add grouping variables. For example, here is the relationship between sepal length and width in the built-in dataset iris.
iris %>%
  ggplot(aes(Sepal.Width, Sepal.Length, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
And here is a new sample with 50 observations of each species, made with the code sim_df(iris, 50, between = "Species").
sim_df(iris, 50, between = "Species") %>%
  ggplot(aes(Sepal.Width, Sepal.Length, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
For now, the function only creates new variables sampled from a continuous normal distribution, so you need to do any rounding or truncating yourself. I hope to add other sampling distributions in the future.
sim_df(iris, 50, between = "Species") %>%
  # round every numeric column to 1 decimal place, like the original iris data
  mutate(across(where(is.numeric), ~ round(.x, 1))) %>%
  ggplot(aes(Sepal.Width, Sepal.Length, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
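Truncating can be handled in a similar way. Below is a minimal sketch, assuming you simply want to clamp impossible values to a bound (for example, stopping distances below 0) rather than resample them from a truncated distribution. clamp() is a hypothetical helper, not part of faux, and the rnorm() call is only a stand-in for one simulated column.

```r
# Hypothetical helper (not part of faux): clamp values to [lower, upper].
clamp <- function(x, lower = -Inf, upper = Inf) {
  pmin(pmax(x, lower), upper)
}

set.seed(1)  # arbitrary seed so the sketch is reproducible
# Stand-in for one simulated column: normal draws with the mean and SD of cars$dist
sim_dist <- rnorm(500, mean = mean(cars$dist), sd = sd(cars$dist))

sum(sim_dist < 0)                      # count of negative simulated distances
clamped <- clamp(sim_dist, lower = 0)  # replace anything below 0 with 0
range(clamped)
```

Note that clamping piles values up at the bound, which slightly shifts the mean and SD. If that matters for your use case, resampling out-of-range values may be a better choice.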