Simulate Correlated Variables
Lisa DeBruine
2023-09-24
The rnorm_multi() function makes multiple normally distributed vectors with specified parameters and relationships.
Quick example
For example, the following creates a sample that has 100 observations of 3 variables, drawn from a population where A has a mean of 0 and SD of 1, while B and C have means of 20 and SDs of 5. A correlates with B and C with r = 0.5, and B and C correlate with r = 0.25.
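library(faux)  # provides rnorm_multi() and rnorm_pre()
# r lists the pairwise correlations in the order A-B, A-C, B-C
dat <- rnorm_multi(n = 100,
                   mu = c(0, 20, 20),
                   sd = c(1, 5, 5),
                   r = c(0.5, 0.5, 0.25),
                   varnames = c("A", "B", "C"),
                   empirical = FALSE)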
n | var | A | B | C | mean | sd |
---|---|---|---|---|---|---|
100 | A | 1.00 | 0.49 | 0.51 | -0.04 | 1.04 |
100 | B | 0.49 | 1.00 | 0.19 | 19.95 | 4.91 |
100 | C | 0.51 | 0.19 | 1.00 | 19.64 | 4.61 |
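To make the roles of mu, sd, and r concrete, the sketch below builds the population covariance matrix implied by the parameters above and draws from it with MASS::mvrnorm(). It illustrates the underlying idea only and is not a claim about how rnorm_multi() is implemented; the names R, Sigma, sds, and x are illustrative.

```r
library(MASS)  # provides mvrnorm(); a recommended package shipped with R

set.seed(1)  # arbitrary seed so the sketch is reproducible

mu  <- c(A = 0, B = 20, C = 20)
sds <- c(1, 5, 5)

# Correlation matrix: A-B = 0.5, A-C = 0.5, B-C = 0.25
R <- matrix(c(1,   0.5,  0.5,
              0.5, 1,    0.25,
              0.5, 0.25, 1), nrow = 3, byrow = TRUE)

# Covariance = D %*% R %*% D, where D is a diagonal matrix of SDs
Sigma <- diag(sds) %*% R %*% diag(sds)

# 100 observations of 3 variables; column names come from names(mu)
x <- as.data.frame(mvrnorm(n = 100, mu = mu, Sigma = Sigma))

# Sample statistics approximate, but do not equal, the population values
round(cor(x), 2)
round(colMeans(x), 2)
round(sapply(x, sd), 2)
```

As in the table above, the sample correlations, means, and SDs will scatter around the population values rather than match them exactly.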
Specify correlations
You can specify the correlations in one of four ways:
- A single r for all pairs
- A vars by vars matrix
- A vars*vars length vector
- A vars*(vars-1)/2 length vector
One Number
If you want all the pairs to have the same correlation, just specify a single number.
bvn <- rnorm_multi(100, 5, 0, 1, .3, varnames = letters[1:5])
n | var | a | b | c | d | e | mean | sd |
---|---|---|---|---|---|---|---|---|
100 | a | 1.00 | 0.32 | 0.27 | 0.34 | 0.17 | -0.10 | 1.03 |
100 | b | 0.32 | 1.00 | 0.34 | 0.16 | 0.22 | -0.16 | 1.09 |
100 | c | 0.27 | 0.34 | 1.00 | 0.24 | 0.21 | -0.05 | 1.05 |
100 | d | 0.34 | 0.16 | 0.24 | 1.00 | 0.27 | -0.02 | 1.02 |
100 | e | 0.17 | 0.22 | 0.21 | 0.27 | 1.00 | -0.10 | 0.97 |
Matrix
If you already have a correlation matrix, such as the output of cor(), you can specify the simulated data with that.
cmat <- cor(iris[, 1:4])
bvn <- rnorm_multi(100, 4, 0, 1, cmat,
                   varnames = colnames(cmat))
n | var | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | mean | sd |
---|---|---|---|---|---|---|---|
100 | Sepal.Length | 1.00 | -0.24 | 0.87 | 0.82 | 0.09 | 0.98 |
100 | Sepal.Width | -0.24 | 1.00 | -0.58 | -0.52 | 0.07 | 1.08 |
100 | Petal.Length | 0.87 | -0.58 | 1.00 | 0.96 | 0.04 | 1.03 |
100 | Petal.Width | 0.82 | -0.52 | 0.96 | 1.00 | 0.05 | 1.04 |
Vector (vars*vars)
You can specify your correlation matrix by hand as a vars*vars length vector, which will include the correlations of 1 down the diagonal.
cmat <- c( 1, .3, .5,
          .3,  1,  0,
          .5,  0,  1)
bvn <- rnorm_multi(100, 3, 0, 1, cmat,
                   varnames = c("first", "second", "third"))
n | var | first | second | third | mean | sd |
---|---|---|---|---|---|---|
100 | first | 1.00 | 0.31 | 0.48 | 0.05 | 1.02 |
100 | second | 0.31 | 1.00 | 0.01 | -0.14 | 0.86 |
100 | third | 0.48 | 0.01 | 1.00 | 0.02 | 1.12 |
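A hand-typed vars*vars vector is easy to get wrong, so it can help to reshape it and check it before simulating. The following is a minimal base R sketch of such a check; the name m is illustrative and the check is not part of rnorm_multi() itself.

```r
cmat <- c( 1, .3, .5,
          .3,  1,  0,
          .5,  0,  1)

# Reshape into a 3 x 3 matrix (vars = 3)
m <- matrix(cmat, nrow = 3, byrow = TRUE)

# A valid correlation matrix is symmetric with 1s on the diagonal
isSymmetric(m)
all(diag(m) == 1)
```

Both checks should return TRUE; a FALSE usually points to a mistyped or misplaced value.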
Vector (vars*(vars-1)/2)
You can specify your correlation matrix by hand as a vars*(vars-1)/2 length vector that lists the upper-right triangle row by row, skipping the diagonal and the duplicate lower-left values.
rho1_2 <- .3
rho1_3 <- .5
rho1_4 <- .5
rho2_3 <- .2
rho2_4 <- 0
rho3_4 <- -.3
cmat <- c(rho1_2, rho1_3, rho1_4, rho2_3, rho2_4, rho3_4)
bvn <- rnorm_multi(100, 4, 0, 1, cmat,
                   varnames = letters[1:4])
n | var | a | b | c | d | mean | sd |
---|---|---|---|---|---|---|---|
100 | a | 1.00 | 0.28 | 0.48 | 0.60 | -0.06 | 1.16 |
100 | b | 0.28 | 1.00 | 0.10 | 0.10 | 0.11 | 1.12 |
100 | c | 0.48 | 0.10 | 1.00 | -0.21 | 0.10 | 0.98 |
100 | d | 0.60 | 0.10 | -0.21 | 1.00 | -0.10 | 1.09 |
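To see which cell each element of the short vector fills, the sketch below expands the six values into the full 4 x 4 matrix using base R. This is one way to do the expansion by hand, offered as an illustration rather than as the package's own code; cors, vars, and m are illustrative names.

```r
# Order: rho1_2, rho1_3, rho1_4, rho2_3, rho2_4, rho3_4
cors <- c(.3, .5, .5, .2, 0, -.3)
vars <- 4

# The vector must have one entry per unique pair
stopifnot(length(cors) == vars * (vars - 1) / 2)

m <- matrix(0, nrow = vars, ncol = vars)

# lower.tri() indexes column by column: (2,1), (3,1), (4,1), (3,2), (4,2), (4,3),
# which matches the row-by-row order of the upper triangle
m[lower.tri(m)] <- cors

# Mirror into the upper triangle and set the diagonal
m <- m + t(m)
diag(m) <- 1
m
```

Reading across the first row of m gives rho1_2, rho1_3, rho1_4, which is the same order the vector was typed in.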
empirical
If you want your samples to have the exact correlations, means, and SDs you entered, set empirical to TRUE.
bvn <- rnorm_multi(100, 5, 0, 1, .3,
                   varnames = letters[1:5],
                   empirical = TRUE)
n | var | a | b | c | d | e | mean | sd |
---|---|---|---|---|---|---|---|---|
100 | a | 1.0 | 0.3 | 0.3 | 0.3 | 0.3 | 0 | 1 |
100 | b | 0.3 | 1.0 | 0.3 | 0.3 | 0.3 | 0 | 1 |
100 | c | 0.3 | 0.3 | 1.0 | 0.3 | 0.3 | 0 | 1 |
100 | d | 0.3 | 0.3 | 0.3 | 1.0 | 0.3 | 0 | 1 |
100 | e | 0.3 | 0.3 | 0.3 | 0.3 | 1.0 | 0 | 1 |
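The difference between sampling from a population and fixing the sample statistics can be shown with MASS::mvrnorm(), which has its own empirical argument. This is an analogy in a different function, used here as an assumption-free stand-in; it does not show the internals of rnorm_multi(). The names R, pop, and emp are illustrative.

```r
library(MASS)  # provides mvrnorm()

set.seed(2)  # arbitrary seed

# Five variables that all correlate 0.3 with each other
R <- matrix(0.3, nrow = 5, ncol = 5)
diag(R) <- 1

pop <- mvrnorm(100, mu = rep(0, 5), Sigma = R)                    # population parameters
emp <- mvrnorm(100, mu = rep(0, 5), Sigma = R, empirical = TRUE)  # exact sample parameters

max(abs(cor(pop) - R))  # sampling error
max(abs(cor(emp) - R))  # zero up to floating-point error
```

With empirical = TRUE every simulated dataset has the same correlations, means, and SDs, so it is not suitable when the goal is to study sampling variability.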
Pre-existing variables
Use rnorm_pre() to create a vector with a specified correlation to one or more pre-existing variables. The following code creates a new column called B with a mean of 10, an SD of 2, and a correlation of r = 0.5 with the A column.
library(dplyr)  # provides %>% and mutate()
dat <- rnorm_multi(varnames = "A") %>%
  mutate(B = rnorm_pre(A, mu = 10, sd = 2, r = 0.5))
n | var | A | B | mean | sd |
---|---|---|---|---|---|
100 | A | 1.00 | 0.37 | -0.03 | 1.10 |
100 | B | 0.37 | 1.00 | 10.02 | 2.28 |
Set empirical = TRUE to return a vector with the exact specified parameters.
dat$C <- rnorm_pre(dat$A, mu = 10, sd = 2, r = 0.5, empirical = TRUE)
n | var | A | B | C | mean | sd |
---|---|---|---|---|---|---|
100 | A | 1.00 | 0.37 | 0.50 | -0.03 | 1.10 |
100 | B | 0.37 | 1.00 | 0.15 | 10.02 | 2.28 |
100 | C | 0.50 | 0.15 | 1.00 | 10.00 | 2.00 |
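One standard way to build a vector with an exact sample correlation to an existing one is to mix the standardised existing vector with noise that has been made uncorrelated with it. The base R sketch below shows that construction under the same targets as above (mean 10, SD 2, r = 0.5). It is an illustration of the idea and is not presented as the method rnorm_pre() uses; x, e, e_perp, z, and y are illustrative names.

```r
set.seed(3)  # arbitrary seed

x <- rnorm(100)  # stand-in for a pre-existing column
r <- 0.5

e <- rnorm(100)              # fresh noise
e_perp <- resid(lm(e ~ x))   # the part of the noise uncorrelated with x

# Mix standardised x and standardised residual; the weights make the
# result have unit variance and sample correlation r with x
z <- r * as.numeric(scale(x)) + sqrt(1 - r^2) * as.numeric(scale(e_perp))

# Rescale to the target mean and SD
y <- 10 + 2 * as.numeric(scale(z))

# Equals 0.5, 10, and 2 up to floating-point error
c(cor = cor(x, y), mean = mean(y), sd = sd(y))
```

Skipping the residual step and the final rescaling gives the non-empirical behaviour, where the sample values only approximate the targets.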
You can also specify correlations to more than one vector by setting the first argument to a data frame containing only the continuous columns and r to the correlation with each column.
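As a sketch of what such a call might look like, the code below adds a column D that correlates 0.1, 0.2, and 0.3 with three existing columns. The data frame is rebuilt here so the example is self-contained, which means its A, B, and C are not the same values as in the tables of this section; the choice of empirical = TRUE is an assumption made so the requested correlations are reproduced exactly.

```r
library(faux)

# A small data frame with three continuous columns (illustrative values)
dat <- rnorm_multi(n = 100, vars = 3, mu = 0, sd = 1, r = 0.3,
                   varnames = c("A", "B", "C"))

# One correlation per existing column, in column order
dat$D <- rnorm_pre(dat, r = c(0.1, 0.2, 0.3), empirical = TRUE)

round(cor(dat)["D", c("A", "B", "C")], 2)
```

The length of r must match the number of columns in the data frame passed as the first argument.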
n | var | A | B | C | D | mean | sd |
---|---|---|---|---|---|---|---|
100 | A | 1.00 | 0.37 | 0.50 | 0.1 | -0.03 | 1.10 |
100 | B | 0.37 | 1.00 | 0.15 | 0.2 | 10.02 | 2.28 |
100 | C | 0.50 | 0.15 | 1.00 | 0.3 | 10.00 | 2.00 |
100 | D | 0.10 | 0.20 | 0.30 | 1.0 | 0.00 | 1.00 |
Not all correlation patterns are possible, so you’ll get a warning if the correlations you ask for are impossible.
dat$E <- rnorm_pre(dat, r = .9)
#> Warning in rnorm_pre(dat, r = 0.9): Correlations are impossible.
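A set of correlations is only possible if the full correlation matrix has no negative eigenvalues. The base R sketch below checks this for the request above, using the rounded correlations among A, B, C, and D from the last table and adding a fifth variable that correlates 0.9 with each of them. It is a hand check under those rounded values, not the test that rnorm_pre() runs; R and full are illustrative names.

```r
# Correlations among A, B, C, D, taken from the table above (rounded)
R <- matrix(c(1.00, 0.37, 0.50, 0.10,
              0.37, 1.00, 0.15, 0.20,
              0.50, 0.15, 1.00, 0.30,
              0.10, 0.20, 0.30, 1.00), nrow = 4, byrow = TRUE)

# Add a fifth variable E that correlates 0.9 with every existing column
full <- rbind(cbind(R, 0.9), c(rep(0.9, 4), 1))

# The existing matrix is fine: all eigenvalues are positive
min(eigen(R, symmetric = TRUE)$values)

# The extended matrix has a negative eigenvalue, so the request is impossible
min(eigen(full, symmetric = TRUE)$values)
```

Intuitively, one variable cannot correlate 0.9 with four others that are only weakly related to each other.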