Mann-Whitney False Positives
One of my favourite colleagues, Phil McAleer, asked about unequal sample sizes for Mann-Whitney tests on our group chat today. I had no idea, so, as always, I thought “This is a job for simulations!”
I started by loading tidyverse, since I know I’ll need to wrangle data and plot things. I’m starting to get more comfortable with base R for package development, and it can make things faster, but tidyverse is my favourite for a quick analysis or a pipeline where understandability is more important than speed.
And I set my favourite seed so my simulations will give me a reproducible answer.
library(tidyverse)
set.seed(8675309)
Then I wrote a wee function to simulate data with the parameters I’m interested in varying, run a Mann-Whitney test, and return the p-value (all I need to look at power and false positives).
First, I just wanted to look at false positives for different sample sizes, so I set n1 and n2 as arguments and set alternative with a default of "two.sided". The wilcox.test function runs a Mann-Whitney test for independent samples.
mw <- function(n1, n2, alternative = "two.sided") {
  x1 <- rnorm(n1)
  x2 <- rnorm(n2)
  w <- wilcox.test(x1, x2, alternative = alternative)
  w$p.value
}
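Just to sanity-check the function before scaling up, a single call (my own quick check, not part of the original pipeline) should return one p-value between 0 and 1.
mw(n1 = 10, n2 = 20)  # one simulated Mann-Whitney p-value; the Ns here are arbitrary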
Now I need to set up a table with all of the values I want to run simulations for. I set n1 and n2 to the numbers 10 to 100 in steps of ten, crossed with the number of replications I wanted to run (1000). I then removed the rows where n2 < n1, since they're redundant with the opposite version (e.g., n1 = 20, n2 = 10 is the same as n1 = 10, n2 = 20).
params <- expand.grid(
  n1 = seq(10, 100, 10),
  n2 = seq(10, 100, 10),
  reps = 1:1000
) %>%
  filter(n1 <= n2)
| n1  | n2  | reps |
|-----|-----|------|
| 10  | 10  | 1    |
| 10  | 20  | 1    |
| 20  | 20  | 1    |
| …   | …   | …    |
| 80  | 100 | 1000 |
| 90  | 100 | 1000 |
| 100 | 100 | 1000 |
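With 10 values each for n1 and n2, keeping only the combinations where n1 <= n2 leaves 55 pairs, so crossed with 1000 replications the table should have 55,000 rows. A quick check (my addition) confirms the size.
nrow(params)  # 55 unique (n1, n2) pairs x 1000 reps = 55,000 rows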
I then used the pmap_dbl function from purrr to map the values of n1 and n2 onto mw, then grouped the results by n1 and n2 and calculated false_pos as the proportion of p-values less than alpha.
alpha <- 0.05
mw1 <- params %>%
  mutate(p = pmap_dbl(list(n1, n2), mw)) %>%
  group_by(n1, n2) %>%
  summarise(false_pos = mean(p < alpha), .groups = "drop")
Then I plotted the false positive rate for each combination against the difference between n1 and n2. You can see that the false positive rate is approximately nominal, i.e., roughly equal to the specified alpha of 0.05.
ggplot(mw1, aes(n2 - n1, false_pos)) +
  geom_point(aes(color = as.factor(n1))) +
  geom_smooth(formula = y ~ x, method = lm, color = "black") +
  labs(x = "N2 - N1",
       y = "False Positive Rate",
       color = "N1") +
  ylim(0, .1)
But what if the data aren't drawn from a normal distribution? We can change the mw() function to simulate data from a different distribution, such as a uniform distribution, and run the whole process again.
mw <- function(n1, n2, alternative = "two.sided") {
  x1 <- runif(n1)
  x2 <- runif(n2)
  w <- wilcox.test(x1, x2, alternative = alternative)
  w$p.value
}
The rest of our code is identical to above.
mw2 <- params %>%
  mutate(p = pmap_dbl(list(n1, n2), mw)) %>%
  group_by(n1, n2) %>%
  summarise(false_pos = mean(p < alpha), .groups = "drop")
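To see whether anything changes, we can reuse the plot code from above with mw2 in place of mw1 (just a repeat of the earlier plot, nothing new).
ggplot(mw2, aes(n2 - n1, false_pos)) +
  geom_point(aes(color = as.factor(n1))) +
  geom_smooth(formula = y ~ x, method = lm, color = "black") +
  labs(x = "N2 - N1",
       y = "False Positive Rate",
       color = "N1") +
  ylim(0, .1)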
This doesn’t seem to make much difference.
What if the variance of the two samples differs? First, let's adjust the mw() function to vary the SD of each sample. We'll give sd1 a default value of 1, and sd2 will default to the same value as sd1. We might as well add the option to change the means too, so m1 defaults to 0 and m2 to the same value as m1.
mw <- function(n1, m1 = 0, sd1 = 1,
               n2 = n1, m2 = m1, sd2 = sd1,
               alternative = "two.sided") {
  x1 <- rnorm(n1, m1, sd1)
  x2 <- rnorm(n2, m2, sd2)
  w <- wilcox.test(x1, x2, alternative = alternative)
  w$p.value
}
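Because the defaults cascade (n2 falls back to n1, m2 to m1, and sd2 to sd1), you only need to supply the arguments you want to change. For example (my own illustration, not part of the original analysis):
# two samples of 20, both drawn from N(0, 1.5), since n2, m2, and sd2
# all default to their group-1 counterparts
mw(n1 = 20, sd1 = 1.5)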
Now we need to set up a new table of parameters to vary. The Ns didn't make much difference last time, so let's vary them in steps of 20 this time. We'll vary sd1 and sd2 from 0.5 to 2 in steps of 0.5, and (in addition to keeping n1 <= n2 as before) only keep the combinations where sd1 is less than or equal to sd2 to avoid redundancy.
params <- expand.grid(
  reps = 1:1000,
  n1 = seq(10, 100, 20),
  n2 = seq(10, 100, 20),
  sd1 = seq(0.5, 2, 0.5),
  sd2 = seq(0.5, 2, 0.5)
) %>%
  filter(n1 <= n2, sd1 <= sd2)
mw3 <- params %>%
  mutate(p = pmap_dbl(list(n1, 0, sd1, n2, 0, sd2), mw)) %>%
  group_by(n1, n2, sd1, sd2) %>%
  summarise(false_pos = mean(p < alpha), .groups = "drop")
It looks like differences in SD make a big difference in the false positive rate, and the effect is bigger as Ns and SDs get more unequal.
ggplot(mw3, aes(sd2 - sd1, false_pos, color = as.factor(n2 - n1))) +
  geom_point() +
  geom_smooth(formula = y ~ x, method = lm) +
  labs(x = "SD2 - SD1",
       y = "False Positive Rate",
       color = "N2 - N1")
I’ll leave it to the enterprising reader to simulate power for different effect sizes.
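As a starting point for that exercise, here's a minimal sketch using the final version of mw(): it shifts the mean of the second group by 0.5 (an arbitrary, illustrative effect size, not a recommendation) and counts the proportion of significant results for each n1.
# power simulation sketch: equal Ns, equal SDs, mean difference of 0.5
power_params <- expand.grid(
  reps = 1:1000,
  n1 = seq(10, 100, 10)
)

power <- power_params %>%
  mutate(p = map_dbl(n1, ~ mw(n1 = .x, m2 = 0.5))) %>%
  group_by(n1) %>%
  summarise(power = mean(p < alpha), .groups = "drop")
From there, it's easy to extend to unequal Ns or SDs by varying the other arguments of mw().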