Mann-Whitney False Positives

One of my favourite colleagues, Phil McAleer, asked about unequal sample sizes for Mann-Whitney tests on our group chat today. I had no idea, so, as always, I thought “This is a job for simulations!”

I started by loading tidyverse, since I know I’ll need to wrangle data and plot things. I’m starting to get more comfortable with base R for package development, and it can make things faster, but tidyverse is my favourite for a quick analysis or a pipeline where understandability is more important than speed.

And I set my favourite seed so my simulations will give me a reproducible answer.


Then I wrote a wee function to simulate data with the parameters I’m interested in varying, run a Mann-Whitney test, and return the p-value (all I need to look at power and false positives).

First, I just wanted to look at false positives for different sample size, so I set n1 and n2 as arguments and set alternative with a default of “two.sided”. The function wilcox.test runs a Mann-Whitney test for independent samples.

mw <- function(n1, n2, alternative = "two.sided") {
  x1 <- rnorm(n1)
  x2 <- rnorm(n2)
  w <- wilcox.test(x1, x2, alternative = alternative)

Now I need to set up a table with all of the values I want to run simulations for. I set n1 and n2 to the numbers 10 to 100 in steps of ten. This was crossed with the number of replications I wanted to run (1000). I then removed the values where n2 > n1, since they’re redundant with the opposite version (e.g., n1 = 10, n2 = 20 is the same as n1 = 20, n2 = 10).

params <- expand.grid(
  n1 = seq(10, 100, 10), 
  n2 = seq(10, 100, 10),
  reps = 1:1000
) %>%
  filter(n1 <= n2)
n1 n2 reps
10 10 1
10 20 1
20 20 1
80 100 1000
90 100 1000
100 100 1000

I then used the pmap_dbl function from purrr to map the values from n1 and n2 onto mw, then grouped the results by n1 and n2 and calculated false_pos as the proportion of p less than alpha.

alpha <- 0.05

mw1 <- params %>%
  mutate(p = pmap_dbl(list(n1, n2), mw)) %>%
  group_by(n1, n2) %>%
  summarise(false_pos = mean(p < alpha), .groups = "drop")

Then I plotted the false positive rate for each combination against the difference between n1 and n2. You can see that the false positive rate is approximately nominal, or equal to the specified alpha of 0.05.

ggplot(mw1, aes(n2 - n1, false_pos)) +
  geom_point(aes(color = as.factor(n1))) +
  geom_smooth(formula = y ~ x, method = lm, color = "black") +
  labs(x = "N2 - N1", 
         y = "False Positive Rate",
         color = "N1") +
  ylim(0, .1)

But what if data aren’t drawn from a normal distribution? We can change the mw() function to simulate data from a different distribtion, such as uniform, and run the whole process again.

mw <- function(n1, n2, alternative = "two.sided") {
  x1 <- runif(n1)
  x2 <- runif(n2)
  w <- wilcox.test(x1, x2, alternative = alternative)

The rest of our code is identical to above.

mw2 <- params %>%
  mutate(p = pmap_dbl(list(n1, n2), mw)) %>%
  group_by(n1, n2) %>%
  summarise(false_pos = mean(p < alpha), .groups = "drop")

This doesn’t seem to make much difference.

What if the variance between the two samples is different? First, let’s adjust the mw() function to vary the SD of the two samples. We’ll give sd1 a default value of 1 and sd2 will default to the same as sd1. We might as well add the option to change the means, so default m1 to 0 and m2 to the same as m1.

mw <- function(n1, m1 = 0, sd1 = 1, 
               n2 = n1, m2 = m1, sd2 = sd1,
               alternative = "two.sided") {
  x1 <- rnorm(n1, m1, sd1)
  x2 <- rnorm(n2, m2, sd2)
  w <- wilcox.test(x1, x2, alternative = alternative)

Now we need to set up a new list of parameters to change. The Ns didn’t make much difference last time, so let’s vary them in steps of 20 this time. We’ll vary sd1 and sd2 from 0.5 to 2 in steps of 0.5, and also only keep combinations where sd1 is less than or equal to sd2 to avoid redundancy.

params <- expand.grid(
  reps = 1:1000,
  n1 = seq(10, 100, 20), 
  n2 = seq(10, 100, 20),
  sd1 = seq(0.5, 2, 0.5),
  sd2 = seq(0.5, 2, 0.5)
) %>%
  filter(n1 <= n2, sd1 <= sd2)
mw3 <- params %>%
  mutate(p = pmap_dbl(list(n1, 0, sd1, n2, 0, sd2), mw)) %>%
  group_by(n1, n2, sd1, sd2) %>%
  summarise(false_pos = mean(p < alpha), .groups = "drop")

It looks like differences in SD make a big difference in the false positive rate, and the effect is bigger as Ns and SDs get more unequal.

ggplot(mw3, aes(sd2 - sd1, false_pos, color = as.factor(n2-n1))) +
  geom_point() +
  geom_smooth(formula = y ~ x, method = lm) +
  labs(x = "SD2 - SD1", 
         y = "False Positive Rate",
         color = "N2-N1")

I’ll leave it to the enterprising reader to simulate power for different effect sizes.

Lisa DeBruine
Lisa DeBruine
Professor of Psychology

Lisa DeBruine is a professor of psychology at the University of Glasgow. Her substantive research is on the social perception of faces and kinship. Her meta-science interests include team science (especially the Psychological Science Accelerator), open documentation, data simulation, web-based tools for data collection and stimulus generation, and teaching computational reproducibility.