Examples and informal evaluations of various statistical significance tests for comparing observations generated from different distributions and families

- t-test is OK to perform for non-normal data and has the same false positive rate across different distributions, including skewed ones
- t-test has the greatest power of the bunch, with the small effect detected more often than when the other two tests were used with the same dataset
- with a per-group sample size of 100+, all three tests offer the same conclusion, so when working with larger samples just use a t-test because it’s easier to interpret/communicate and to perform (computationally speaking)

**NOTE**: Only some of the code has been included on this page. The full code (including chunks responsible for creating tables and charts) can be found in the R Markdown source document.

The two most difficult problems in statistics are:

- \(p \gg N\) (when the number of dimensions greatly exceeds the sample size)
- the question “Can I use a t-test here?”

There is a wealth of literature on methods and best practices for dealing with high dimensionality, but there is not a lot of consistency when it comes to knowing when a t-test is appropriate and recommending its use in practical settings. There are two schools of thought:

- there are those who assert that the data needs to be normally-distributed and a non-parametric test should be employed otherwise
- there are those who assert that the distributional assumptions of a t-test apply to the test statistic only; so the test can be employed even with skewed data

Pananos (2020) started his blog post with:

I hate the objection “I can’t use the t-test, my data aren’t normal”. I see it all the time on Cross Validated when a data analyst is tasked with analyzing an experiment after it has been performed. They have piles of data, thousands of observations, and they have no idea what to do with it. They know of the t-test, but they erroneously believe (through no fault of their own) that the t-test is only valid if their data are normal.

Honestly, it wasn’t until I read that post that I questioned my own belief that I’ve been taught. It’s easy to see how one might be led to believe this – to be indoctrinated into that first school of thought.

The

ttests are based on an assumption that data come from the normal distribution

writes Dalgaard (2008).

We shall suppose that the variables \(X_1, \ldots, X_n\) form a random sample from a normal distribution

write DeGroot & Schervish (2012).

Let \(X_1, \ldots, X_n\) be a random sample from a \(n(\mu_X, \sigma^2_X)\)…

write Casella & Berger (2002).

These notes are an extension of Pananos’ post on Cross Validated (n.d.). Here I examine how the data generating process (in this case the distribution from which the observations are drawn) and the choice of statistical test (t-test and its non-parametric alternatives Wilcoxon-Mann-Whitney test and Kolmogorov-Smirnov test) affect the results. The four scenarios (distributions) are:

The idea here is to specify different centers and shapes of the various distributions for every group, except the “none” group which we include for measuring type I error (aka false positive rate (FPR), aka false rejection rate (FRR), aka \(\alpha\)): the probability of incorrectly rejecting the null hypothesis when the null hypothesis is true. The null hypothesis will be true for the “none” group because it has the exact same distribution as the “control” group.

```
simulation_parameters <- list(
# Normal:
a = list(
control = list(mu = 20, sigma = 2),
none = list(mu = 20, sigma = 2),
small = list(mu = 22, sigma = 2),
medium = list(mu = 26, sigma = 2),
large = list(mu = 28, sigma = 2)
),
# Poisson:
b = list(
control = list(lambda = 2),
none = list(lambda = 2),
small = list(lambda = 4),
medium = list(lambda = 6),
large = list(lambda = 8)
),
# Gamma:
c = list(
control = list(shape = 2, rate = 1 / 100),
none = list(shape = 2, rate = 1 / 100),
small = list(shape = 4, rate = 1 / 100),
medium = list(shape = 6, rate = 1 / 100),
large = list(shape = 8, rate = 1 / 100)
),
# Beta:
d = list(
control = list(shape1 = 10, shape2 = 20),
none = list(shape1 = 10, shape2 = 20),
small = list(shape1 = 15, shape2 = 20),
medium = list(shape1 = 20, shape2 = 20),
large = list(shape1 = 25, shape2 = 20)
)
)
```

Let’s visualize what the different groups’ distributions look like (using the interval plot) with those parameters:

Next, we define a `simulate_data()`

function for simulating data given the various parameters.

```
simulate_data <- function(N, sim_params) {
# sim_params is a list of scenarios; each scenario will be a named list with
# "control", "none", etc.
simulations <- list(
a = map(sim_params[["a"]], ~ rnorm(N, mean = .x$mu, sd = .x$sigma)),
b = map(sim_params[["b"]], ~ rpois(N, lambda = .x$lambda)),
c = map(sim_params[["c"]], ~ rgamma(N, shape = .x$shape, rate = .x$rate)),
d = map(sim_params[["d"]], ~ rbeta(N, shape1 = .x$shape1, shape2 = .x$shape2))
)
# This will create a tibble with columns "scenario", "group", "y":
tidy_simulations <- simulations %>%
# Turn each scenario into a (wide) tibble with each group as a column:
map(as_tibble) %>%
# Then we want to reshape it into a long tibble:
map_dfr(
pivot_longer,
cols = all_of(names(sim_params[[1]])),
# ^ in case only "control" & "none" are provided
names_to = "group",
values_to = "y",
.id = "scenario"
)
return(tidy_simulations)
}
```

```
set.seed(20200921)
n_example_size <- 2000
example_data <- simulate_data(N = n_example_size, sim_params = simulation_parameters)
```

Here are the distributions of the data in the different scenarios from this one single simulated dataset of 2000 randomly generated observations **per group**:

Given a single (simulated) dataset, we need to perform each of the three statistical tests on each pair of groups, always comparing to the control group. This work is captured by the `analyze_data()`

function:

```
analyze_data <- function(simulated_data) {
comparison_groups <- c("none", "small", "medium", "large")
names(comparison_groups) <- paste("control vs", comparison_groups)
scenarios <- set_names(c("a", "b", "c", "d"))
comparisons <- map_dfr(
scenarios,
function(scenario) {
map_dfr(
comparison_groups, function(comparison_group) {
sim_data <- simulated_data[simulated_data[["scenario"]] == scenario, ] %>%
filter(group %in% c("control", comparison_group)) %>%
mutate(group = factor(group))
tests <- list(
t_test = t.test(y ~ group, data = sim_data),
wilcox_test = wilcox.test(y ~ group, data = sim_data),
ks_test = ks.test(
x = sim_data$y[sim_data$group == "control"],
y = sim_data$y[sim_data$group != "control"]
)
)
results <- map_dfr(tests, broom::tidy, .id = "statistical_test")
return(select(results, statistical_test, statistic, p.value, method))
},
.id = "comparison"
)
},
.id = "scenario"
)
return(comparisons)
}
```

Now we want to simulate and analyze thousands of times and also across different sample sizes (per group), to study the impact of sample sizes on robustness of those tests:

```
set.seed(42L)
# Number of replications to perform:
n_replications <- 100000
# Sample sizes to explore at each replication:
sample_sizes <- c(25, 50, 100)
analysis_results <- seq(1, n_replications) %>%
map_dfr(
function(replication) {
# At each rep, simulate & analyze data w/ various sample sizes:
analysis_result <- sample_sizes %>%
set_names() %>%
map(simulate_data, sim_params = simulation_parameters) %>%
map_dfr(analyze_data, .id = "sample_size")
return(analysis_result)
},
.id = "replication"
)
```

For each combination of scenario and sample size, we calculate the proportion of replications where the null hypothesis would be rejected (based on the p-value and significance level \(\alpha = 0.05\)). As a reminder, in case of the “control vs none” comparison, the null hypothesis (that the means/distributions are the same between the two groups) *is* true, so we would expect the proportion to equal \(\alpha\) – the probability of a false rejection.

Two-sided null hypothesis rejection rate | ||||
---|---|---|---|---|

At α=0.05 significance level | ||||

Statistical test by scenario | control vs none | control vs small | control vs medium | control vs large |

a (Normal), N = 25 | ||||

Student's t-test | 5.0% | 93.5% | 100.0% | 100.0% |

Wilcoxon-Mann-Whitney test | 4.9% | 92.2% | 100.0% | 100.0% |

Kolmogorov-Smirnov Test | 3.6% | 81.7% | 100.0% | 100.0% |

a (Normal), N = 50 | ||||

Student's t-test | 5.0% | 99.9% | 100.0% | 100.0% |

Wilcoxon-Mann-Whitney test | 4.8% | 99.8% | 100.0% | 100.0% |

Kolmogorov-Smirnov Test | 3.8% | 98.8% | 100.0% | 100.0% |

a (Normal), N = 100 | ||||

Student's t-test | 5.1% | 100.0% | 100.0% | 100.0% |

Wilcoxon-Mann-Whitney test | 5.1% | 100.0% | 100.0% | 100.0% |

Kolmogorov-Smirnov Test | 3.6% | 100.0% | 100.0% | 100.0% |

b (Poisson), N = 25 | ||||

Student's t-test | 5.0% | 98.2% | 100.0% | 100.0% |

Wilcoxon-Mann-Whitney test | 4.9% | 97.6% | 100.0% | 100.0% |

Kolmogorov-Smirnov Test | 0.9% | 82.4% | 100.0% | 100.0% |

b (Poisson), N = 50 | ||||

Student's t-test | 5.1% | 100.0% | 100.0% | 100.0% |

Wilcoxon-Mann-Whitney test | 5.0% | 100.0% | 100.0% | 100.0% |

Kolmogorov-Smirnov Test | 0.9% | 99.3% | 100.0% | 100.0% |

b (Poisson), N = 100 | ||||

Student's t-test | 4.9% | 100.0% | 100.0% | 100.0% |

Wilcoxon-Mann-Whitney test | 4.9% | 100.0% | 100.0% | 100.0% |

Kolmogorov-Smirnov Test | 0.7% | 100.0% | 100.0% | 100.0% |

c (Gamma), N = 25 | ||||

Student's t-test | 4.8% | 98.1% | 100.0% | 100.0% |

Wilcoxon-Mann-Whitney test | 5.0% | 98.8% | 100.0% | 100.0% |

Kolmogorov-Smirnov Test | 3.7% | 95.3% | 100.0% | 100.0% |

c (Gamma), N = 50 | ||||

Student's t-test | 4.9% | 100.0% | 100.0% | 100.0% |

Wilcoxon-Mann-Whitney test | 5.0% | 100.0% | 100.0% | 100.0% |

Kolmogorov-Smirnov Test | 4.0% | 100.0% | 100.0% | 100.0% |

c (Gamma), N = 100 | ||||

Student's t-test | 5.0% | 100.0% | 100.0% | 100.0% |

Wilcoxon-Mann-Whitney test | 5.0% | 100.0% | 100.0% | 100.0% |

Kolmogorov-Smirnov Test | 3.6% | 100.0% | 100.0% | 100.0% |

d (Beta), N = 25 | ||||

Student's t-test | 4.9% | 97.5% | 100.0% | 100.0% |

Wilcoxon-Mann-Whitney test | 5.0% | 96.9% | 100.0% | 100.0% |

Kolmogorov-Smirnov Test | 3.6% | 90.7% | 100.0% | 100.0% |

d (Beta), N = 50 | ||||

Student's t-test | 5.1% | 100.0% | 100.0% | 100.0% |

Wilcoxon-Mann-Whitney test | 5.0% | 100.0% | 100.0% | 100.0% |

Kolmogorov-Smirnov Test | 4.0% | 99.8% | 100.0% | 100.0% |

d (Beta), N = 100 | ||||

Student's t-test | 5.0% | 100.0% | 100.0% | 100.0% |

Wilcoxon-Mann-Whitney test | 5.1% | 100.0% | 100.0% | 100.0% |

Kolmogorov-Smirnov Test | 3.7% | 100.0% | 100.0% | 100.0% |

Rate of agreement on rejecting or failing to reject | ||||
---|---|---|---|---|

At α=0.05 significance level | ||||

Statistical tests | control vs none | control vs small | control vs medium | control vs large |

a (Normal), N = 25 | ||||

All three | 2.1% | 80.9% | 100.0% | 100.0% |

t-test and K-S | 2.2% | 81.1% | 100.0% | 100.0% |

t-test and Wilcox | 4.0% | 91.5% | 100.0% | 100.0% |

Wilcox and K-S | 2.5% | 81.3% | 100.0% | 100.0% |

a (Normal), N = 50 | ||||

All three | 2.1% | 98.7% | 100.0% | 100.0% |

t-test and K-S | 2.2% | 98.8% | 100.0% | 100.0% |

t-test and Wilcox | 3.9% | 99.8% | 100.0% | 100.0% |

Wilcox and K-S | 2.5% | 98.8% | 100.0% | 100.0% |

a (Normal), N = 100 | ||||

All three | 2.0% | 100.0% | 100.0% | 100.0% |

t-test and K-S | 2.1% | 100.0% | 100.0% | 100.0% |

t-test and Wilcox | 4.1% | 100.0% | 100.0% | 100.0% |

Wilcox and K-S | 2.5% | 100.0% | 100.0% | 100.0% |

b (Poisson), N = 25 | ||||

All three | 0.7% | 82.3% | 100.0% | 100.0% |

t-test and K-S | 0.7% | 82.3% | 100.0% | 100.0% |

t-test and Wilcox | 4.0% | 97.4% | 100.0% | 100.0% |

Wilcox and K-S | 0.8% | 82.3% | 100.0% | 100.0% |

b (Poisson), N = 50 | ||||

All three | 0.7% | 99.3% | 100.0% | 100.0% |

t-test and K-S | 0.7% | 99.3% | 100.0% | 100.0% |

t-test and Wilcox | 3.9% | 100.0% | 100.0% | 100.0% |

Wilcox and K-S | 0.8% | 99.3% | 100.0% | 100.0% |

b (Poisson), N = 100 | ||||

All three | 0.6% | 100.0% | 100.0% | 100.0% |

t-test and K-S | 0.6% | 100.0% | 100.0% | 100.0% |

t-test and Wilcox | 3.9% | 100.0% | 100.0% | 100.0% |

Wilcox and K-S | 0.7% | 100.0% | 100.0% | 100.0% |

c (Gamma), N = 25 | ||||

All three | 1.9% | 94.7% | 100.0% | 100.0% |

t-test and K-S | 2.0% | 94.7% | 100.0% | 100.0% |

t-test and Wilcox | 3.4% | 97.9% | 100.0% | 100.0% |

Wilcox and K-S | 2.6% | 95.3% | 100.0% | 100.0% |

c (Gamma), N = 50 | ||||

All three | 1.9% | 100.0% | 100.0% | 100.0% |

t-test and K-S | 2.0% | 100.0% | 100.0% | 100.0% |

t-test and Wilcox | 3.3% | 100.0% | 100.0% | 100.0% |

Wilcox and K-S | 2.7% | 100.0% | 100.0% | 100.0% |

c (Gamma), N = 100 | ||||

All three | 1.7% | 100.0% | 100.0% | 100.0% |

t-test and K-S | 1.9% | 100.0% | 100.0% | 100.0% |

t-test and Wilcox | 3.3% | 100.0% | 100.0% | 100.0% |

Wilcox and K-S | 2.6% | 100.0% | 100.0% | 100.0% |

d (Beta), N = 25 | ||||

All three | 2.1% | 90.3% | 100.0% | 100.0% |

t-test and K-S | 2.2% | 90.4% | 100.0% | 100.0% |

t-test and Wilcox | 4.0% | 96.6% | 100.0% | 100.0% |

Wilcox and K-S | 2.6% | 90.5% | 100.0% | 100.0% |

d (Beta), N = 50 | ||||

All three | 2.2% | 99.8% | 100.0% | 100.0% |

t-test and K-S | 2.3% | 99.8% | 100.0% | 100.0% |

t-test and Wilcox | 4.1% | 100.0% | 100.0% | 100.0% |

Wilcox and K-S | 2.7% | 99.8% | 100.0% | 100.0% |

d (Beta), N = 100 | ||||

All three | 2.1% | 100.0% | 100.0% | 100.0% |

t-test and K-S | 2.2% | 100.0% | 100.0% | 100.0% |

t-test and Wilcox | 4.1% | 100.0% | 100.0% | 100.0% |

Wilcox and K-S | 2.5% | 100.0% | 100.0% | 100.0% |

Across all the non-normal scenarios (b, c, d) it looks like the t-test is okay to perform for non-normal data. Across all the scenarios, it has the same false positive rate (type I error) across different distributions, including skewed ones. The conclusions from the t-test and its most similar non-parametric alternative – the Wilcoxon-Mann-Whitney test – are in greatest agreement, relative to comparing conclusions from Kolmogorov-Smirnov Test. Almost consistently, t-test has the greatest power of the three across different sample sizes, with the small effect detected more often than when the other two tests were used with the same dataset. With a per-group sample size of \(N=100\) or greater, all three tests offer the same conclusion – so when working with larger samples just use a t-test because it’s easier to interpret or communicate and to execute.

You might have noticed a few blue links with “W”s on this page. Those are links to the Wikipedia articles on those topics and if you hover over them, you will see a preview of the article. This is possible with the ContextCards library, based on the Popups extension for MediaWiki.

Casella, George, and Roger L. Berger. 2002. *Statistical Inference*. Cengage Learning.

Dalgaard, Peter. 2008. *Introductory Statistics with R*. Statistics and Computing. Springer New York.

DeGroot, Morris H., and Mark J. Schervish. 2012. *Probability and Statistics*. Addison-Wesley. https://www.pearson.com/us/higher-education/product/De-Groot-Probability-and-Statistics-3rd-Edition/9780201524888.html.

Henry, Lionel, and Hadley Wickham. 2020. *Purrr: Functional Programming Tools*. https://purrr.tidyverse.org/.

Hester, Jim. 2020. *Glue: Interpreted String Literals*. https://glue.tidyverse.org/.

Iannone, Richard, Joe Cheng, and Barret Schloerke. 2020. *Gt: Easily Create Presentation-Ready Display Tables*. https://gt.rstudio.com/.

Kay, Matthew. 2020. *ggdist: Visualizations of Distributions and Uncertainty*. https://doi.org/10.5281/zenodo.3879620.

O’Hara-Wild, Mitchell, and Alex Hayes. 2020. *Distributional: Vectorised Probability Distributions*. https://pkg.mitchelloharawild.com/distributional/.

Pananos, Demetri. 2020. “Nothing Is Normal so Don’t Worry About the T Test.” https://dpananos.github.io/posts/2019/08/blog-post-23/.

———. n.d. “How Can I Check If Nominal and Ordinal Data Is Normally Distributed (for Z-Test of Proportions).” Cross Validated. https://stats.stackexchange.com/q/434921.

Pedersen, Thomas Lin. 2020. *Patchwork: The Composer of Plots*. https://patchwork.data-imaginist.com/.

R Core Team. 2020. *R: A Language and Environment for Statistical Computing*. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Vaughan, Davis, and Matt Dancho. 2018. *Furrr: Apply Mapping Functions in Parallel Using Futures*. https://davisvaughan.github.io/furrr/.

Wickham, Hadley. 2016. *Ggplot2: Elegant Graphics for Data Analysis*. Springer-Verlag New York. https://ggplot2.tidyverse.org.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2020. *Dplyr: A Grammar of Data Manipulation*. https://dplyr.tidyverse.org/.

Wickham, Hadley, and Lionel Henry. 2020. *Tidyr: Tidy Messy Data*. https://tidyr.tidyverse.org/.

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. Source code is available at https://github.com/bearloga/comparing-groups, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

For attribution, please cite this work as

Popov (2020, Sept. 21). Simulation study of statistical methods for comparing groups. Retrieved from https://people.wikimedia.org/~bearloga/notes/comparing-groups/

BibTeX citation

@misc{popov2020comparing, author = {Popov, Mikhail}, title = {Simulation study of statistical methods for comparing groups}, url = {https://people.wikimedia.org/~bearloga/notes/comparing-groups/}, year = {2020} }