Concept of ANOVA and Its Sample Size Calculation Formula

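We have $k$ groups, each with $n$ observations: $Y_{ij}$, $i = 1, \dots, k$, $j = 1, \dots, n$.

Total sample size: $N = kn$. Overall mean: $\bar{Y}_{..} = \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n} Y_{ij}$. Group means: $\bar{Y}_{i.} = \frac{1}{n}\sum_{j=1}^{n} Y_{ij}$.

ANOVA decomposition

The total sum of squares:

$TSS = \sum_{i=1}^{k}\sum_{j=1}^{n} (Y_{ij} - \bar{Y}_{..})^2$

and this can be decomposed into:

$TSS = SST + SSE$

where $SST = n\sum_{i=1}^{k} (\bar{Y}_{i.} - \bar{Y}_{..})^2$ is known as the sum of squares due to treatments,

and $SSE = \sum_{i=1}^{k}\sum_{j=1}^{n} (Y_{ij} - \bar{Y}_{i.})^2$ is known as the sum of squares due to random errors.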
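As a quick check of the decomposition, the sketch below computes the three sums of squares on a small made-up data set and confirms that the total equals the between-group plus within-group parts. The values and the names y and g are illustrative assumptions, not data from this article.

```r
# Toy data: k = 3 groups, n = 4 observations each (illustrative values only)
y <- c(5, 7, 6, 8,   9, 11, 10, 12,   4, 6, 5, 7)
g <- rep(1:3, each = 4)   # group label for each observation
n <- 4

grand_mean  <- mean(y)             # overall mean
group_means <- tapply(y, g, mean)  # one mean per group

TSS <- sum((y - grand_mean)^2)                # total sum of squares
SST <- sum(n * (group_means - grand_mean)^2)  # between groups (treatments)
SSE <- sum((y - group_means[g])^2)            # within groups (random errors)

c(TSS = TSS, SST = SST, SSE = SSE)  # 71, 56, 15
all.equal(TSS, SST + SSE)           # TRUE: the decomposition holds
```

The same identity holds for any balanced data set; only the split between SST and SSE changes with how far apart the group means are.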

Distribution of Sum of Squares

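Under Null Hypothesis: $H_0: \mu_1 = \mu_2 = \dots = \mu_k = \mu$ (say, a constant)

$SST/\sigma^2 \sim \chi^2_{k-1}$ is just random noise (central chi-square), where $\sigma^2$ is the common within-group variance.

Under Alternative Hypothesis: $H_1$: not all $\mu_i$ are equal

$SST$ has a mean shift due to the true group effect.

This gives a non-central chi-square, $SST/\sigma^2 \sim \chi^2_{k-1}(\lambda)$, where $\lambda$ is the non-centrality parameter.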

Deriving the non-centrality parameter

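By definition, for non-central chi-square: $\lambda = \sum_{i} \frac{(\text{mean shift}_i)^2}{\text{variance}_i}$

Here, the mean shift for group $i$ is $\mu_i - \bar{\mu}$, with $\bar{\mu} = \frac{1}{k}\sum_{i=1}^{k}\mu_i$. Each group mean is based on $n$ observations → variance of $\bar{Y}_{i.}$ is $\sigma^2/n$. So, the contribution to $\lambda$ for each group is: $\frac{(\mu_i - \bar{\mu})^2}{\sigma^2/n} = \frac{n(\mu_i - \bar{\mu})^2}{\sigma^2}$

Summing over all groups: $\lambda = \frac{n\sum_{i=1}^{k}(\mu_i - \bar{\mu})^2}{\sigma^2}$. This is exactly the non-centrality parameter for one-way ANOVA.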
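As a worked illustration, take the values used in the R simulation further below: three groups with means 10, 12 and 15, a common standard deviation of 4, and 20 observations per group. The arithmetic is rounded to two decimals.

$$
\begin{aligned}
\bar{\mu} &= \frac{10 + 12 + 15}{3} \approx 12.33 \\
\sum_{i=1}^{3} (\mu_i - \bar{\mu})^2 &\approx (-2.33)^2 + (-0.33)^2 + (2.67)^2 \approx 12.67 \\
\lambda &= \frac{n \sum_{i}(\mu_i - \bar{\mu})^2}{\sigma^2} \approx \frac{20 \times 12.67}{16} \approx 15.8
\end{aligned}
$$

Doubling the per-group sample size doubles the non-centrality parameter, which is why larger studies separate the null and alternative distributions more clearly.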

F-statistic distribution

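Under $H_0$: $F = \frac{SST/(k-1)}{SSE/(N-k)} \sim F_{k-1,\,N-k}$ (central), with $N = kn$ total observations

Under $H_1$: $F \sim F_{k-1,\,N-k}(\lambda)$ (non-central)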

Rejection criteria for Null hypothesis

Let the level of significance be $\alpha = 0.05$:

(1) Cut-off method: $F > F_{1-\alpha;\,k-1,\,N-k}$ : Reject the Null hypothesis

(2) P-value approach: $p$-value $< 0.05$ : Reject the Null hypothesis
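The two criteria always lead to the same decision, because the p-value falls below the significance level exactly when the statistic exceeds the cut-off. A minimal sketch, assuming k = 3 groups and N = 60 observations and using a hypothetical observed statistic (F_obs is an invented value for illustration):

```r
alpha <- 0.05
df1 <- 2     # k - 1, assuming k = 3 groups
df2 <- 57    # N - k, assuming N = 60 observations
F_obs <- 4.2 # hypothetical observed F statistic

crit <- qf(1 - alpha, df1, df2)                  # cut-off from the central F
pval <- pf(F_obs, df1, df2, lower.tail = FALSE)  # upper-tail probability

reject_by_cutoff <- F_obs > crit
reject_by_pvalue <- pval < alpha

c(crit = crit, pval = pval)
c(cutoff = reject_by_cutoff, pvalue = reject_by_pvalue)  # both agree
```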

Sample size calculation formula:

For statistical power and sample size calculations, the non-central F distribution is applied with Cohen's $f$ effect size. This approach is used because the goal is to determine the sample size needed to detect meaningful differences in treatment effects on the outcome.

Example: One-Way ANOVA and the Logic of Hypothesis Testing

Imagine a study with three different treatment groups. Each group receives its own intervention, and we measure a continuous outcome such as blood pressure, weight loss, or test scores. The central question is whether the treatments truly differ in their effects on average, or whether any observed differences could simply be explained by random variation.

To explore this, consider a simulation where the population means are set to $\mu_1 = 10$, $\mu_2 = 12$, $\mu_3 = 15$, with a common standard deviation of $\sigma = 4$. From these populations, we draw 60 observations in total (20 per group). The resulting sample means and overall mean are computed as group_means and grand_mean in the R code; the sample distributions are shown in Figure 1.

The hypotheses are:

Null hypothesis: $H_0: \mu_1 = \mu_2 = \mu_3$
Alternative hypothesis: $H_1$: Not all $\mu_i$ are equal.

Applying one-way ANOVA gives an observed F-statistic of 9.38. The question then becomes: is 9.38 large enough to reject the null? Under $H_0$, the F-statistic follows a central F distribution with $(2, 57)$ degrees of freedom; this corresponds to the black curve in Figure 2. Under $H_1$, the F-statistic instead follows a noncentral F distribution: the red curve in Figure 2.

At the 5% significance level, the critical value is about 3.16. Since our observed statistic F = 9.38 is far to the right of this cutoff, we reject the null hypothesis and conclude that not all treatments have the same mean effect. The p-value formalizes this:

$p = P(F_{2,57} \ge 9.38 \mid H_0)$

which here is roughly 0.0003. Such a tiny probability indicates it would be extremely unlikely to see an F-statistic this large if the group means were truly equal. Importantly:

The p-value does not describe the treatment groups' sample distributions.
It refers to the sampling distribution of the test statistic under the null.
In other words, it is a probability statement about F, not about the raw data themselves.
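The points above can be made concrete by computing the p-value in two ways: exactly, from the central F distribution, and by drawing a large number of F statistics as they would arise if the null were true. This is a sketch only; the seed and the number of draws are arbitrary choices, and the observed value 9.38 and degrees of freedom (2, 57) are taken from the example.

```r
set.seed(2024)  # arbitrary seed, for reproducibility only
F_obs <- 9.38   # observed F statistic from the example
df1 <- 2        # k - 1
df2 <- 57       # N - k

# Exact upper-tail probability under the central F distribution
p_exact <- pf(F_obs, df1, df2, lower.tail = FALSE)

# Monte Carlo version: F statistics drawn from the null sampling distribution
null_F <- rf(1e6, df1, df2)
p_sim  <- mean(null_F >= F_obs)  # share of null draws at least as extreme

c(exact = p_exact, simulated = p_sim)  # both roughly 0.0003
```

Neither calculation touches the raw observations: both work only with the distribution of the test statistic under the null.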

Figure 3 illustrates this idea in the same spirit as Figure 2.

Because we simulated with genuine mean differences, we can also discuss power: the probability of rejecting $H_0$ when it is false. In this case, the theoretical noncentrality parameter is λ ≈ 15.8, which shifts the F distribution to the right. Simulation shows that with 20 subjects per group and these mean differences, the test achieves about 94% power at the 5% significance level.
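The simulated power can be compared with the value implied directly by the non-central F distribution. The sketch below reuses the design values of the example (3 groups, 20 per group, standard deviation 4, means 10, 12, 15) and assumes equal group sizes; it should land close to the simulated figure.

```r
k <- 3
n <- 20
sigma <- 4
mu <- c(10, 12, 15)
alpha <- 0.05

df_between <- k - 1
df_within  <- k * (n - 1)

# Non-centrality parameter: n * sum of squared mean deviations / variance
lambda <- n * sum((mu - mean(mu))^2) / sigma^2  # about 15.8

# Power = P(non-central F exceeds the central-F critical value)
crit  <- qf(1 - alpha, df_between, df_within)
power <- pf(crit, df_between, df_within, ncp = lambda, lower.tail = FALSE)

c(lambda = lambda, crit = crit, power = power)
```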

By contrast, if we repeat the simulation assuming no true treatment differences in the population (say, all means equal to 10), the resulting test statistic is much smaller: F = 0.25, well below the critical value of 3.16. In this case, we fail to reject the null hypothesis of equal means. Figures 4 to 6 visualize the sample distributions of the groups, the sampling distributions of the F-statistic under both $H_0$ and $H_1$, and the theoretical population-level projections of those distributions.

Sample Size Calculation for One-Way ANOVA

Determining the required sample size is an important part of designing an ANOVA study. The goal is to ensure that the test has enough power to detect meaningful differences between group means if such differences truly exist. Two practical approaches are commonly used: a direct method based on pilot means and standard deviations, and a statistically rigorous method using the non-central F distribution.

Approach 1: Direct Formula Using Group Means and Standard Deviations

When pilot or published data already provide estimates of the group means $\mu_1, \dots, \mu_k$ and a common standard deviation $\sigma$, the sample size can be estimated using a simplified normal-approximation formula. This approach directly uses the magnitude of real differences among the expected means.

First compute the deviation of each group mean from the overall mean: $\delta_i = \mu_i - \bar{\mu}$, where $\bar{\mu} = \frac{1}{k}\sum_{i=1}^{k}\mu_i$.

Then the per-group sample size can be approximated as:

This formula increases the required sample size when:

within-group variability is large,
expected differences among group means are small,
or higher power is desired.

This method is intuitive, easy to apply, and produces a data-driven estimate. It is best used when realistic pilot means and standard deviations are available.

Approach 2: Using the Non-Central F Distribution (Statistically Rigorous Method)

ANOVA decisions are based on the F-statistic, which under the alternative hypothesis follows a non-central F distribution with non-centrality parameter $\lambda$. Using this parameter, the power of the ANOVA test is:

$\text{Power} = P\left(F_{k-1,\,N-k}(\lambda) > F_{1-\alpha;\,k-1,\,N-k}\right)$

The sample size is chosen such that this probability reaches the target power (e.g., 80% or 90%). Although this must be solved numerically, it is exact under the usual ANOVA assumptions (normal errors, equal variances) and directly matches the theory used in ANOVA.
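The numerical solution can be as simple as increasing the per-group sample size until the power reaches the target. The function below is a minimal sketch under the assumption of equal group sizes; anova_n and its arguments are hypothetical names introduced here, not part of any package.

```r
# Smallest per-group n whose non-central F power reaches the target.
# Hypothetical helper: assumes equal group sizes and a common sd.
anova_n <- function(mu, sigma, alpha = 0.05, target = 0.80, n_max = 1000) {
  k <- length(mu)
  for (n in 2:n_max) {
    df1 <- k - 1
    df2 <- k * (n - 1)
    lambda <- n * sum((mu - mean(mu))^2) / sigma^2  # non-centrality
    crit   <- qf(1 - alpha, df1, df2)               # central-F cut-off
    power  <- pf(crit, df1, df2, ncp = lambda, lower.tail = FALSE)
    if (power >= target) {
      return(c(n_per_group = n, power = power))
    }
  }
  stop("target power not reached within n_max")
}

# Means and sd from the simulation example
res80 <- anova_n(mu = c(10, 12, 15), sigma = 4, target = 0.80)
res90 <- anova_n(mu = c(10, 12, 15), sigma = 4, target = 0.90)
res80
res90
```

Base R's power.anova.test() solves the same kind of problem from a between-group and a within-group variance and can serve as a cross-check.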

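An equivalent and widely used re-expression uses the standardized effect size known as Cohen's $f$:

$f = \frac{\sigma_m}{\sigma}$; $\lambda = N f^2$

where $\sigma_m = \sqrt{\frac{1}{k}\sum_{i=1}^{k}(\mu_i - \bar{\mu})^2}$ is the standard deviation of the group means and $N = kn$ is the total sample size.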

This method is preferred when:

no pilot means are available,
effect sizes are chosen based on prior literature,
or the researcher wants a rigorous, theory-driven estimate.
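When no pilot means are available, the calculation can start from an effect size alone. The sketch below takes Cohen's f as the standard deviation of the group means divided by the common standard deviation (the usual definition, treated here as an assumption), uses the non-centrality parameter N times f squared, and searches for the per-group sample size at Cohen's conventional benchmarks of 0.10, 0.25 and 0.40. The name n_from_f is hypothetical.

```r
# Per-group n for a given Cohen's f, assuming equal group sizes.
# Hypothetical helper; lambda = N * f^2 with N = k * n.
n_from_f <- function(f, k, alpha = 0.05, target = 0.80, n_max = 5000) {
  for (n in 2:n_max) {
    N <- k * n
    crit  <- qf(1 - alpha, k - 1, N - k)
    power <- pf(crit, k - 1, N - k, ncp = N * f^2, lower.tail = FALSE)
    if (power >= target) return(n)
  }
  NA_integer_  # target not reached within n_max
}

# Cohen's conventional small, medium and large effects, k = 3 groups
ns <- sapply(c(small = 0.10, medium = 0.25, large = 0.40), n_from_f, k = 3)
ns

# Link back to the means-based form, using the simulation example
mu <- c(10, 12, 15)
sigma <- 4
f_example <- sqrt(mean((mu - mean(mu))^2)) / sigma  # about 0.51
lambda_example <- 60 * f_example^2                  # N = 60; about 15.8
```

The last two lines show that the effect-size form reproduces the same non-centrality parameter as the means-based form for the example design.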

For more details on sample size calculation, visit this sample size calculation tool: StudySizer.streamlit.app

R Code

# ------------------------------
# Simulation Example: 1
# ------------------------------

##-------- Simulate sample ---------##
set.seed(123)

# Parameters
k <- 3               # number of groups
n <- 20              # sample size per group
sigma <- 4           # common sd
mu <- c(10, 12, 15)  # true means for groups (H1: not all equal)

# Generate data
group <- rep(1:k, each = n)
values <- c(rnorm(n, mean = mu[1], sd = sigma),
            rnorm(n, mean = mu[2], sd = sigma),
            rnorm(n, mean = mu[3], sd = sigma))

## Data
data <- data.frame(group = factor(group), y = values)

##-------- Run ANOVA ---------##
anova_model <- aov(y ~ group, data = data)
summary(anova_model)

##-------- Manual decomposition ---------##
grand_mean <- mean(values)
group_means <- tapply(values, group, mean)

# SST (between-group)
SST <- sum(n * (group_means - grand_mean)^2)

# SSE (within-group)
SSE <- sum((values - rep(group_means, each = n))^2)

# TSS
TSS <- SST + SSE

# F statistic
df_between <- k - 1
df_within <- k * (n - 1)
Fstat <- (SST / df_between) / (SSE / df_within)

# Theoretical noncentrality λ
lambda <- n * sum((mu - mean(mu))^2) / sigma^2

# Compare with observed F
crit <- qf(0.95, df_between, df_within)  # critical F at 5% level

# P-value
pval <- pf(Fstat, df_between, df_within, lower.tail = FALSE)

# Simulation for empirical power
nsim <- 5000
rejects <- replicate(nsim, {
  # Draw a fresh data set from the same populations
  values <- c(rnorm(n, mean = mu[1], sd = sigma),
              rnorm(n, mean = mu[2], sd = sigma),
              rnorm(n, mean = mu[3], sd = sigma))
  group <- rep(1:k, each = n)
  model <- aov(values ~ factor(group))
  pval <- summary(model)[[1]]$`Pr(>F)`[1]
  pval < 0.05  # TRUE when H0 is rejected at the 5% level
})
mean(rejects)  # estimated power

##---------- Visualization -----------##

##-- Kernel Density Estimates of the Sample Data from the 3 Treatment Groups.
plot(density(data$y[group == 1]), lty = 1, ylim = c(0, 0.12), xlim = c(0, 28),
     main = "Figure 1: Sample distribution of 3 Treatment groups")
lines(density(data$y[group == 2]), lty = 2, col = "red")
lines(density(data$y[group == 3]), lty = 3, col = "blue")
legend("topright", legend = c("Treatment 1", "Treatment 2", "Treatment 3"),
       col = c("black", "red", "blue"), lty = c(1, 2, 3))
legend("topleft", legend = c("mu1 = 10", "mu2 = 12", "mu3 = 15"),
       col = c("black", "red", "blue"), lty = c(1, 2, 3))

# Sampling distribution of the test statistic under the null and alternative hypotheses
plot(density(rf(1000, df_between, df_within)), xlim = c(0, 10),
     main = "Figure 2: Sampling distribution of the test statistic under the Null and Alternative Hypotheses")
lines(density(rf(1000, df_between, df_within, ncp = lambda)), lty = 2, col = "red")
abline(v = Fstat, col = "blue", lwd = 2)               # observed F
abline(v = crit, col = "darkgreen", lwd = 2, lty = 3)  # critical value at the 5% level
# One legend for all four plot elements
legend("topright", legend = c("Central F", "Noncentral F", "Observed F", "Critical Value"),
       col = c("black", "red", "blue", "darkgreen"),
       lty = c(1, 2, 1, 3), lwd = c(1, 1, 2, 2))

# Population Projection of the test statistic at Null and Alternative Hypothesis
curve(df(x, df_between, df_within), from = 0, to = 10, col = "black",
      ylab = "Density", xlab = "F",
      main = "Figure 3: Population Projection of the test statistic at Null and Alternative Hypothesis")
curve(df(x, df_between, df_within, ncp = lambda), from = 0, to = 10,
      col = "red", add = TRUE, lty = 2)
abline(v = Fstat, col = "blue", lwd = 2)
abline(v = crit, col = "darkgreen", lwd = 2, lty = 3)
legend("topright", legend = c("Central F", "Noncentral F", "Observed Value", "Critical Value"),
       col = c("black", "red", "blue", "darkgreen"),
       lty = c(1, 2, 1, 3), lwd = c(1, 1, 2, 2))

# ------------------------------
# Simulation Example: 2
# ------------------------------

##-------- Simulate sample ---------##
set.seed(123)

# Parameters
k <- 3               # number of groups
n <- 20              # sample size per group
sigma <- 4           # common sd
mu <- c(10, 10, 10)  # true means for groups (H0: all equal)

# Generate data
group <- rep(1:k, each = n)
values <- c(rnorm(n, mean = mu[1], sd = sigma),
            rnorm(n, mean = mu[2], sd = sigma),
            rnorm(n, mean = mu[3], sd = sigma))

## Data
data <- data.frame(group = factor(group), y = values)

##-------- Run ANOVA ---------##
anova_model <- aov(y ~ group, data = data)
summary(anova_model)

##-------- Manual decomposition ---------##
grand_mean <- mean(values)
group_means <- tapply(values, group, mean)

# SST (between-group)
SST <- sum(n * (group_means - grand_mean)^2)

# SSE (within-group)
SSE <- sum((values - rep(group_means, each = n))^2)

# TSS
TSS <- SST + SSE

# F statistic
df_between <- k - 1
df_within <- k * (n - 1)
Fstat <- (SST / df_between) / (SSE / df_within)

# Theoretical noncentrality λ (equals 0 here because all means are equal)
lambda <- n * sum((mu - mean(mu))^2) / sigma^2

# Compare with observed F
crit <- qf(0.95, df_between, df_within)  # critical F at 5% level

# P-value
pval <- pf(Fstat, df_between, df_within, lower.tail = FALSE)

# Simulation for the empirical rejection rate
nsim <- 5000
rejects <- replicate(nsim, {
  # Draw a fresh data set from the same populations
  values <- c(rnorm(n, mean = mu[1], sd = sigma),
              rnorm(n, mean = mu[2], sd = sigma),
              rnorm(n, mean = mu[3], sd = sigma))
  group <- rep(1:k, each = n)
  model <- aov(values ~ factor(group))
  pval <- summary(model)[[1]]$`Pr(>F)`[1]
  pval < 0.05  # TRUE when H0 is rejected at the 5% level
})
mean(rejects)  # H0 is true here, so this estimates the type I error rate, not power

##---------- Visualization -----------##

##-- Kernel Density Estimates of the Sample Data from the 3 Treatment Groups.
plot(density(data$y[group == 1]), lty = 1, ylim = c(0, 0.12), xlim = c(0, 25),
     main = "Figure 4: Sample distribution of 3 Treatment groups")
lines(density(data$y[group == 2]), lty = 2, col = "red")
lines(density(data$y[group == 3]), lty = 3, col = "blue")
legend("topright", legend = c("Treatment 1", "Treatment 2", "Treatment 3"),
       col = c("black", "red", "blue"), lty = c(1, 2, 3))
legend("topleft", legend = c("mu1 = 10", "mu2 = 10", "mu3 = 10"),
       col = c("black", "red", "blue"), lty = c(1, 2, 3))

# Sampling distribution of the test statistic under the null and alternative hypotheses
# (lambda is 0 in this example, so both curves come from the same central F distribution)
plot(density(rf(1000, df_between, df_within)), xlim = c(0, 10),
     main = "Figure 5: Sampling distribution of the test statistic under the Null and Alternative Hypotheses")
lines(density(rf(1000, df_between, df_within, ncp = lambda)), lty = 2, col = "red")
abline(v = Fstat, col = "blue", lwd = 2)               # observed F
abline(v = crit, col = "darkgreen", lwd = 2, lty = 3)  # critical value at the 5% level
# One legend for all four plot elements
legend("topright", legend = c("Central F", "Noncentral F", "Observed F", "Critical Value"),
       col = c("black", "red", "blue", "darkgreen"),
       lty = c(1, 2, 1, 3), lwd = c(1, 1, 2, 2))

# Population Projection of the test statistic at Null and Alternative Hypothesis
# (with lambda = 0 the red curve coincides with the black one)
curve(df(x, df_between, df_within), from = 0, to = 10, col = "black",
      ylab = "Density", xlab = "F",
      main = "Figure 6: Population Projection of the test statistic at Null and Alternative Hypothesis")
curve(df(x, df_between, df_within, ncp = lambda), from = 0, to = 10,
      col = "red", add = TRUE, lty = 2)
abline(v = Fstat, col = "blue", lwd = 2)
abline(v = crit, col = "darkgreen", lwd = 2, lty = 3)
legend("topright", legend = c("Central F", "Noncentral F", "Observed Value", "Critical Value"),
       col = c("black", "red", "blue", "darkgreen"),
       lty = c(1, 2, 1, 3), lwd = c(1, 1, 2, 2))