Multiple groups comparison

Laboratory of Statistics and Mathematics 2025/2026

Giuseppe Alfonzetti

Case Study

A retail chain asks:

Do customers from different cities spend differently in our stores?

Different cities (Milan, Paris, Madrid, Berlin);
Total purchase amount by customer.

Goal:

Help the retail chain understanding in which city customers spend more on average.

ANOVA tests

Compares averages of 3+ groups.

Does average spending differ across cities?

Intuition: the ANOVA test compares

Within-city variation;
Between-city variation .

If between-city variation is “larger” than within-city variation, then group averages differ!

Statistical Hypoteses:

\(H_0\): all means are equal;
\(H_1\): at least one mean differs;

Case A

dfA

# A tibble: 160 × 2
   city  spend
   <chr> <dbl>
 1 Milan  38.8
 2 Milan  45.4
 3 Milan  81.2
 4 Milan  51.4
 5 Milan  52.6
 6 Milan  84.3
 7 Milan  59.2
 8 Milan  24.7
 9 Milan  36.3
10 Milan  41.1
# ℹ 150 more rows

dfA |> 
    group_by(city) |> 
    summarise(
        av_spend = mean(spend),
        sd_spend = sd(spend),
        n = n()
        )

# A tibble: 4 × 4
  city   av_spend sd_spend     n
  <chr>     <dbl>    <dbl> <int>
1 Berlin     50.7    27.1     40
2 Madrid     53.9    10.5     40
3 Milan      50.9    18.0     40
4 Paris      60.1     8.44    40

dfA |> 
    ggplot(aes(x = city, y = spend, fill = city))+
    geom_boxplot()+
    theme_bw() +
    theme(legend.position = "none")+
    labs(x="", y="Total expenditure by customer (€)")

anova_modelA <- aov(spend~city, data=dfA)
summary(anova_modelA)

             Df Sum Sq Mean Sq F value Pr(>F)  
city          3   2298   765.9    2.48 0.0632 .
Residuals   156  48183   308.9                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Case B

dfB

# A tibble: 160 × 2
   city  spend
   <chr> <dbl>
 1 Milan  48.3
 2 Milan  49.3
 3 Milan  54.7
 4 Milan  50.2
 5 Milan  50.4
 6 Milan  55.2
 7 Milan  51.4
 8 Milan  46.2
 9 Milan  47.9
10 Milan  48.7
# ℹ 150 more rows

dfB |> 
    group_by(city) |> 
    summarise(
        av_spend = mean(spend),
        sd_spend = sd(spend),
        n = n()
        )

# A tibble: 4 × 4
  city   av_spend sd_spend     n
  <chr>     <dbl>    <dbl> <int>
1 Berlin     50.0     3.84    40
2 Madrid     54.5     5.24    40
3 Milan      50.1     2.69    40
4 Paris      60.0     4.22    40

dfB |> 
    ggplot(aes(x = city, y = spend, fill = city))+
    geom_boxplot()+
    theme_bw() +
    theme(legend.position = "none")+
    labs(x="", y="Total expenditure by customer (€)")

anova_modelB <- aov(spend~city, data=dfB)
summary(anova_modelB)

             Df Sum Sq Mean Sq F value Pr(>F)    
city          3   2695   898.3   53.44 <2e-16 ***
Residuals   156   2622    16.8                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Which city differs?

TukeyHSD(anova_modelA)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = spend ~ city, data = dfA)

$city
                  diff         lwr       upr     p adj
Madrid-Berlin  3.24975  -6.9557295 13.455229 0.8416206
Milan-Berlin   0.21175  -9.9937295 10.417229 0.9999436
Paris-Berlin   9.38625  -0.8192295 19.591729 0.0835101
Milan-Madrid  -3.03800 -13.2434795  7.167479 0.8665177
Paris-Madrid   6.13650  -4.0689795 16.341979 0.4036555
Paris-Milan    9.17450  -1.0309795 19.379979 0.0946626

TukeyHSD(anova_modelB)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = spend ~ city, data = dfB)

$city
                 diff       lwr       upr     p adj
Madrid-Berlin  4.4980  2.117168  6.878832 0.0000137
Milan-Berlin   0.1615 -2.219332  2.542332 0.9980504
Paris-Berlin  10.0660  7.685168 12.446832 0.0000000
Milan-Madrid  -4.3365 -6.717332 -1.955668 0.0000294
Paris-Madrid   5.5680  3.187168  7.948832 0.0000001
Paris-Milan    9.9045  7.523668 12.285332 0.0000000

A special case of `lm()`

summary(anova_modelA)

             Df Sum Sq Mean Sq F value Pr(>F)  
city          3   2298   765.9    2.48 0.0632 .
Residuals   156  48183   308.9                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

lm_modelA <- lm(spend ~ city, data = dfA)
summary(lm_modelA)


Call:
lm(formula = spend ~ city, data = dfA)

Residuals:
    Min      1Q  Median      3Q     Max 
-40.692  -9.652  -1.070   8.530  64.378 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  50.6922     2.7788  18.243   <2e-16 ***
cityMadrid    3.2498     3.9298   0.827   0.4095    
cityMilan     0.2118     3.9298   0.054   0.9571    
cityParis     9.3862     3.9298   2.388   0.0181 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.57 on 156 degrees of freedom
Multiple R-squared:  0.04552,   Adjusted R-squared:  0.02716 
F-statistic:  2.48 on 3 and 156 DF,  p-value: 0.0632

anova(lm_modelA)

Analysis of Variance Table

Response: spend
           Df Sum Sq Mean Sq F value Pr(>F)  
city        3   2298  765.88  2.4796 0.0632 .
Residuals 156  48183  308.87                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1