Data manipulation

Laboratory of Statistics and Mathematics 2025/2026

Giuseppe Alfonzetti

Working with multiple data frames

Case study

Data: Information about 10 women in science who changed the world

# A tibble: 10 × 1
   name              
   <chr>             
 1 Ada Lovelace      
 2 Marie Curie       
 3 Janaki Ammal      
 4 Chien-Shiung Wu   
 5 Katherine Johnson 
 6 Rosalind Franklin 
 7 Vera Rubin        
 8 Gladys West       
 9 Flossie Wong-Staal
10 Jennifer Doudna

Information spread across three different datasets

Professions
Dates
Works

professions

# A tibble: 10 × 2
   name               profession                        
   <chr>              <chr>                             
 1 Ada Lovelace       Mathematician                     
 2 Marie Curie        Physicist and Chemist             
 3 Janaki Ammal       Botanist                          
 4 Chien-Shiung Wu    Physicist                         
 5 Katherine Johnson  Mathematician                     
 6 Rosalind Franklin  Chemist                           
 7 Vera Rubin         Astronomer                        
 8 Gladys West        Mathematician                     
 9 Flossie Wong-Staal Virologist and Molecular Biologist
10 Jennifer Doudna    Biochemist

dates

# A tibble: 8 × 3
  name               birth_year death_year
  <chr>                   <dbl>      <dbl>
1 Janaki Ammal             1897       1984
2 Chien-Shiung Wu          1912       1997
3 Katherine Johnson        1918       2020
4 Rosalind Franklin        1920       1958
5 Vera Rubin               1928       2016
6 Gladys West              1930         NA
7 Flossie Wong-Staal       1947         NA
8 Jennifer Doudna          1964         NA

works

# A tibble: 9 × 2
  name               known_for                                                  
  <chr>              <chr>                                                      
1 Ada Lovelace       first computer algorithm                                   
2 Marie Curie        theory of radioactivity,  discovery of elements polonium a…
3 Janaki Ammal       hybrid species, biodiversity protection                    
4 Chien-Shiung Wu    confim and refine theory of radioactive beta decy, Wu expe…
5 Katherine Johnson  calculations of orbital mechanics critical to sending the …
6 Vera Rubin         existence of dark matter                                   
7 Gladys West        mathematical modeling of the shape of the Earth which serv…
8 Flossie Wong-Staal first scientist to clone HIV and create a map of its genes…
9 Jennifer Doudna    one of the primary developers of CRISPR, a ground-breaking…

Desired output

# A tibble: 10 × 5
   name               profession                 birth_year death_year known_for
   <chr>              <chr>                           <dbl>      <dbl> <chr>    
 1 Ada Lovelace       Mathematician                      NA         NA first co…
 2 Marie Curie        Physicist and Chemist              NA         NA theory o…
 3 Janaki Ammal       Botanist                         1897       1984 hybrid s…
 4 Chien-Shiung Wu    Physicist                        1912       1997 confim a…
 5 Katherine Johnson  Mathematician                    1918       2020 calculat…
 6 Rosalind Franklin  Chemist                          1920       1958 <NA>     
 7 Vera Rubin         Astronomer                       1928       2016 existenc…
 8 Gladys West        Mathematician                    1930         NA mathemat…
 9 Flossie Wong-Staal Virologist and Molecular …       1947         NA first sc…
10 Jennifer Doudna    Biochemist                       1964         NA one of t…

Joining data frames

left_join(): all rows from x, with the matching columns from y (NA where there is no match)
right_join(): all rows from y, with the matching columns from x (NA where there is no match)
full_join(): all rows from both x and y
inner_join(): only the rows from x that have a match in y; if a row matches several rows of y, every combination is returned
semi_join(): only the rows from x that have a match in y, keeping just the columns from x and never duplicating rows of x
anti_join(): only the rows from x that have no match in y, keeping just the columns from x
…
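
The difference between inner_join(), semi_join() and anti_join() only shows up when a key matches more than once. The following is a minimal sketch, assuming two made-up tables `a` and `b` (not part of the lecture data) in which id 1 appears twice in `b`:

```r
library(dplyr)
library(tibble)

# Hypothetical tables for illustration only: id 1 appears twice in b
a <- tibble(id = c(1, 2, 3), value_a = c("a1", "a2", "a3"))
b <- tibble(id = c(1, 1, 2), value_b = c("b1", "b1bis", "b2"))

inner <- inner_join(a, b, by = "id")  # one row per matching pair
semi  <- semi_join(a, b, by = "id")   # rows of a with a match, each kept once
anti  <- anti_join(a, b, by = "id")   # rows of a without any match

nrow(inner)  # 3: id 1 is returned twice, id 2 once
nrow(semi)   # 2: id 1 and id 2, columns of a only
anti$id      # 3
```

semi_join() and anti_join() act as filters on x, which is why they never add columns or rows.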

Example

# A tibble: 3 × 2
     id value_x
  <dbl> <chr>  
1     1 x1     
2     2 x2     
3     3 x3

# A tibble: 3 × 2
     id value_y
  <dbl> <chr>  
1     1 y1     
2     2 y2     
3     4 y4
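
To run the examples that follow, the two tables can be rebuilt by hand from the printed output. The slides do not show how x and y were created, so the use of tribble() here is an assumption:

```r
library(tibble)

# Example tables, typed in from the printed tibbles above
x <- tribble(
  ~id, ~value_x,
    1, "x1",
    2, "x2",
    3, "x3"
)

y <- tribble(
  ~id, ~value_y,
    1, "y1",
    2, "y2",
    4, "y4"
)
```

id 3 exists only in x and id 4 only in y, which is what makes the join types differ.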

`left_join()`

x |>
    left_join(y)  # no `by`: joins on every column the two tables share (here, id)

# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>

x |>
    left_join(y, by = c("id" = "id"))  # key stated explicitly: x$id matched with y$id

# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>
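
The named form of `by` becomes necessary when the key has a different name in each table. A minimal sketch, assuming a hypothetical variant `y2` of the second table whose key column is called `key`:

```r
library(dplyr)
library(tibble)

x <- tibble(id = c(1, 2, 3), value_x = c("x1", "x2", "x3"))
# Hypothetical variant of y: same content, key column renamed to `key`
y2 <- tibble(key = c(1, 2, 4), value_y = c("y1", "y2", "y4"))

# Left-hand name refers to x, right-hand name to y2
res_named <- x |>
  left_join(y2, by = c("id" = "key"))

# Equivalent spelling with join_by(), available in dplyr 1.1.0 and later
res_join_by <- x |>
  left_join(y2, by = join_by(id == key))
```

In both cases the output keeps the key under the name used in x, so the columns are id, value_x and value_y.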

`right_join()`

x |> 
    right_join(y)

# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     4 <NA>    y4

x |> 
    right_join(y, by = c("id" = "id"))

# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     4 <NA>    y4

`full_join()`

x |> 
    full_join(y)

# A tibble: 4 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>   
4     4 <NA>    y4

x |> 
    full_join(y, by = c("id" = "id"))

# A tibble: 4 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>   
4     4 <NA>    y4

`inner_join()`

x |> 
    inner_join(y)

# A tibble: 2 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2

x |> 
    inner_join(y, by = c("id" = "id"))

# A tibble: 2 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2

`semi_join()`

x |> 
    semi_join(y)

# A tibble: 2 × 2
     id value_x
  <dbl> <chr>  
1     1 x1     
2     2 x2

x |> 
    semi_join(y, by = c("id" = "id"))

# A tibble: 2 × 2
     id value_x
  <dbl> <chr>  
1     1 x1     
2     2 x2

`anti_join()`

x |> 
    anti_join(y)

# A tibble: 1 × 2
     id value_x
  <dbl> <chr>  
1     3 x3

x |> 
    anti_join(y, by = c("id" = "id"))

# A tibble: 1 × 2
     id value_x
  <dbl> <chr>  
1     3 x3

Going back to the data

professions |> 
    left_join(dates) |>  # join by name
    left_join(works)     # join by name

# A tibble: 10 × 5
   name               profession                 birth_year death_year known_for
   <chr>              <chr>                           <dbl>      <dbl> <chr>    
 1 Ada Lovelace       Mathematician                      NA         NA first co…
 2 Marie Curie        Physicist and Chemist              NA         NA theory o…
 3 Janaki Ammal       Botanist                         1897       1984 hybrid s…
 4 Chien-Shiung Wu    Physicist                        1912       1997 confim a…
 5 Katherine Johnson  Mathematician                    1918       2020 calculat…
 6 Rosalind Franklin  Chemist                          1920       1958 <NA>     
 7 Vera Rubin         Astronomer                       1928       2016 existenc…
 8 Gladys West        Mathematician                    1930         NA mathemat…
 9 Flossie Wong-Staal Virologist and Molecular …       1947         NA first sc…
10 Jennifer Doudna    Biochemist                       1964         NA one of t…
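
The NA entries in the joined table come from scientists who are absent from dates or works. One way to list them is anti_join(). The sketch below uses reduced stand-ins for the three tables that keep only the name column; the object names ending in `_names` are made up for this illustration:

```r
library(dplyr)
library(tibble)

scientists <- c(
  "Ada Lovelace", "Marie Curie", "Janaki Ammal", "Chien-Shiung Wu",
  "Katherine Johnson", "Rosalind Franklin", "Vera Rubin", "Gladys West",
  "Flossie Wong-Staal", "Jennifer Doudna"
)

# Reduced stand-ins for the real tables: only the key column is kept
professions_names <- tibble(name = scientists)
dates_names <- tibble(name = scientists[3:10])  # the 8 names printed in dates
works_names <- tibble(name = scientists[-6])    # the 9 names printed in works

# Who is in professions but has no row in dates / works?
missing_dates <- professions_names |> anti_join(dates_names, by = "name")
missing_works <- professions_names |> anti_join(works_names, by = "name")

missing_dates$name  # "Ada Lovelace" "Marie Curie"
missing_works$name  # "Rosalind Franklin"
```

These are exactly the rows that show NA for birth_year, death_year or known_for after the two left joins.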

Pivoting

Case study

customers

# A tibble: 2 × 4
  customer_id item_1 item_2       item_3
        <dbl> <chr>  <chr>        <chr> 
1           1 bread  milk         banana
2           2 milk   toilet paper <NA>

prices

# A tibble: 5 × 2
  item         price
  <chr>        <dbl>
1 avocado       0.5 
2 banana        0.15
3 bread         1   
4 milk          0.8 
5 toilet paper  3

Desired output

# A tibble: 6 × 4
  customer_id item_no item         price
        <dbl> <chr>   <chr>        <dbl>
1           1 item_1  bread         1   
2           1 item_2  milk          0.8 
3           1 item_3  banana        0.15
4           2 item_1  milk          0.8 
5           2 item_2  toilet paper  3   
6           2 item_3  <NA>         NA

From wide to long format

# A tibble: 2 × 4
  customer_id item_1 item_2       item_3
        <dbl> <chr>  <chr>        <chr> 
1           1 bread  milk         banana
2           2 milk   toilet paper <NA>

# A tibble: 6 × 3
  customer_id item_no item        
        <dbl> <chr>   <chr>       
1           1 item_1  bread       
2           1 item_2  milk        
3           1 item_3  banana      
4           2 item_1  milk        
5           2 item_2  toilet paper
6           2 item_3  <NA>

`pivot_longer()`

From customers to purchases:

purchases <- customers |> 
  pivot_longer(
    cols = c(item_1, item_2, item_3),  # variables to pivot
    names_to = "item_no",              # column names -> new column called item_no
    values_to = "item"                 # values in columns -> new column called item
    )

# A tibble: 6 × 3
  customer_id item_no item        
        <dbl> <chr>   <chr>       
1           1 item_1  bread       
2           1 item_2  milk        
3           1 item_3  banana      
4           2 item_1  milk        
5           2 item_2  toilet paper
6           2 item_3  <NA>
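
A possible variation, not shown in the slides: the columns can be selected with a tidyselect helper instead of being listed one by one, and empty slots can be dropped with `values_drop_na`. The name `purchases_no_na` is made up for this sketch, and customers is typed in from the printed output:

```r
library(tidyr)
library(tibble)

# customers, rebuilt by hand from the printed tibble
customers <- tribble(
  ~customer_id, ~item_1, ~item_2,        ~item_3,
             1, "bread", "milk",         "banana",
             2, "milk",  "toilet paper", NA
)

purchases_no_na <- customers |>
  pivot_longer(
    cols = starts_with("item_"),  # every column whose name starts with "item_"
    names_to = "item_no",
    values_to = "item",
    values_drop_na = TRUE         # drop rows where item is NA
  )

nrow(purchases_no_na)  # 5: the empty item_3 of customer 2 is gone
```

Whether dropping the NA row is desirable depends on the analysis; the desired output in this case study keeps it.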

`pivot_wider()`

Going back from purchases to customers:

purchases |> 
  pivot_wider(
    names_from = "item_no",              # values of item_no -> names of the new columns
    values_from = "item"                 # values of item -> cells of the new columns
    )

# A tibble: 2 × 4
  customer_id item_1 item_2       item_3
        <dbl> <chr>  <chr>        <chr> 
1           1 bread  milk         banana
2           2 milk   toilet paper <NA>
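
As a quick check that pivot_wider() undoes pivot_longer() here, the round trip can be compared with the original table. This is a sketch under the assumption that customers is rebuilt by hand as below; `round_trip` and `same` are names introduced for the illustration:

```r
library(tidyr)
library(tibble)

# customers, rebuilt by hand from the printed tibble
customers <- tribble(
  ~customer_id, ~item_1, ~item_2,        ~item_3,
             1, "bread", "milk",         "banana",
             2, "milk",  "toilet paper", NA
)

# Wide -> long -> wide again
round_trip <- customers |>
  pivot_longer(
    cols = c(item_1, item_2, item_3),
    names_to = "item_no",
    values_to = "item"
  ) |>
  pivot_wider(names_from = item_no, values_from = item)

# Compare as plain data frames to ignore any class differences
same <- isTRUE(all.equal(as.data.frame(customers), as.data.frame(round_trip)))
same  # TRUE
```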

Results

purchases |> 
  left_join(prices)

# A tibble: 6 × 4
  customer_id item_no item         price
        <dbl> <chr>   <chr>        <dbl>
1           1 item_1  bread         1   
2           1 item_2  milk          0.8 
3           1 item_3  banana        0.15
4           2 item_1  milk          0.8 
5           2 item_2  toilet paper  3   
6           2 item_3  <NA>         NA
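
Putting the two steps together, the whole case study fits in one pipeline. The sketch below rebuilds customers and prices by hand from the printed output and states the join key explicitly; `result` and `never_bought` are names made up for the illustration:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Both tables typed in from the printed tibbles
customers <- tribble(
  ~customer_id, ~item_1, ~item_2,        ~item_3,
             1, "bread", "milk",         "banana",
             2, "milk",  "toilet paper", NA
)

prices <- tribble(
  ~item,          ~price,
  "avocado",      0.5,
  "banana",       0.15,
  "bread",        1,
  "milk",         0.8,
  "toilet paper", 3
)

# Reshape to one row per purchase, then look up each price
result <- customers |>
  pivot_longer(
    cols = c(item_1, item_2, item_3),
    names_to = "item_no",
    values_to = "item"
  ) |>
  left_join(prices, by = "item")

# Items in the price list that nobody bought
never_bought <- prices |> anti_join(result, by = "item")
never_bought$item  # "avocado"
```

Because left_join() keeps every row of the reshaped table, the empty item_3 slot of customer 2 stays in the result with an NA price.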