Introduction to R

Laboratory of Statistics and Mathematics 2025/2026

Giuseppe Alfonzetti

Our goal

The data analysis pipeline:

Import

You will learn how to read external data sources from R. In particular, we will focus on

  • Universal text formats, such as .csv;
  • Proprietary spreadsheet formats, such as Excel .xls and .xlsx.
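
A minimal sketch of the import step, using base R only. The file is a temporary CSV created on the spot so the example runs anywhere; its name and columns are made up for illustration. For Excel files, one option is the readxl package, shown only as a comment because it needs a real spreadsheet and assumes the package is installed.

```r
# Write a small CSV to a temporary file so the example is self-contained
# (the file and its columns are hypothetical, for illustration only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age", "Anna,31", "Luca,27"), csv_path)

# Read the CSV back into a data frame with base R
people <- read.csv(csv_path)
people

# For Excel files, one option is the readxl package (assumed installed):
# sales <- readxl::read_excel("path/to/file.xlsx")
```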

Tidy

You will learn how to store your data in a consistent format.

  • Each row is an observation;
  • Each column is a variable, with a unique name;
  • Each variable has a specific type (numeric, character, logical, etc.).
# A tibble: 234 × 7
   manufacturer displ model      class     cyl trans        hwy
   <chr>        <dbl> <chr>      <chr>   <int> <chr>      <int>
 1 audi           1.8 a4         compact     4 auto(l5)      29
 2 audi           1.8 a4         compact     4 manual(m5)    29
 3 audi           2   a4         compact     4 manual(m6)    31
 4 audi           2   a4         compact     4 auto(av)      30
 5 audi           2.8 a4         compact     6 auto(l5)      26
 6 audi           2.8 a4         compact     6 manual(m5)    26
 7 audi           3.1 a4         compact     6 auto(av)      27
 8 audi           1.8 a4 quattro compact     4 manual(m5)    26
 9 audi           1.8 a4 quattro compact     4 auto(l5)      25
10 audi           2   a4 quattro compact     4 manual(m6)    28
# ℹ 224 more rows
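
The three tidy-data rules can be checked directly in R. The sketch below builds a tiny table by hand (values copied from the first rows printed above; the object name cars_tidy is hypothetical) and inspects its column types and names.

```r
# A small tidy table: one row per car, one column per variable
cars_tidy <- data.frame(
  manufacturer = c("audi", "audi", "audi"),
  displ = c(1.8, 1.8, 2.0),
  cyl = c(4L, 4L, 4L),
  hwy = c(29L, 29L, 31L),
  stringsAsFactors = FALSE
)

# Each variable has a specific type: inspect the class of every column
sapply(cars_tidy, class)

# Each column has a unique name: anyDuplicated() returns 0 when there are no repeats
anyDuplicated(names(cars_tidy)) == 0
```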

Transform

You will learn how to transform the data as you wish:

  • Select the variables of interest;
  • Combine existing variables into new variables;
  • Filter relevant observations;
  • Reshape data.
# A tibble: 234 × 3
   displ class     hwy
   <dbl> <chr>   <int>
 1   1.8 compact    29
 2   1.8 compact    29
 3   2   compact    31
 4   2   compact    30
 5   2.8 compact    26
 6   2.8 compact    26
 7   3.1 compact    27
 8   1.8 compact    26
 9   1.8 compact    25
10   2   compact    28
# ℹ 224 more rows
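
A rough sketch of the first three operations, assuming the dplyr and ggplot2 packages are installed. It assumes the table printed above comes from the mpg dataset shipped with ggplot2; the names cars, hwy_kml and compact_cars are hypothetical and chosen only for illustration.

```r
library(dplyr)

# mpg ships with ggplot2; the table printed above looks like a selection of it
cars <- ggplot2::mpg

# Select the variables of interest
cars |> select(displ, class, hwy)

# Combine existing variables into a new one:
# highway miles per gallon converted to kilometres per litre
cars |> mutate(hwy_kml = hwy * 0.425)

# Filter relevant observations: keep compact cars only
compact_cars <- cars |> filter(class == "compact")
compact_cars
```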

Visualize

You will learn how to explore data patterns with visualisations.

Model

This is the only step where “math” enters the game. It ranges from simple descriptive statistics to more elaborate modelling strategies, and is often combined with visualisations.
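
To give a flavour of both ends of that range, here is a small sketch in base R. It uses iris, a dataset preloaded in R; the choice of variables and the object name fit are only illustrative.

```r
# Simple descriptive statistics for one variable
mean(iris$Petal.Length)
sd(iris$Petal.Length)

# A more elaborate step: a linear model of petal width as a function of petal length
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
coef(fit)   # estimated intercept and slope
```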

Communicate

  • Write reports;
  • Choose appropriate visualizations;
  • Highlight the results in terms of insights.

Why R?

Replicability!

Excel vs R

  • Consider creating a new variable in your dataset:
  • Now you share your spreadsheet with your colleague.
  • She will not know the formula within each cell unless she inspects all of them individually!

Excel vs R

  • Consider the same data in R, stored in the object my_data
my_data 
# A tibble: 4 × 2
  QuantityA QuantityB
      <dbl>     <dbl>
1        50        20
2        90        25
3        10        45
4        30        15
  • You can create the same Tot column you had before in Excel with
library(dplyr)                   # provides mutate()
my_data |> 
  mutate(Tot = QuantityA + QuantityB)
# A tibble: 4 × 3
  QuantityA QuantityB   Tot
      <dbl>     <dbl> <dbl>
1        50        20    70
2        90        25   115
3        10        45    55
4        30        15    45
  • And now your colleague has no doubt about where those numbers come from!
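
For anyone who wants to reproduce the example from scratch, the sketch below rebuilds the table by hand (values copied from the printed output) and stores the result in a new object. It assumes dplyr is installed; the name my_result is hypothetical.

```r
library(dplyr)

# Recreate the table shown above
my_data <- tibble(
  QuantityA = c(50, 90, 10, 30),
  QuantityB = c(20, 25, 45, 15)
)

# The whole computation is written down explicitly, so anyone can read and rerun it
my_result <- my_data |>
  mutate(Tot = QuantityA + QuantityB)
my_result
```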

Excel vs R

  • You are interested in computing the total number of products sold by each representative:
  • That’s a lot of point-and-clicks!
  • Will your colleague be able to reproduce your results?

Excel vs R

Consider the same dataset in R stored in the new object my_data2:

my_data2
# A tibble: 10 × 4
   SalesRepresentative Product QuantitySold Region
   <chr>               <chr>          <dbl> <chr> 
 1 Eleonor             A                 50 IT    
 2 Elizabeth           B                 90 UK    
 3 John                C                 10 FR    
 4 Matt                A                 34 DE    
 5 Eleonor             A                  5 IT    
 6 Elizabeth           C                  8 IT    
 7 John                B                 20 UK    
 8 Matt                B                  5 UK    
 9 Eleonor             A                 15 UK    
10 Elizabeth           C                 45 DE    

You get the same results seen in Excel by running

my_data2 |> 
  group_by(SalesRepresentative) |> 
  summarise(Total = sum(QuantitySold))
# A tibble: 4 × 2
  SalesRepresentative Total
  <chr>               <dbl>
1 Eleonor                70
2 Elizabeth             143
3 John                   30
4 Matt                   39
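
The same pattern adapts to other questions by changing the grouping variable. The sketch below rebuilds the table by hand (values copied from the printed output) and computes totals by region instead of by representative. It assumes dplyr is installed; the name by_region is hypothetical.

```r
library(dplyr)

# Recreate the table shown above
my_data2 <- tibble(
  SalesRepresentative = c("Eleonor", "Elizabeth", "John", "Matt", "Eleonor",
                          "Elizabeth", "John", "Matt", "Eleonor", "Elizabeth"),
  Product      = c("A", "B", "C", "A", "A", "C", "B", "B", "A", "C"),
  QuantitySold = c(50, 90, 10, 34, 5, 8, 20, 5, 15, 45),
  Region       = c("IT", "UK", "FR", "DE", "IT", "IT", "UK", "UK", "UK", "DE")
)

# Total quantity sold in each region
by_region <- my_data2 |>
  group_by(Region) |>
  summarise(Total = sum(QuantitySold))
by_region
```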

Hands-on

Access R from Web

Basic R grammar

  • Run a command:
    • Place the cursor on the line and press CTRL+ENTER on Windows, or CMD+ENTER on macOS.
  • Assign a value to an object
a = 1                            # numeric value assigned with = 
b <- 2                           # numeric value assigned with <- 
c <- "ciao"                      # character value 
c                                # display the content stored in c
[1] "ciao"
  • Basic math
a+b                              # sum the values stored in a and b
[1] 3
  • Create a vector
v1 <- c(1, 7, 2)                 # create the vector manually
v2 <- 1:10                       # create a sequence 1 to 10 automatically
v3 <- seq(1,10,by=2)             # create a sequence automatically with the seq() function
  • Seek help
?seq                             # R will show you the documentation of the seq function
  • Create and inspect matrices
m <- matrix(1:6, nrow=2, ncol=3) # create a matrix
nrow(m)                          # inspect the number of rows
[1] 2
ncol(m)                          # inspect the number of columns
[1] 3
m[1,2]                           # display element on the first row, at the second column
[1] 3
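
A short sketch that builds on the commands above. It shows element-wise arithmetic on a vector and why m[1,2] returns 3: matrix() fills its values column by column unless told otherwise. The name m_byrow is hypothetical.

```r
v1 <- c(1, 7, 2)
v1 * 2          # arithmetic is applied element by element
v1[2]           # second element of the vector

# matrix() fills column by column by default, so the first row is 1, 3, 5
m <- matrix(1:6, nrow = 2, ncol = 3)
m

# With byrow = TRUE the same values are placed row by row instead
m_byrow <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m_byrow[1, 2]
```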

Quick visualization

Some test datasets are preloaded in R. This is the case for iris, a famous dataset of iris flower measurements.

head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

You can easily visualise the distribution of petal dimensions yourself by using the ggplot2 package. Don’t worry about understanding the syntax for now!

library(ggplot2)                 # load the plotting package
ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(aes(col = Species))
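
As a small optional extension, a plot can be stored in an object and refined by adding layers with +. The sketch below assumes ggplot2 is installed; the object name p and the label texts are hypothetical choices.

```r
library(ggplot2)

# Store the scatter plot in an object, then add readable axis labels and a title
p <- ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(aes(col = Species)) +
  labs(x = "Petal length (cm)", y = "Petal width (cm)",
       title = "Iris petal dimensions")

# Typing the object name draws the plot
p
```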