Solutions Lab 3 practice

Practice 1

Setup

I saved my R code in a script called practice1.R which is stored in the project folder.

First, we load all the packages that we are going to use

library(tidyverse)
library(here)
library(readxl)
library(writexl)

Then, we read the data.

# declare where you are in your project folder
i_am("practice1.R")

# construct the path to your data
path <- here("data", "employees.xlsx")

# read the data and store it in dt
dt <- read_xlsx(path=path)

Exercise 1

The function dim() returns the dimensions of the input. The first number is the number of rows. The second denotes the number of columns.

dim(dt)

[1] 150   6

Thus, the dataset stored in dt collects information about 150 employees.

You can extract the same dimensions by using the specialised functions nrow()and ncol()

nrow(dt)

[1] 150

ncol(dt)

[1] 6

A convenient way to quickly summarise the variables in your dataset is the function summary().

summary(dt)

  last_name          first_name         department          seniority     
 Length:150         Length:150         Length:150         Min.   : 0.000  
 Class :character   Class :character   Class :character   1st Qu.: 4.000  
 Mode  :character   Mode  :character   Mode  :character   Median : 7.000  
                                                          Mean   : 7.487  
                                                          3rd Qu.:12.000  
                                                          Max.   :15.000  
     salary             ID      
 Min.   : 42664   Min.   :1001  
 1st Qu.: 74397   1st Qu.:1038  
 Median :102606   Median :1076  
 Mean   :103789   Mean   :1076  
 3rd Qu.:129346   3rd Qu.:1113  
 Max.   :192685   Max.   :1150

For each variable in your dataset, it returns some summary statistics. In case of numeric values, as for seniority and salary, it provides you with mean and quantiles. You can extract the same information by using the corresponding specialised functions (recall that the $operator allows you to access a given column from a data.frame object):

mean(dt$seniority)

[1] 7.486667

median(dt$seniority)

[1] 7

quantile(dt$seniority)

  0%  25%  50%  75% 100% 
   0    4    7   12   15

Exercise 2

Exercise 2.1

We are interested in the top 3 earners from the I.T. deprtament. The arrange() function arranges values in ascending order by default. To switch to descending order you can either ask arrange() to sort according to -salary or wrap it in the desc() function:

dt |> 
    filter(department=="I.T.") |> 
    arrange(-salary)

# A tibble: 25 × 6
   last_name      first_name department seniority  salary    ID
   <chr>          <chr>      <chr>          <dbl>   <dbl> <dbl>
 1 Tutt           Jerry      I.T.              15 192685.  1032
 2 Aldaz          Alexis     I.T.              15 192046.  1058
 3 Clark          Harrison   I.T.              15 190690.  1114
 4 al-El-Sayed    Muna       I.T.              14 181644.  1052
 5 Running Rabbit Alexandra  I.T.              13 172680.  1103
 6 Martinez       Leydy      I.T.              12 165690.  1138
 7 Theno          Kiana      I.T.              11 154312.  1140
 8 Shazier        Alea       I.T.              10 148184.  1012
 9 Sherbon        Sydney     I.T.              10 146050.  1048
10 Hamlin         Connor     I.T.               9 139380.  1057
# ℹ 15 more rows

dt |> 
    filter(department=="I.T.") |> 
    arrange(desc(salary))

# A tibble: 25 × 6
   last_name      first_name department seniority  salary    ID
   <chr>          <chr>      <chr>          <dbl>   <dbl> <dbl>
 1 Tutt           Jerry      I.T.              15 192685.  1032
 2 Aldaz          Alexis     I.T.              15 192046.  1058
 3 Clark          Harrison   I.T.              15 190690.  1114
 4 al-El-Sayed    Muna       I.T.              14 181644.  1052
 5 Running Rabbit Alexandra  I.T.              13 172680.  1103
 6 Martinez       Leydy      I.T.              12 165690.  1138
 7 Theno          Kiana      I.T.              11 154312.  1140
 8 Shazier        Alea       I.T.              10 148184.  1012
 9 Sherbon        Sydney     I.T.              10 146050.  1048
10 Hamlin         Connor     I.T.               9 139380.  1057
# ℹ 15 more rows

If you want to focus on the top three and drop all the other employess, you might want to leverage the function row_number()

dt |> 
    filter(department=="I.T.") |> 
    arrange(desc(salary)) |> 
    filter(row_number()<=3)   # after arranging, you retain only employees in the top 3 rows

# A tibble: 3 × 6
  last_name first_name department seniority  salary    ID
  <chr>     <chr>      <chr>          <dbl>   <dbl> <dbl>
1 Tutt      Jerry      I.T.              15 192685.  1032
2 Aldaz     Alexis     I.T.              15 192046.  1058
3 Clark     Harrison   I.T.              15 190690.  1114

Exercise 2.2

We now want to compute the median salary of the I.T. department. There are many ways to do it. The most immediate is to use

dt |> 
    group_by(department) |> 
    summarise(salary=median(salary))

# A tibble: 6 × 2
  department  salary
  <chr>        <dbl>
1 Accounting  80716.
2 Finance     90454.
3 HR         110257.
4 I.T.       119080.
5 Marketing  107778.
6 Sales      106649.

which evaluates the median salary of each department, included the I.T. one. If you want to focus on the I.T department, you can use the filter() function. In this case, from a logical point of view, it is equivalent if you call filter() before or after summarise()

dt |> 
    filter(department=="I.T.") |> 
    group_by(department) |> 
    summarise(salary=median(salary))

# A tibble: 1 × 2
  department  salary
  <chr>        <dbl>
1 I.T.       119080.

dt |> 
    group_by(department) |> 
    summarise(salary=median(salary)) |> 
    filter(department=="I.T.")

# A tibble: 1 × 2
  department  salary
  <chr>        <dbl>
1 I.T.       119080.

Exercise 2.3

The exercise asks to construct a scatterplot for salary and seniority, colored by department using the ggplot2 package, which is automatically loaded with tidyverse

salary_plot <- ggplot(dt, aes(x=seniority, y=salary, color=department))+
                    geom_point()
salary_plot

The ggplot2 package is one of the most powerful plotting software available on the market, and provides the user with a wide set of tools to produce highly flexible and customisable graphs.

A very basic tweak you typically want to apply to your graph is to change its labels by using the labs() function

salary_plot <- salary_plot +
    labs(
        x="Seniority (years since hiring)", 
        y="Salary ($)", 
        color="Department:"
    )
salary_plot

The labs() function can also be used to set title and subtitle of the plot

salary_plot <- salary_plot +
    labs(
        title="Does salary increase with seniority?",
        subtitle="Department by department evaluation"
    )
salary_plot

When you compare across many groups, it might be more convenient to plot groups in separate panels. You produce such plots by using the facet_grid() or the facet_wrap() functions:

salary_plot+
    facet_grid(~department)

salary_plot <- salary_plot+
    facet_wrap(~department)
salary_plot

You can use preset themes to quickly modify the appearances of your plot. For example

salary_plot + theme_bw()

salary_plot + theme_light()

salary_plot <- salary_plot + theme_minimal()
salary_plot

You can further customise your plot by using the theme() function. See ?theme() for the full list of options available. For example

salary_plot +
    theme(
        legend.position="bottom",        # move the legend in a different position
        panel.grid.minor=element_blank() # remove minor grid lines
    )

Since we are using panels to differentiate across departments, the legend for colors is not really needed, thus, it might be sensible to remove it

salary_plot <- salary_plot +
    theme(
        legend.position="none",  # remove the legend 
        panel.grid.minor=element_blank() # declutter the plot by removing minor grid lines
    )
salary_plot

You can also customise the colors of the plots. For example, you can use a customised palette by specifying colors via hex codes,

my_palette <- c(
    "Accounting" = "#F7B1AB",
    "Finance"    = "#807182",
    "HR"         = "#E3DDDD",
    "I.T"        = "#A45C3D",
    "Sales"      = "#5c3410ff",
    "Marketing"  = "#1C0221"
    )
salary_plot +
    scale_color_manual(values=my_palette)

You can use ready-to-use color palettes provided by ggplot2

salary_plot+
    scale_color_viridis_d()

You can also use color palettes provided by external packages, such as ggsci or ggokabeito

library(ggokabeito)
salary_plot <- salary_plot +
    scale_color_okabe_ito()
salary_plot

Nevertheless, there are many other possibilities of customisation. You can find a whole book in this regard in the resources page.

When you are satisfied with your plot, you can save it with the ggsave() function. See ?ggsave() for a list of the exportable formats.

# You can save it in many different formats
ggsave(salary_plot, path = here("salary_plot.jpg")) # save it in jpg format
ggsave(salary_plot, path = here("salary_plot.pdf")) # save it in pdf format
ggsave(salary_plot, path = here("salary_plot.png")) # save it in png format

# You can also customise dimensions to better fit in your documents/slides
ggsave(salary_plot, path = here("salary_plot_wide.jpg"), width=10, height = 5)
ggsave(salary_plot, path = here("salary_plot_square.jpg"), width=8,  height = 8)

Exercise 3

dt |> 
    group_by(department) |> 
    summarise(
        total=sum(salary),
        average=mean(salary)
    )

# A tibble: 6 × 3
  department    total average
  <chr>         <dbl>   <dbl>
1 Accounting 1856399.  80713.
2 Finance    3004418.  93888.
3 HR         2142249. 107112.
4 I.T.       3120520. 124821.
5 Marketing  2394822. 108856.
6 Sales      3049907. 108925.

Exercise 4

Exercise 4.1

To check how many employees have just been hired across all departments we can use the filter() function

dt |> 
    filter(seniority==0)

# A tibble: 11 × 6
   last_name     first_name department seniority salary    ID
   <chr>         <chr>      <chr>          <dbl>  <dbl> <dbl>
 1 al-Momin      Mu'mina    Sales              0 47359.  1009
 2 Heng          Marina     Accounting         0 42770.  1026
 3 Huynh         Alicia     Finance            0 61532.  1027
 4 Zavala        Kristina   Finance            0 61431.  1068
 5 el-Ameen      Haneef     Finance            0 58284.  1089
 6 al-Abdallah   Labeeb     Finance            0 59477.  1106
 7 Hiler         Margaret   Finance            0 60065.  1123
 8 Zheng         Brittany   I.T.               0 52634.  1126
 9 al-Jamal      Muna       Marketing          0 42664.  1128
10 Apodaca-Anaya Daniel     Marketing          0 43220.  1148
11 Novell        Lindsey    I.T.               0 43226.  1149

the command returns eleven rows, so we already know the number of new hired is 11. However, we might want to explicitely compute the exact number. In this case, we can use the n() function, which do nothing else but counting elements

dt |> 
    filter(seniority==0) |> 
    summarise(new_hires = n())

# A tibble: 1 × 1
  new_hires
      <int>
1        11

See, however, that while you might expect to get a number, R is returning you a 1x1 tibble. If you need the numeric format, you can use

dt |> 
    filter(seniority==0) |> 
    summarise(new_hires = n()) |> 
    as.numeric()

[1] 11

Exercise 4.2

To count the new hires by department, it is sufficient to add the specification of the grouping structure, that is

new_hires_table <- dt |> 
    group_by(department) |> 
    filter(seniority==0) |> 
    summarise(new_hires = n())
new_hires_table

# A tibble: 5 × 2
  department new_hires
  <chr>          <int>
1 Accounting         1
2 Finance            5
3 I.T.               2
4 Marketing          2
5 Sales              1

Obviously, if you sum the new hires by department, you still get 11, as expected

sum(new_hires_table$new_hires)

[1] 11

If you want to extract the value for the I.T. department you can use filter()

new_hires_table |> 
    filter(department=="I.T.")

# A tibble: 1 × 2
  department new_hires
  <chr>          <int>
1 I.T.               2

and if you want to extract the numeric value only, you can use

new_hires_table |> 
    filter(department=="I.T.") |> 
    select(new_hires) |> 
    as.numeric()

[1] 2

Exercise 4.3

We focus on employees of the Finance department with at least 8 years seniority. Thus, first we specify such conditions in the filtering function, and then we summarise the salary of the filtered rows

dt |> 
    filter(department=="Finance", seniority >= 8) |> 
    summarise(average_salary = mean(salary))

# A tibble: 1 × 1
  average_salary
           <dbl>
1        119800.

If you want to do it by department, and at the same time checking how many employees satisfy the seniority condition, you can use

dt |> 
    filter(seniority >= 8) |> 
    group_by(department) |> 
    summarise(
        average_salary = mean(salary), 
        number_of_employees = n()
        )

# A tibble: 6 × 3
  department average_salary number_of_employees
  <chr>               <dbl>               <int>
1 Accounting        100111.                   9
2 Finance           119800.                  15
3 HR                130238.                  11
4 I.T.              162628.                  12
5 Marketing         139627.                  12
6 Sales             136137.                  15

Exercise 5

We need to compute the median salary, the total number of employees and the number of new hires by department. To do everything it the same summarise() call, there a small “trick” for what concerns the new hires. Recall how comparisons work in terms of logical values. If you call

dt$seniority==0

  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [25] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [85] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[121] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[145] FALSE FALSE FALSE  TRUE  TRUE FALSE

R returns a vector of TRUE and FALSE, where each element denotes if that specific row of the dataset is a new hire or not. If you call a function expecting a numeric input on such a vector, R will automatically consider TRUE as 1 and FALSE as 0. Thus

sum(dt$seniority==0)

[1] 11

returns the number of TRUEs in the vector, that is the number of new hires in the company. Therefore, we can compute such number by department by using

report_table <- dt |> 
    group_by(department) |> 
    summarise(
        med_sal=median(salary),
        total_n =n(),
        new_hires= sum(seniority==0)
    )
report_table

# A tibble: 6 × 4
  department med_sal total_n new_hires
  <chr>        <dbl>   <int>     <int>
1 Accounting  80716.      23         1
2 Finance     90454.      32         5
3 HR         110257.      20         0
4 I.T.       119080.      25         2
5 Marketing  107778.      22         2
6 Sales      106649.      28         1

Before exporting the table to Excel, you might want to rename columns in a more reader-friendly way. You can do this by means of the rename() function

excel_table <- report_table |> 
    rename(
        "Department"=department, 
        "Median salary" = med_sal,
        "Total number of employees"=total_n,
        "Number of new hires"= new_hires
        )
excel_table

# A tibble: 6 × 4
  Department `Median salary` `Total number of employees` `Number of new hires`
  <chr>                <dbl>                       <int>                 <int>
1 Accounting          80716.                          23                     1
2 Finance             90454.                          32                     5
3 HR                 110257.                          20                     0
4 I.T.               119080.                          25                     2
5 Marketing          107778.                          22                     2
6 Sales              106649.                          28                     1

When you are ready to export your table, you can call

write_xlsx(excel_table, path = here("my_table.xlsx"))

Exercise 6

ggplot(report_table, aes(x=department, y=med_sal))+
    geom_col()

Exercise 7

You can use the split() function to split a dataset according to a grouping variable. Here, we want to split the original dataset in many smaller datasets, where each smaller dataset collects employees from a specific department

dt_by_dept <- split(dt, dt$department)
typeof(dt_by_dept)

[1] "list"

the split() function returns you a list. In particular, it is a list of datasets, department by department, as you can see

dt_by_dept

$Accounting
# A tibble: 23 × 6
   last_name first_name department seniority  salary    ID
   <chr>     <chr>      <chr>          <dbl>   <dbl> <dbl>
 1 Thornock  Javonte    Accounting         6  81101.  1006
 2 Garcia    Mauricio   Accounting         8  86834.  1007
 3 Khuu      Vivian     Accounting        12 107517.  1019
 4 Miranda   Jonathan   Accounting         4  74053.  1020
 5 Lucero    Ashley     Accounting         2  55421.  1022
 6 Heng      Marina     Accounting         0  42770.  1026
 7 Crosby    Danielle   Accounting        13 111073.  1036
 8 al-Hamdan Thaamira   Accounting        10  98093.  1051
 9 al-Farag  Mukarram   Accounting         6  74256.  1060
10 Taft      Austin     Accounting         1  51476.  1067
# ℹ 13 more rows

$Finance
# A tibble: 32 × 6
   last_name first_name department seniority  salary    ID
   <chr>     <chr>      <chr>          <dbl>   <dbl> <dbl>
 1 Kierstead Tyana      Finance            3  73702.  1014
 2 Hawk      James      Finance            6  90241.  1015
 3 Mcintosh  Cleevens   Finance            8 107662.  1017
 4 Schmidt   Benjamin   Finance           13 129277.  1025
 5 Huynh     Alicia     Finance            0  61532.  1027
 6 Wiedmaier Dhipthika  Finance           13 128548.  1028
 7 Mccormick Ryan       Finance            5  80248.  1037
 8 Ngu       Hannah     Finance           15 139102.  1040
 9 Omani     Jonathan   Finance           13 131979.  1042
10 Wilson    Madison    Finance           12 126816.  1045
# ℹ 22 more rows

$HR
# A tibble: 20 × 6
   last_name       first_name department seniority  salary    ID
   <chr>           <chr>      <chr>          <dbl>   <dbl> <dbl>
 1 Whitaker        Jalen      HR                15 164507.  1002
 2 Gilroy          Sara       HR                 1  62324.  1016
 3 Dennis-Lopez    Justin     HR                14 158961.  1018
 4 Yuan            Joseph     HR                 2  62636.  1031
 5 al-Shaikh       Shakeela   HR                 1  61169.  1035
 6 Colunga         Derek      HR                12 147608.  1043
 7 Garcia Lares    Cheyenne   HR                10 122904.  1044
 8 Rogers          Alexandra  HR                 8 114061.  1047
 9 el-Din          Azza       HR                10 124099.  1054
10 Molina          Braxton    HR                 3  75562.  1061
11 Parra           Mayra      HR                 5  90524.  1062
12 Schlundt        Amanda     HR                 9 114741.  1072
13 Martinez        Janet      HR                 7  98415.  1075
14 Neppl           Samantha   HR                14 151313.  1081
15 el-Ebrahim      Rafeeda    HR                 8 113015.  1083
16 Abenes          Thao       HR                 5  89811.  1084
17 Cao             Leighton   HR                 8 107499.  1092
18 Gudino Cisneros Brittany   HR                 6  97372.  1096
19 Garcia          Rossy      HR                 8 113907.  1120
20 Cloud           Dipika     HR                 3  71823.  1142

$I.T.
# A tibble: 25 × 6
   last_name    first_name department seniority  salary    ID
   <chr>        <chr>      <chr>          <dbl>   <dbl> <dbl>
 1 Holguin      Austin     I.T.               9 138793.  1004
 2 Hoang        Eric       I.T.               6 106614.  1011
 3 Shazier      Alea       I.T.              10 148184.  1012
 4 Westrich     Rachael    I.T.               3  74821.  1013
 5 Lobato       Rashaina   I.T.               7 116607.  1030
 6 Tutt         Jerry      I.T.              15 192685.  1032
 7 Horton       Kayla      I.T.               6 108830.  1039
 8 Sherbon      Sydney     I.T.              10 146050.  1048
 9 Garcia Lamas Samuel     I.T.               4  86101.  1050
10 al-El-Sayed  Muna       I.T.              14 181644.  1052
# ℹ 15 more rows

$Marketing
# A tibble: 22 × 6
   last_name first_name department seniority  salary    ID
   <chr>     <chr>      <chr>          <dbl>   <dbl> <dbl>
 1 al-Harron Fikra      Marketing          4  72654.  1001
 2 Martinez  Yessica    Marketing          9 108185.  1023
 3 Green     Sarye      Marketing         15 154446.  1029
 4 Wall      Kayla      Marketing          4  79165.  1063
 5 Wynter    Michael    Marketing          2  64957.  1065
 6 Briseno   Dominick   Marketing         15 154728.  1073
 7 Hennefeld Mitchell   Marketing          6  85930.  1074
 8 Mccarthy  Angelica   Marketing         12 137013.  1080
 9 Diltz     Zachary    Marketing          4  75023.  1087
10 Magor     Jacob      Marketing         15 156230.  1093
# ℹ 12 more rows

$Sales
# A tibble: 28 × 6
   last_name first_name department seniority  salary    ID
   <chr>     <chr>      <chr>          <dbl>   <dbl> <dbl>
 1 Pillow    Cleevens   Sales              7 102665.  1003
 2 Bertsch   Damon      Sales             12 140699.  1005
 3 Cheung    Timothy    Sales              7 100437.  1008
 4 al-Momin  Mu'mina    Sales              0  47359.  1009
 5 Vogltanz  Jesse      Sales             12 141815.  1010
 6 Hernandez Chanelle   Sales             14 155107.  1021
 7 Pyo       Courtney   Sales              3  70371.  1024
 8 Perkins   Senaiet    Sales              7 104424.  1033
 9 Smith     Lawrence   Sales              1  59669.  1034
10 Freeman   Ali        Sales              3  72765.  1038
# ℹ 18 more rows

If you input a list in write_xlsx(), it creates an excel file with one sheet for each dataset in the list

write_xlsx(dt_by_dept, path=here("multi_sheet.xlsx"))

Practice 2

Setup

I saved my R code in a script called practice2.R which is stored in the project folder.

library(tidyverse)
library(here)
library(readxl)
library(writexl)

# declare where you are in your project folder
i_am("practice2.R")

# construct the path to the new data
path2 <- here("data", "employees_company2.xlsx")

# read the data and store it in dt2
dt2 <- read_xlsx(path=path2)

Exercise 1

The new dataset is very similar to the previous one, but column names exhibit inconsistencies: spaces, capitalised initial letters, capitalised words etc.

dt2

# A tibble: 200 × 6
   `Last NAME` `First-name` department Seniority  salary    ID
   <chr>       <chr>        <chr>          <dbl>   <dbl> <dbl>
 1 Sandoval    Jerika       Sales              1  62222.  1001
 2 Lucas       Mark         HR                11  73847.  1002
 3 Pasang      Taylor       Sales              9 126987.  1003
 4 Peters      Christopher  Sales              2  70845.  1004
 5 Thomas      Noel         HR                10     NA   1005
 6 Geshick     Angel        HR                 6  58865.  1006
 7 White       Alton        HR                12  72237.  1007
 8 Parkhill    Seth         Marketing         15 133987.  1008
 9 Balogh      Joshua       Finance           10 136700.  1009
10 al-Shahin   Tuhfa        Marketing         13 119953.  1010
# ℹ 190 more rows

It is not a problem per se, but it might lead to syntax errors (for example, next time you’ll need to extract the seniority variable from the dataset, will you remember that Seniority has capitalised first letter while that’s not the case for salary?). To avoid confusion, it is convenient to stick to some naming standards, such as the snake_case, where no capitalisation is allowed and spaces are replaced by underscores.

You can change a variable name manually

dt2 |> 
    rename(last_name = `Last NAME`)

# A tibble: 200 × 6
   last_name `First-name` department Seniority  salary    ID
   <chr>     <chr>        <chr>          <dbl>   <dbl> <dbl>
 1 Sandoval  Jerika       Sales              1  62222.  1001
 2 Lucas     Mark         HR                11  73847.  1002
 3 Pasang    Taylor       Sales              9 126987.  1003
 4 Peters    Christopher  Sales              2  70845.  1004
 5 Thomas    Noel         HR                10     NA   1005
 6 Geshick   Angel        HR                 6  58865.  1006
 7 White     Alton        HR                12  72237.  1007
 8 Parkhill  Seth         Marketing         15 133987.  1008
 9 Balogh    Joshua       Finance           10 136700.  1009
10 al-Shahin Tuhfa        Marketing         13 119953.  1010
# ℹ 190 more rows

or you can leverage some convenient helper functions, such as the clean_names() function from the janitor package

library(janitor)
dt2 <- dt2 |> 
    clean_names()
dt2

# A tibble: 200 × 6
   last_name first_name  department seniority  salary    id
   <chr>     <chr>       <chr>          <dbl>   <dbl> <dbl>
 1 Sandoval  Jerika      Sales              1  62222.  1001
 2 Lucas     Mark        HR                11  73847.  1002
 3 Pasang    Taylor      Sales              9 126987.  1003
 4 Peters    Christopher Sales              2  70845.  1004
 5 Thomas    Noel        HR                10     NA   1005
 6 Geshick   Angel       HR                 6  58865.  1006
 7 White     Alton       HR                12  72237.  1007
 8 Parkhill  Seth        Marketing         15 133987.  1008
 9 Balogh    Joshua      Finance           10 136700.  1009
10 al-Shahin Tuhfa       Marketing         13 119953.  1010
# ℹ 190 more rows

Exercise 2

summary(dt2)

  last_name          first_name         department          seniority     
 Length:200         Length:200         Length:200         Min.   : 0.000  
 Class :character   Class :character   Class :character   1st Qu.: 4.000  
 Mode  :character   Mode  :character   Mode  :character   Median : 8.000  
                                                          Mean   : 7.665  
                                                          3rd Qu.:11.000  
                                                          Max.   :15.000  
                                                                          
     salary             id      
 Min.   : 44817   Min.   :1001  
 1st Qu.: 70802   1st Qu.:1051  
 Median : 98066   Median :1100  
 Mean   : 98716   Mean   :1100  
 3rd Qu.:121850   3rd Qu.:1150  
 Max.   :187488   Max.   :1200  
 NA's   :20

On top of returning summary statistics such as mean and quantiles, the summary() function also returns you the number of missing values (NA) observed on each column. In this case, we get 20 NAs on the salary variable

We are asked to impute them by using the median salary of the departmet of reference. We can do this by combining group_by(), mutate() and replace()

dt2_imputed <- dt2 |> 
    group_by(department) |> 
    mutate(salary = replace(salary,  is.na(salary),  median(salary, na.rm=TRUE)))
dt2_imputed

# A tibble: 200 × 6
# Groups:   department [6]
   last_name first_name  department seniority  salary    id
   <chr>     <chr>       <chr>          <dbl>   <dbl> <dbl>
 1 Sandoval  Jerika      Sales              1  62222.  1001
 2 Lucas     Mark        HR                11  73847.  1002
 3 Pasang    Taylor      Sales              9 126987.  1003
 4 Peters    Christopher Sales              2  70845.  1004
 5 Thomas    Noel        HR                10  66857.  1005
 6 Geshick   Angel       HR                 6  58865.  1006
 7 White     Alton       HR                12  72237.  1007
 8 Parkhill  Seth        Marketing         15 133987.  1008
 9 Balogh    Joshua      Finance           10 136700.  1009
10 al-Shahin Tuhfa       Marketing         13 119953.  1010
# ℹ 190 more rows

This new version of the data contains no missing values

summary(dt2_imputed)

  last_name          first_name         department          seniority     
 Length:200         Length:200         Length:200         Min.   : 0.000  
 Class :character   Class :character   Class :character   1st Qu.: 4.000  
 Mode  :character   Mode  :character   Mode  :character   Median : 8.000  
                                                          Mean   : 7.665  
                                                          3rd Qu.:11.000  
                                                          Max.   :15.000  
     salary             id      
 Min.   : 44817   Min.   :1001  
 1st Qu.: 72020   1st Qu.:1051  
 Median :100575   Median :1100  
 Mean   : 99095   Mean   :1100  
 3rd Qu.:120793   3rd Qu.:1150  
 Max.   :187488   Max.   :1200