+ - 0:00:00
Notes for current slide
Notes for next slide



Data Wrangling:
Part I

Dr. Mine Dogucu

1 / 46

Review

2 / 46

Goals

  • Subsetting data frames
  • Changing variables
3 / 46

Subsetting data frames

4 / 46
glimpse(lapd)
Rows: 68,564
Columns: 35
$ `Row ID` <chr> "3-1000027830ctFu", "3-1000155488ctFu",…
$ Year <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 201
$ `Department Title` <chr> "Police (LAPD)", "Police (LAPD)", "Poli…
$ `Payroll Department` <dbl> 4301, 4302, 4301, 4301, 4302, 4302, 430…
$ `Record Number` <dbl> 1000027830, 1000155488, 1000194958, 100…
$ `Job Class Title` <chr> "Police Detective II", "Clerk Typist", …
$ `Employment Type` <chr> "Full Time", "Full Time", "Full Time", …
$ `Hourly or Event Rate` <dbl> 53.16, 23.77, 60.80, 60.98, 45.06, 34.4…
$ `Projected Annual Salary` <dbl> 110998.08, 49623.67, 126950.40, 127326.…
$ `Q1 Payments` <dbl> 24931.20, 11343.96, 24184.00, 29391.20,…
$ `Q2 Payments` <dbl> 29181.61, 13212.37, 28327.20, 36591.20,…
$ `Q3 Payments` <dbl> 26545.80, 11508.36, 28744.20, 32904.81,…
$ `Q4 Payments` <dbl> 29605.30, 13442.53, 33224.88, 37234.03,…
$ `Payments Over Base Pay` <dbl> 4499.12, 1844.82, 13192.43, 18034.53, 1…
$ `% Over Base Pay` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ `Total Payments` <dbl> 110263.91, 49507.22, 114480.28, 136121.…
$ `Base Pay` <dbl> 105764.79, 47662.40, 101287.85, 118086.…
$ `Permanent Bonus Pay` <dbl> 3174.12, 0.00, 7363.95, 7086.67, 0.00, …
$ `Longevity Bonus Pay` <dbl> 0.00, 1310.82, 0.00, 0.00, 1251.19, 172…
$ `Temporary Bonus Pay` <dbl> 1325.00, 0.00, 1205.00, 1325.00, 125.00…
$ `Lump Sum Pay` <dbl> 0.00, 0.00, 2133.18, 0.00, 2068.80, 0.0…
$ `Overtime Pay` <dbl> 0.00, 0.00, 4424.32, 9839.33, 0.00, 0.0…
$ `Other Pay & Adjustments` <dbl> 0.00, 534.00, -1934.02, -216.47, -2068.…
$ `Other Pay (Payroll Explorer)` <dbl> 4499.12, 1844.82, 8768.11, 8195.20, 137…
$ MOU <chr> "24", "3", "24", "24", "12", "3", "24",…
$ `MOU Title` <chr> "POLICE OFFICERS UNIT", "CLERICAL UNIT"…
$ `FMS Department` <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,…
$ `Job Class` <chr> "2223", "1358", "2227", "2232", "1839",…
$ `Pay Grade` <chr> "2", "0", "1", "1", "0", "2", "3", "1",…
$ `Average Health Cost` <dbl> 11651.40, 10710.24, 11651.40, 11651.40,…
$ `Average Dental Cost` <dbl> 898.08, 405.24, 898.08, 898.08, 405.24,…
$ `Average Basic Life` <dbl> 191.04, 11.40, 191.04, 191.04, 11.40, 1…
$ `Average Benefit Cost` <dbl> 12740.52, 11126.88, 12740.52, 12740.52,…
$ `Benefits Plan` <chr> "Police", "City", "Police", "Police", "
$ `Job Class Link` <chr> "http://per.lacity.org/perspecs/2223.pd…
5 / 46
lapd <- clean_names(lapd)
glimpse(lapd)
Rows: 68,564
Columns: 35
$ row_id <chr> "3-1000027830ctFu", "3-1000155488ctFu", "3-…
$ year <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ department_title <chr> "Police (LAPD)", "Police (LAPD)", "Police (…
$ payroll_department <dbl> 4301, 4302, 4301, 4301, 4302, 4302, 4301, 4
$ record_number <dbl> 1000027830, 1000155488, 1000194958, 1000232
$ job_class_title <chr> "Police Detective II", "Clerk Typist", "Pol…
$ employment_type <chr> "Full Time", "Full Time", "Full Time", "Ful…
$ hourly_or_event_rate <dbl> 53.16, 23.77, 60.80, 60.98, 45.06, 34.42, 4
$ projected_annual_salary <dbl> 110998.08, 49623.67, 126950.40, 127326.24, …
$ q1_payments <dbl> 24931.20, 11343.96, 24184.00, 29391.20, 208
$ q2_payments <dbl> 29181.61, 13212.37, 28327.20, 36591.20, 241
$ q3_payments <dbl> 26545.80, 11508.36, 28744.20, 32904.81, 215
$ q4_payments <dbl> 29605.30, 13442.53, 33224.88, 37234.03, 252
$ payments_over_base_pay <dbl> 4499.12, 1844.82, 13192.43, 18034.53, 1376.
$ percent_over_base_pay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ total_payments <dbl> 110263.91, 49507.22, 114480.28, 136121.24, …
$ base_pay <dbl> 105764.79, 47662.40, 101287.85, 118086.71, …
$ permanent_bonus_pay <dbl> 3174.12, 0.00, 7363.95, 7086.67, 0.00, 0.00
$ longevity_bonus_pay <dbl> 0.00, 1310.82, 0.00, 0.00, 1251.19, 1726.16
$ temporary_bonus_pay <dbl> 1325.00, 0.00, 1205.00, 1325.00, 125.00, 68
$ lump_sum_pay <dbl> 0.00, 0.00, 2133.18, 0.00, 2068.80, 0.00, 0
$ overtime_pay <dbl> 0.00, 0.00, 4424.32, 9839.33, 0.00, 0.00, 4
$ other_pay_adjustments <dbl> 0.00, 534.00, -1934.02, -216.47, -2068.80, …
$ other_pay_payroll_explorer <dbl> 4499.12, 1844.82, 8768.11, 8195.20, 1376.19
$ mou <chr> "24", "3", "24", "24", "12", "3", "24", "24…
$ mou_title <chr> "POLICE OFFICERS UNIT", "CLERICAL UNIT", "P…
$ fms_department <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,…
$ job_class <chr> "2223", "1358", "2227", "2232", "1839", "22…
$ pay_grade <chr> "2", "0", "1", "1", "0", "2", "3", "1", "B"…
$ average_health_cost <dbl> 11651.40, 10710.24, 11651.40, 11651.40, 107…
$ average_dental_cost <dbl> 898.08, 405.24, 898.08, 898.08, 405.24, 405…
$ average_basic_life <dbl> 191.04, 11.40, 191.04, 191.04, 11.40, 11.40…
$ average_benefit_cost <dbl> 12740.52, 11126.88, 12740.52, 12740.52, 111…
$ benefits_plan <chr> "Police", "City", "Police", "Police", "City…
$ job_class_link <chr> "http://per.lacity.org/perspecs/2223.pdf", …
6 / 46

subsetting variables/columns

7 / 46

subsetting variables/columns

select()

8 / 46

subsetting observations/rows

slice() and filter()

9 / 46

select is used to select certain variables in the data frame.

select(lapd, year, base_pay)
# A tibble: 68,564 × 2
year base_pay
<dbl> <dbl>
1 2013 105765.
2 2013 47662.
3 2013 101288.
4 2013 118087.
5 2013 90322.
6 2013 62770.
# … with 68,558 more rows
10 / 46

select is used to select certain variables in the data frame.

select(lapd, year, base_pay)
# A tibble: 68,564 × 2
year base_pay
<dbl> <dbl>
1 2013 105765.
2 2013 47662.
3 2013 101288.
4 2013 118087.
5 2013 90322.
6 2013 62770.
# … with 68,558 more rows
lapd %>%
select(year, base_pay)
# A tibble: 68,564 × 2
year base_pay
<dbl> <dbl>
1 2013 105765.
2 2013 47662.
3 2013 101288.
4 2013 118087.
5 2013 90322.
6 2013 62770.
# … with 68,558 more rows
11 / 46

select can also be used to drop certain variables if used with a negative sign.

select(lapd, -row_id, -department_title)
# A tibble: 68,564 × 33
year payroll_department record_number job_class_title employment_type
<dbl> <dbl> <dbl> <chr> <chr>
1 2013 4301 1000027830 Police Detective II Full Time
2 2013 4302 1000155488 Clerk Typist Full Time
3 2013 4301 1000194958 Police Sergeant I Full Time
4 2013 4301 1000232317 Police Lieutenant I Full Time
5 2013 4302 1000329284 Principal Storekeeper Full Time
6 2013 4302 1001124320 Police Service Represe… Full Time
# … with 68,558 more rows, and 28 more variables: hourly_or_event_rate <dbl>,
# projected_annual_salary <dbl>, q1_payments <dbl>, q2_payments <dbl>,
# q3_payments <dbl>, q4_payments <dbl>, payments_over_base_pay <dbl>,
# percent_over_base_pay <dbl>, total_payments <dbl>, base_pay <dbl>,
# permanent_bonus_pay <dbl>, longevity_bonus_pay <dbl>,
# temporary_bonus_pay <dbl>, lump_sum_pay <dbl>, overtime_pay <dbl>,
# other_pay_adjustments <dbl>, other_pay_payroll_explorer <dbl>, mou <chr>, …
12 / 46

Selection helpers

starts_with()
ends_with()
contains()

13 / 46
select(lapd, starts_with("q"))
# A tibble: 68,564 × 4
q1_payments q2_payments q3_payments q4_payments
<dbl> <dbl> <dbl> <dbl>
1 24931. 29182. 26546. 29605.
2 11344. 13212. 11508. 13443.
3 24184 28327. 28744. 33225.
4 29391. 36591. 32905. 37234.
5 20813 24136 21518. 25231.
6 16057. 17927. 14150. 17052.
# … with 68,558 more rows
14 / 46
select(lapd, ends_with("pay"))
# A tibble: 68,564 × 8
payments_over_ba… percent_over_bas… base_pay permanent_bonus… longevity_bonus…
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4499. 0 105765. 3174. 0
2 1845. 0 47662. 0 1311.
3 13192. 0 101288. 7364. 0
4 18035. 0 118087. 7087. 0
5 1376. 0 90322. 0 1251.
6 2415. 0 62770. 0 1726.
# … with 68,558 more rows, and 3 more variables: temporary_bonus_pay <dbl>,
# lump_sum_pay <dbl>, overtime_pay <dbl>
15 / 46
select(lapd, contains("pay"))
# A tibble: 68,564 × 17
payroll_department q1_payments q2_payments q3_payments q4_payments
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4301 24931. 29182. 26546. 29605.
2 4302 11344. 13212. 11508. 13443.
3 4301 24184 28327. 28744. 33225.
4 4301 29391. 36591. 32905. 37234.
5 4302 20813 24136 21518. 25231.
6 4302 16057. 17927. 14150. 17052.
# … with 68,558 more rows, and 12 more variables: payments_over_base_pay <dbl>,
# percent_over_base_pay <dbl>, total_payments <dbl>, base_pay <dbl>,
# permanent_bonus_pay <dbl>, longevity_bonus_pay <dbl>,
# temporary_bonus_pay <dbl>, lump_sum_pay <dbl>, overtime_pay <dbl>,
# other_pay_adjustments <dbl>, other_pay_payroll_explorer <dbl>,
# pay_grade <chr>
16 / 46

subsetting variables/columns

17 / 46

subsetting variables/columns

select()

18 / 46

subsetting observations/rows

slice() and filter()

19 / 46

slice() subsetting rows based on a row number.

The data below include all the rows from third to seventh. Including third and seventh.

slice(lapd, 3:7)
# A tibble: 5 × 35
row_id year department_title payroll_departm… record_number job_class_title
<chr> <dbl> <chr> <dbl> <dbl> <chr>
1 3-1000194958ctFu 2013 Police (LAPD) 4301 1000194958 Police Sergean…
2 3-1000232317ctFu 2013 Police (LAPD) 4301 1000232317 Police Lieuten…
3 3-1000329284ctFu 2013 Police (LAPD) 4302 1000329284 Principal Stor…
4 3-1001124320ctFu 2013 Police (LAPD) 4302 1001124320 Police Service…
5 3-1001221822ctFu 2013 Police (LAPD) 4301 1001221822 Police Officer…
# … with 29 more variables: employment_type <chr>, hourly_or_event_rate <dbl>,
# projected_annual_salary <dbl>, q1_payments <dbl>, q2_payments <dbl>,
# q3_payments <dbl>, q4_payments <dbl>, payments_over_base_pay <dbl>,
# percent_over_base_pay <dbl>, total_payments <dbl>, base_pay <dbl>,
# permanent_bonus_pay <dbl>, longevity_bonus_pay <dbl>,
# temporary_bonus_pay <dbl>, lump_sum_pay <dbl>, overtime_pay <dbl>,
# other_pay_adjustments <dbl>, other_pay_payroll_explorer <dbl>, mou <chr>, …
20 / 46

slice() subsetting rows based on a row number.

The data below include all the rows from third to seventh. Including third and seventh.

slice(lapd, 3:7)
# A tibble: 5 × 35
row_id year department_title payroll_departm… record_number job_class_title
<chr> <dbl> <chr> <dbl> <dbl> <chr>
1 3-1000194958ctFu 2013 Police (LAPD) 4301 1000194958 Police Sergean…
2 3-1000232317ctFu 2013 Police (LAPD) 4301 1000232317 Police Lieuten…
3 3-1000329284ctFu 2013 Police (LAPD) 4302 1000329284 Principal Stor…
4 3-1001124320ctFu 2013 Police (LAPD) 4302 1001124320 Police Service…
5 3-1001221822ctFu 2013 Police (LAPD) 4301 1001221822 Police Officer…
# … with 29 more variables: employment_type <chr>, hourly_or_event_rate <dbl>,
# projected_annual_salary <dbl>, q1_payments <dbl>, q2_payments <dbl>,
# q3_payments <dbl>, q4_payments <dbl>, payments_over_base_pay <dbl>,
# percent_over_base_pay <dbl>, total_payments <dbl>, base_pay <dbl>,
# permanent_bonus_pay <dbl>, longevity_bonus_pay <dbl>,
# temporary_bonus_pay <dbl>, lump_sum_pay <dbl>, overtime_pay <dbl>,
# other_pay_adjustments <dbl>, other_pay_payroll_explorer <dbl>, mou <chr>, …

filter() subsetting rows based on a condition.

The data below includes rows when the recorded year is 2018.

filter(lapd, year == 2018)
# A tibble: 14,824 × 35
row_id year department_title payroll_departm… record_number job_class_title
<chr> <dbl> <chr> <dbl> <dbl> <chr>
1 8-1000027830ctFu 2018 Police (LAPD) 4301 1000027830 Police Detecti…
2 8-1000194958ctFu 2018 Police (LAPD) 4301 1000194958 Police Sergean…
3 8-1000232317ctFu 2018 Police (LAPD) 4301 1000232317 Police Lieuten…
4 8-1001124320ctFu 2018 Police (LAPD) 4302 1001124320 Police Service…
5 8-1001221822ctFu 2018 Police (LAPD) 4301 1001221822 Police Officer…
6 8-1001317832ctFu 2018 Police (LAPD) 4301 1001317832 Police Officer…
# … with 14,818 more rows, and 29 more variables: employment_type <chr>,
# hourly_or_event_rate <dbl>, projected_annual_salary <dbl>,
# q1_payments <dbl>, q2_payments <dbl>, q3_payments <dbl>, q4_payments <dbl>,
# payments_over_base_pay <dbl>, percent_over_base_pay <dbl>,
# total_payments <dbl>, base_pay <dbl>, permanent_bonus_pay <dbl>,
# longevity_bonus_pay <dbl>, temporary_bonus_pay <dbl>, lump_sum_pay <dbl>,
# overtime_pay <dbl>, other_pay_adjustments <dbl>, …
21 / 46

Relational Operators in R

Operator Description
< Less than
> Greater than
<= Less than or equal to
>= Greater than or equal to
== Equal to
!= Not equal to

Logical Operators in R

Operator Description
& and
| or

Miscellaneous Operators

Operator Description
: creates a series of numbers
%in% checks if an element is in a vector
%*% matrix multiplication
22 / 46

Q. According to datausa.io Los Angeles had a median household income of $62474 in 2018. How many LAPD staff members had a base pay higher than $62474 in year 2018 according to this data?

Median household income is not the same thing as median employee income. Our aim is data wrangling and not necessarily statistical analysis for now.

23 / 46
lapd %>%
filter(year == 2018 & base_pay > 62474)
# A tibble: 11,690 × 35
row_id year department_title payroll_departm… record_number job_class_title
<chr> <dbl> <chr> <dbl> <dbl> <chr>
1 8-1000027830ctFu 2018 Police (LAPD) 4301 1000027830 Police Detecti…
2 8-1000194958ctFu 2018 Police (LAPD) 4301 1000194958 Police Sergean…
3 8-1000232317ctFu 2018 Police (LAPD) 4301 1000232317 Police Lieuten…
4 8-1001124320ctFu 2018 Police (LAPD) 4302 1001124320 Police Service…
5 8-1001221822ctFu 2018 Police (LAPD) 4301 1001221822 Police Officer…
6 8-1001317832ctFu 2018 Police (LAPD) 4301 1001317832 Police Officer…
# … with 11,684 more rows, and 29 more variables: employment_type <chr>,
# hourly_or_event_rate <dbl>, projected_annual_salary <dbl>,
# q1_payments <dbl>, q2_payments <dbl>, q3_payments <dbl>, q4_payments <dbl>,
# payments_over_base_pay <dbl>, percent_over_base_pay <dbl>,
# total_payments <dbl>, base_pay <dbl>, permanent_bonus_pay <dbl>,
# longevity_bonus_pay <dbl>, temporary_bonus_pay <dbl>, lump_sum_pay <dbl>,
# overtime_pay <dbl>, other_pay_adjustments <dbl>, …
24 / 46
lapd %>%
filter(year == 2018 & base_pay > 62474) %>%
nrow()
[1] 11690
25 / 46

Q. How many observations are available between 2013 and 2015 including 2013 and 2015?

26 / 46

Q. How many observations are available between 2013 and 2015 including 2013 and 2015?

lapd %>%
filter(year >= 2013 & year <= 2015)
# A tibble: 40,227 × 35
row_id year department_title payroll_departm… record_number job_class_title
<chr> <dbl> <chr> <dbl> <dbl> <chr>
1 3-1000027830ctFu 2013 Police (LAPD) 4301 1000027830 Police Detecti…
2 3-1000155488ctFu 2013 Police (LAPD) 4302 1000155488 Clerk Typist
3 3-1000194958ctFu 2013 Police (LAPD) 4301 1000194958 Police Sergean…
4 3-1000232317ctFu 2013 Police (LAPD) 4301 1000232317 Police Lieuten…
5 3-1000329284ctFu 2013 Police (LAPD) 4302 1000329284 Principal Stor…
6 3-1001124320ctFu 2013 Police (LAPD) 4302 1001124320 Police Service…
# … with 40,221 more rows, and 29 more variables: employment_type <chr>,
# hourly_or_event_rate <dbl>, projected_annual_salary <dbl>,
# q1_payments <dbl>, q2_payments <dbl>, q3_payments <dbl>, q4_payments <dbl>,
# payments_over_base_pay <dbl>, percent_over_base_pay <dbl>,
# total_payments <dbl>, base_pay <dbl>, permanent_bonus_pay <dbl>,
# longevity_bonus_pay <dbl>, temporary_bonus_pay <dbl>, lump_sum_pay <dbl>,
# overtime_pay <dbl>, other_pay_adjustments <dbl>, …
27 / 46

Q. How many observations are available between 2013 and 2015 including 2013 and 2015?

lapd %>%
filter(year >= 2013 & year <= 2015) %>%
nrow()
[1] 40227
28 / 46

Q. How many LAPD staff were employed full time in 2018?

lapd %>%
filter(employment_type == "Full Time" & year == 2018) %>%
nrow()
[1] 14664
29 / 46

We have done all sorts of selections, slicing, filtering on lapd but it has not changed at all. Why do you think so?

glimpse(lapd)
Rows: 68,564
Columns: 35
$ row_id <chr> "3-1000027830ctFu", "3-1000155488ctFu", "3-…
$ year <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ department_title <chr> "Police (LAPD)", "Police (LAPD)", "Police (…
$ payroll_department <dbl> 4301, 4302, 4301, 4301, 4302, 4302, 4301, 4
$ record_number <dbl> 1000027830, 1000155488, 1000194958, 1000232
$ job_class_title <chr> "Police Detective II", "Clerk Typist", "Pol…
$ employment_type <chr> "Full Time", "Full Time", "Full Time", "Ful…
$ hourly_or_event_rate <dbl> 53.16, 23.77, 60.80, 60.98, 45.06, 34.42, 4
$ projected_annual_salary <dbl> 110998.08, 49623.67, 126950.40, 127326.24, …
$ q1_payments <dbl> 24931.20, 11343.96, 24184.00, 29391.20, 208
$ q2_payments <dbl> 29181.61, 13212.37, 28327.20, 36591.20, 241
$ q3_payments <dbl> 26545.80, 11508.36, 28744.20, 32904.81, 215
$ q4_payments <dbl> 29605.30, 13442.53, 33224.88, 37234.03, 252
$ payments_over_base_pay <dbl> 4499.12, 1844.82, 13192.43, 18034.53, 1376.
$ percent_over_base_pay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ total_payments <dbl> 110263.91, 49507.22, 114480.28, 136121.24, …
$ base_pay <dbl> 105764.79, 47662.40, 101287.85, 118086.71, …
$ permanent_bonus_pay <dbl> 3174.12, 0.00, 7363.95, 7086.67, 0.00, 0.00
$ longevity_bonus_pay <dbl> 0.00, 1310.82, 0.00, 0.00, 1251.19, 1726.16
$ temporary_bonus_pay <dbl> 1325.00, 0.00, 1205.00, 1325.00, 125.00, 68
$ lump_sum_pay <dbl> 0.00, 0.00, 2133.18, 0.00, 2068.80, 0.00, 0
$ overtime_pay <dbl> 0.00, 0.00, 4424.32, 9839.33, 0.00, 0.00, 4
$ other_pay_adjustments <dbl> 0.00, 534.00, -1934.02, -216.47, -2068.80, …
$ other_pay_payroll_explorer <dbl> 4499.12, 1844.82, 8768.11, 8195.20, 1376.19
$ mou <chr> "24", "3", "24", "24", "12", "3", "24", "24…
$ mou_title <chr> "POLICE OFFICERS UNIT", "CLERICAL UNIT", "P…
$ fms_department <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,…
$ job_class <chr> "2223", "1358", "2227", "2232", "1839", "22…
$ pay_grade <chr> "2", "0", "1", "1", "0", "2", "3", "1", "B"…
$ average_health_cost <dbl> 11651.40, 10710.24, 11651.40, 11651.40, 107…
$ average_dental_cost <dbl> 898.08, 405.24, 898.08, 898.08, 405.24, 405…
$ average_basic_life <dbl> 191.04, 11.40, 191.04, 191.04, 11.40, 11.40…
$ average_benefit_cost <dbl> 12740.52, 11126.88, 12740.52, 12740.52, 111…
$ benefits_plan <chr> "Police", "City", "Police", "Police", "City…
$ job_class_link <chr> "http://per.lacity.org/perspecs/2223.pdf", …
30 / 46

Moving forward we are only going to focus on year 2018, and use job_class_title, employment_type, and base_pay. Let's clean our data accordingly and move on with the smaller lapd data that we need.

31 / 46
lapd %>%
filter(year == 2018) %>%
select(job_class_title,
employment_type,
base_pay)
# A tibble: 14,824 × 3
job_class_title employment_type base_pay
<chr> <chr> <dbl>
1 Police Detective II Full Time 119322.
2 Police Sergeant I Full Time 113271.
3 Police Lieutenant II Full Time 148116
4 Police Service Representative II Full Time 78677.
5 Police Officer III Full Time 109374.
6 Police Officer II Full Time 95002.
# … with 14,818 more rows
32 / 46
lapd <-
lapd %>%
filter(year == 2018) %>%
select(job_class_title,
employment_type,
base_pay)
33 / 46
glimpse(lapd)
Rows: 14,824
Columns: 3
$ job_class_title <chr> "Police Detective II", "Police Sergeant I", "Police Li…
$ employment_type <chr> "Full Time", "Full Time", "Full Time", "Full Time", "F
$ base_pay <dbl> 119321.60, 113270.70, 148116.00, 78676.87, 109373.63, …

Goal:

Create a new variable called base_pay_k that represents base_pay in thousand dollars.

34 / 46
lapd %>%
mutate(base_pay_k = base_pay/1000)
# A tibble: 14,824 × 4
job_class_title employment_type base_pay base_pay_k
<chr> <chr> <dbl> <dbl>
1 Police Detective II Full Time 119322. 119.
2 Police Sergeant I Full Time 113271. 113.
3 Police Lieutenant II Full Time 148116 148.
4 Police Service Representative II Full Time 78677. 78.7
5 Police Officer III Full Time 109374. 109.
6 Police Officer II Full Time 95002. 95.0
# … with 14,818 more rows
35 / 46
glimpse(lapd)
Rows: 14,824
Columns: 3
$ job_class_title <chr> "Police Detective II", "Police Sergeant I", "Police Li…
$ employment_type <chr> "Full Time", "Full Time", "Full Time", "Full Time", "F
$ base_pay <dbl> 119321.60, 113270.70, 148116.00, 78676.87, 109373.63, …

Goal:

Create a new variable called base_pay_level which has Less Than 0, No Income, Less than Median and Greater than 0 and Greater than Median. We will consider $62474 as the median (from previous lecture).

36 / 46

Let's first check to see there is anyone earning exactly the median value.

lapd %>%
filter(base_pay == 62474)
# A tibble: 0 × 3
# … with 3 variables: job_class_title <chr>, employment_type <chr>,
# base_pay <dbl>
37 / 46
lapd %>%
mutate(base_pay_level = case_when(
base_pay < 0 ~ "Less than 0",
base_pay == 0 ~ "No Income",
base_pay < 62474 & base_pay > 0 ~ "Less than Median, Greater than 0",
base_pay > 62474 ~ "Greater than Median"))
# A tibble: 14,824 × 4
job_class_title employment_type base_pay base_pay_level
<chr> <chr> <dbl> <chr>
1 Police Detective II Full Time 119322. Greater than Median
2 Police Sergeant I Full Time 113271. Greater than Median
3 Police Lieutenant II Full Time 148116 Greater than Median
4 Police Service Representative II Full Time 78677. Greater than Median
5 Police Officer III Full Time 109374. Greater than Median
6 Police Officer II Full Time 95002. Greater than Median
# … with 14,818 more rows
38 / 46

We can't really see what we have created

lapd %>%
mutate(base_pay_level = case_when(
base_pay < 0 ~ "Less than 0",
base_pay == 0 ~ "No Income",
base_pay < 62474 & base_pay > 0 ~ "Less than Median, Greater than 0",
base_pay > 62474 ~ "Greater than Median")) %>%
select(base_pay_level)
# A tibble: 14,824 × 1
base_pay_level
<chr>
1 Greater than Median
2 Greater than Median
3 Greater than Median
4 Greater than Median
5 Greater than Median
6 Greater than Median
# … with 14,818 more rows
39 / 46

We can use pipes with ggplot too!

lapd %>%
mutate(base_pay_level = case_when(
base_pay < 0 ~ "Less than 0",
base_pay == 0 ~ "No Income",
base_pay < 62474 & base_pay > 0 ~ "Less than Median, Greater than 0",
base_pay > 62474 ~ "Greater than Median")) %>%
select(base_pay_level) %>%
ggplot(aes(x = base_pay_level)) +
geom_bar()

40 / 46
glimpse(lapd)
Rows: 14,824
Columns: 3
$ job_class_title <chr> "Police Detective II", "Police Sergeant I", "Police Li…
$ employment_type <chr> "Full Time", "Full Time", "Full Time", "Full Time", "F
$ base_pay <dbl> 119321.60, 113270.70, 148116.00, 78676.87, 109373.63, …

Goal:

Make job_class_title and employment_type factor variables.

41 / 46
lapd %>%
mutate(employment_type = as.factor(employment_type),
job_class_title = as.factor(job_class_title))
# A tibble: 14,824 × 3
job_class_title employment_type base_pay
<fct> <fct> <dbl>
1 Police Detective II Full Time 119322.
2 Police Sergeant I Full Time 113271.
3 Police Lieutenant II Full Time 148116
4 Police Service Representative II Full Time 78677.
5 Police Officer III Full Time 109374.
6 Police Officer II Full Time 95002.
# … with 14,818 more rows
42 / 46

as.factor() - makes a vector factor
as.numeric() - makes a vector numeric
as.integer() - makes a vector integer
as.double() - makes a vector double
as.character() - makes a vector character

43 / 46

Once again we did not "save" anything into lapd. As we work on data cleaning it makes sense not to "save" the data frames. Once we see the final data frame we want then we can "save" (i.e. overwrite) it.

44 / 46

In your lecture notes, you can do all the changes in this lecture in one long set of piped code. That's the beauty of piping!

lapd <-
lapd %>%
clean_names() %>%
filter(year == 2018) %>%
select(job_class_title,
employment_type,
base_pay) %>%
mutate(employment_type = as.factor(employment_type),
job_class_title = as.factor(job_class_title),
base_pay_level = case_when(
base_pay < 0 ~ "Less than 0",
base_pay == 0 ~ "No Income",
base_pay < 62474 & base_pay > 0 ~ "Less than Median, Greater than 0",
base_pay > 62474 ~ "Greater than Median"))
45 / 46

Word of caution

The functions clean_names(), select(), filter(), mutate() all take a data frame as the first argument. Even though we do not see it, the data frame is piped through from the previous step of code at each step. When we use these functions without the %>% we have to include the data frame explicitly.

Data frame is used as the first argument

clean_names(lapd)
# A tibble: 14,824 × 3
job_class_title employment_type base_pay
<chr> <chr> <dbl>
1 Police Detective II Full Time 119322.
2 Police Sergeant I Full Time 113271.
3 Police Lieutenant II Full Time 148116
4 Police Service Representative II Full Time 78677.
5 Police Officer III Full Time 109374.
6 Police Officer II Full Time 95002.
# … with 14,818 more rows

Data frame is piped

lapd %>%
clean_names()
# A tibble: 14,824 × 3
job_class_title employment_type base_pay
<chr> <chr> <dbl>
1 Police Detective II Full Time 119322.
2 Police Sergeant I Full Time 113271.
3 Police Lieutenant II Full Time 148116
4 Police Service Representative II Full Time 78677.
5 Police Officer III Full Time 109374.
6 Police Officer II Full Time 95002.
# … with 14,818 more rows
46 / 46

Review

2 / 46
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow