+ - 0:00:00
Notes for current slide
Notes for next slide



Data Wrangling:
Part II

Dr. Mine Dogucu

1 / 29

Review

Quiz

2 / 29

Goals

  • Review aggregating data
  • Aggregating group data
  • Data joins
  • Lots of data wrangling practice
3 / 29

Data

Observations

Aggregate Data

Summaries of observations

4 / 29

Aggregating Categorical Data

5 / 29

Categorical data are summarized with counts or proportions

6 / 29
lapd %>%
count(employment_type)
# A tibble: 3 × 2
employment_type n
<fct> <int>
1 Full Time 14664
2 Part Time 132
3 Per Event 28
7 / 29
lapd %>%
count(employment_type) %>%
mutate(prop = n/sum(n))
# A tibble: 3 × 3
employment_type n prop
<fct> <int> <dbl>
1 Full Time 14664 0.989
2 Part Time 132 0.00890
3 Per Event 28 0.00189
8 / 29

Mean

summarize(lapd,
mean_base_pay = mean(base_pay))
# A tibble: 1 × 1
mean_base_pay
<dbl>
1 85149.
9 / 29

Recall that we use multiple functions such as mean() and median within the summarize function

summarize(lapd,
mean_base_pay = mean(base_pay),
med_base_pay = median(base_pay))
# A tibble: 1 × 2
mean_base_pay med_base_pay
<dbl> <dbl>
1 85149. 97601.
10 / 29

Aggregating Data by Groups

11 / 29

group_by()

group_by() separates the data frame by the groups. Any action following group_by() will be completed for each group separately.

12 / 29

Q. What is the median salary for each employment type?

13 / 29
lapd %>%
group_by(employment_type)
# A tibble: 14,824 × 4
# Groups: employment_type [3]
job_class_title employment_type base_pay base_pay_level
<fct> <fct> <dbl> <chr>
1 Police Detective II Full Time 119322. Greater than Median
2 Police Sergeant I Full Time 113271. Greater than Median
3 Police Lieutenant II Full Time 148116 Greater than Median
4 Police Service Representative II Full Time 78677. Greater than Median
5 Police Officer III Full Time 109374. Greater than Median
6 Police Officer II Full Time 95002. Greater than Median
# … with 14,818 more rows
14 / 29
lapd %>%
group_by(employment_type) %>%
summarize(med_base_pay = median(base_pay))
# A tibble: 3 × 2
employment_type med_base_pay
<fct> <dbl>
1 Full Time 97996.
2 Part Time 14474.
3 Per Event 4275
15 / 29

We can also remind ourselves how many staff members there were in each group.

lapd %>%
group_by(employment_type) %>%
summarize(med_base_pay = median(base_pay),
count = n())
# A tibble: 3 × 3
employment_type med_base_pay count
<fct> <dbl> <int>
1 Full Time 97996. 14664
2 Part Time 14474. 132
3 Per Event 4275 28

Note that n() does not take any arguments.

16 / 29
artists
# A tibble: 4 × 2
name followers
<chr> <dbl>
1 Beyoncé 24757958
2 Taylor Swift 33098116
3 Ariana Grande 51807131
4 Drake 50252529
songs
# A tibble: 5 × 4
name song_name album_name song_popularity
<chr> <chr> <chr> <dbl>
1 Beyoncé Savage Remix (feat. Beyoncé) Savage Rem… 83
2 Taylor Swift cardigan folklore 85
3 Drake Laugh Now Cry Later (feat. Lil Durk) Laugh Now … 95
4 Beyoncé Halo I AM…SASHA… NA
5 Ariana Grande Stuck with U (with Justin Bieber) Stuck with… NA
albums
# A tibble: 4 × 2
album_name album_release_date
<chr> <date>
1 Savage Remix (feat. Beyoncé) 2020-04-29
2 I AM…SASHA FIERCE 2008-11-14
3 Stuck with U 2020-05-08
4 Laugh Now Cry Later (feat. Lil Durk) 2020-08-14
21 / 29

22 / 29

Join Functions

something_join(x, y) All the join functions in dplyr package are in this fashion where x represents the first data frame and y represents the second data frame.

23 / 29
songs
# A tibble: 5 × 4
name song_name album_name song_popularity
<chr> <chr> <chr> <dbl>
1 Beyoncé Savage Remix (feat. Beyoncé) Savage Rem… 83
2 Taylor Swift cardigan folklore 85
3 Drake Laugh Now Cry Later (feat. Lil Durk) Laugh Now … 95
4 Beyoncé Halo I AM…SASHA… NA
5 Ariana Grande Stuck with U (with Justin Bieber) Stuck with… NA
albums
# A tibble: 4 × 2
album_name album_release_date
<chr> <date>
1 Savage Remix (feat. Beyoncé) 2020-04-29
2 I AM…SASHA FIERCE 2008-11-14
3 Stuck with U 2020-05-08
4 Laugh Now Cry Later (feat. Lil Durk) 2020-08-14
left_join(songs, albums, by = "album_name")
# A tibble: 5 × 5
name song_name album_name song_popularity album_release_d…
<chr> <chr> <chr> <dbl> <date>
1 Beyoncé Savage Remix (… Savage Remix (… 83 2020-04-29
2 Taylor Swift cardigan folklore 85 NA
3 Drake Laugh Now Cry … Laugh Now Cry … 95 2020-08-14
4 Beyoncé Halo I AM…SASHA FIE… NA 2008-11-14
5 Ariana Grande Stuck with U (… Stuck with U NA 2020-05-08

left_join() includes all rows from x

24 / 29
songs
# A tibble: 5 × 4
name song_name album_name song_popularity
<chr> <chr> <chr> <dbl>
1 Beyoncé Savage Remix (feat. Beyoncé) Savage Rem… 83
2 Taylor Swift cardigan folklore 85
3 Drake Laugh Now Cry Later (feat. Lil Durk) Laugh Now … 95
4 Beyoncé Halo I AM…SASHA… NA
5 Ariana Grande Stuck with U (with Justin Bieber) Stuck with… NA
albums
# A tibble: 4 × 2
album_name album_release_date
<chr> <date>
1 Savage Remix (feat. Beyoncé) 2020-04-29
2 I AM…SASHA FIERCE 2008-11-14
3 Stuck with U 2020-05-08
4 Laugh Now Cry Later (feat. Lil Durk) 2020-08-14
right_join(songs, albums, by = "album_name")
# A tibble: 4 × 5
name song_name album_name song_popularity album_release_d…
<chr> <chr> <chr> <dbl> <date>
1 Beyoncé Savage Remix (… Savage Remix (… 83 2020-04-29
2 Drake Laugh Now Cry … Laugh Now Cry … 95 2020-08-14
3 Beyoncé Halo I AM…SASHA FIE… NA 2008-11-14
4 Ariana Grande Stuck with U (… Stuck with U NA 2020-05-08

right_join() includes all rows from y

25 / 29
songs
# A tibble: 5 × 4
name song_name album_name song_popularity
<chr> <chr> <chr> <dbl>
1 Beyoncé Savage Remix (feat. Beyoncé) Savage Rem… 83
2 Taylor Swift cardigan folklore 85
3 Drake Laugh Now Cry Later (feat. Lil Durk) Laugh Now … 95
4 Beyoncé Halo I AM…SASHA… NA
5 Ariana Grande Stuck with U (with Justin Bieber) Stuck with… NA
albums
# A tibble: 4 × 2
album_name album_release_date
<chr> <date>
1 Savage Remix (feat. Beyoncé) 2020-04-29
2 I AM…SASHA FIERCE 2008-11-14
3 Stuck with U 2020-05-08
4 Laugh Now Cry Later (feat. Lil Durk) 2020-08-14
inner_join(songs, albums, by = "album_name")
# A tibble: 4 × 5
name song_name album_name song_popularity album_release_d…
<chr> <chr> <chr> <dbl> <date>
1 Beyoncé Savage Remix (… Savage Remix (… 83 2020-04-29
2 Drake Laugh Now Cry … Laugh Now Cry … 95 2020-08-14
3 Beyoncé Halo I AM…SASHA FIE… NA 2008-11-14
4 Ariana Grande Stuck with U (… Stuck with U NA 2020-05-08

inner_join() includes all rows that are in x and y

26 / 29
songs
# A tibble: 5 × 4
name song_name album_name song_popularity
<chr> <chr> <chr> <dbl>
1 Beyoncé Savage Remix (feat. Beyoncé) Savage Rem… 83
2 Taylor Swift cardigan folklore 85
3 Drake Laugh Now Cry Later (feat. Lil Durk) Laugh Now … 95
4 Beyoncé Halo I AM…SASHA… NA
5 Ariana Grande Stuck with U (with Justin Bieber) Stuck with… NA
albums
# A tibble: 4 × 2
album_name album_release_date
<chr> <date>
1 Savage Remix (feat. Beyoncé) 2020-04-29
2 I AM…SASHA FIERCE 2008-11-14
3 Stuck with U 2020-05-08
4 Laugh Now Cry Later (feat. Lil Durk) 2020-08-14
full_join(songs, albums, by = "album_name")
# A tibble: 5 × 5
name song_name album_name song_popularity album_release_d…
<chr> <chr> <chr> <dbl> <date>
1 Beyoncé Savage Remix (… Savage Remix (… 83 2020-04-29
2 Taylor Swift cardigan folklore 85 NA
3 Drake Laugh Now Cry … Laugh Now Cry … 95 2020-08-14
4 Beyoncé Halo I AM…SASHA FIE… NA 2008-11-14
5 Ariana Grande Stuck with U (… Stuck with U NA 2020-05-08

full_join() includes all rows that are in x or y

27 / 29
artists
# A tibble: 4 × 2
name followers
<chr> <dbl>
1 Beyoncé 24757958
2 Taylor Swift 33098116
3 Ariana Grande 51807131
4 Drake 50252529
songs
# A tibble: 5 × 4
name song_name album_name song_popularity
<chr> <chr> <chr> <dbl>
1 Beyoncé Savage Remix (feat. Beyoncé) Savage Rem… 83
2 Taylor Swift cardigan folklore 85
3 Drake Laugh Now Cry Later (feat. Lil Durk) Laugh Now … 95
4 Beyoncé Halo I AM…SASHA… NA
5 Ariana Grande Stuck with U (with Justin Bieber) Stuck with… NA
albums
# A tibble: 4 × 2
album_name album_release_date
<chr> <date>
1 Savage Remix (feat. Beyoncé) 2020-04-29
2 I AM…SASHA FIERCE 2008-11-14
3 Stuck with U 2020-05-08
4 Laugh Now Cry Later (feat. Lil Durk) 2020-08-14
full_join(artists, songs, by = "name") %>%
full_join(albums, by = "album_name")
# A tibble: 5 × 6
name followers song_name album_name song_popularity album_release_d…
<chr> <dbl> <chr> <chr> <dbl> <date>
1 Beyoncé 24757958 Savage Re… Savage Re… 83 2020-04-29
2 Beyoncé 24757958 Halo I AM…SASH… NA 2008-11-14
3 Taylor Swift 33098116 cardigan folklore 85 NA
4 Ariana Grande 51807131 Stuck wit… Stuck wit… NA 2020-05-08
5 Drake 50252529 Laugh Now… Laugh Now… 95 2020-08-14
28 / 29

Complete the questions provided to you in the lecture notes.

29 / 29

Review

Quiz

2 / 29
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow