class: title-slide <br> <br> .right-panel[ # Data Wrangling: <br>Part II ## Dr. Mine Dogucu ] --- class: middle ## Review Quiz --- class: middle ## Goals - Review aggregating data - Aggregating group data - Data joins - Lots of data wrangling practice --- class: middle .pull-left[ ## Data Observations ] .pull-left[ ## Aggregate Data Summaries of observations ] --- class: inverse middle .font75[Aggregating Categorical Data] --- class: middle Categorical data are summarized with **counts** or **proportions** --- class: middle ```r lapd %>% count(employment_type) ``` ``` # A tibble: 3 × 2 employment_type n <fct> <int> 1 Full Time 14664 2 Part Time 132 3 Per Event 28 ``` --- class: middle ```r lapd %>% count(employment_type) %>% mutate(prop = n/sum(n)) ``` ``` # A tibble: 3 × 3 employment_type n prop <fct> <int> <dbl> 1 Full Time 14664 0.989 2 Part Time 132 0.00890 3 Per Event 28 0.00189 ``` --- class: middle ## Mean ```r summarize(lapd, mean_base_pay = mean(base_pay)) ``` ``` # A tibble: 1 × 1 mean_base_pay <dbl> 1 85149. ``` --- class: middle Recall that we use multiple functions such as `mean()` and `median` within the summarize function ```r summarize(lapd, mean_base_pay = mean(base_pay), med_base_pay = median(base_pay)) ``` ``` # A tibble: 1 × 2 mean_base_pay med_base_pay <dbl> <dbl> 1 85149. 97601. ``` --- class: inverse middle .font75[Aggregating Data by Groups] --- class: middle `group_by()` <img src="img/data-wrangle.003.jpeg" width="80%" style="display: block; margin: auto;" /> `group_by()` separates the data frame by the groups. Any action following `group_by()` will be completed for each group separately. --- class: middle Q. What is the median salary for each employment type? --- ```r lapd %>% group_by(employment_type) ``` ``` # A tibble: 14,824 × 4 # Groups: employment_type [3] job_class_title employment_type base_pay base_pay_level <fct> <fct> <dbl> <chr> 1 Police Detective II Full Time 119322. Greater than Median 2 Police Sergeant I Full Time 113271. Greater than Median 3 Police Lieutenant II Full Time 148116 Greater than Median 4 Police Service Representative II Full Time 78677. Greater than Median 5 Police Officer III Full Time 109374. Greater than Median 6 Police Officer II Full Time 95002. Greater than Median # … with 14,818 more rows ``` --- ```r lapd %>% group_by(employment_type) %>% summarize(med_base_pay = median(base_pay)) ``` ``` # A tibble: 3 × 2 employment_type med_base_pay <fct> <dbl> 1 Full Time 97996. 2 Part Time 14474. 3 Per Event 4275 ``` --- class: middle We can also remind ourselves how many staff members there were in each group. ```r lapd %>% group_by(employment_type) %>% summarize(med_base_pay = median(base_pay), count = n()) ``` ``` # A tibble: 3 × 3 employment_type med_base_pay count <fct> <dbl> <int> 1 Full Time 97996. 14664 2 Part Time 14474. 132 3 Per Event 4275 28 ``` Note that `n()` does not take any arguments. --- class: middle <img src="img/data_joins.png" width="340" style="display: block; margin: auto;" /> .font15[Image from [R for Data Science book](https://r4ds.had.co.nz/index.html) licensed under the [Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License](https://creativecommons.org/licenses/by-nc-nd/3.0/us/)] --- class: middle ## Inner Join <img src="img/inner_join.png" width="1056" style="display: block; margin: auto;" /> .font15[Image from [R for Data Science book](https://r4ds.had.co.nz/index.html) licensed under the [Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License](https://creativecommons.org/licenses/by-nc-nd/3.0/us/)] --- class: middle #### Outer Joins <img src="img/outer_joins.png" width="40%" style="display: block; margin: auto;" /> .font15[Image from [R for Data Science book](https://r4ds.had.co.nz/index.html) licensed under the [Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License](https://creativecommons.org/licenses/by-nc-nd/3.0/us/)] --- class: middle ## Data Joins <img src="img/join_venn.png" width="80%" style="display: block; margin: auto;" /> .font15[Image from [R for Data Science book](https://r4ds.had.co.nz/index.html) licensed under the [Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License](https://creativecommons.org/licenses/by-nc-nd/3.0/us/)] --- .panelset[ .panel[ .panel-name[artists] ```r artists ``` ``` # A tibble: 4 × 2 name followers <chr> <dbl> 1 Beyoncé 24757958 2 Taylor Swift 33098116 3 Ariana Grande 51807131 4 Drake 50252529 ``` ] .panel[ .panel-name[songs] ```r songs ``` ``` # A tibble: 5 × 4 name song_name album_name song_popularity <chr> <chr> <chr> <dbl> 1 Beyoncé Savage Remix (feat. Beyoncé) Savage Rem… 83 2 Taylor Swift cardigan folklore 85 3 Drake Laugh Now Cry Later (feat. Lil Durk) Laugh Now … 95 4 Beyoncé Halo I AM…SASHA… NA 5 Ariana Grande Stuck with U (with Justin Bieber) Stuck with… NA ``` ] .panel[ .panel-name[albums] ```r albums ``` ``` # A tibble: 4 × 2 album_name album_release_date <chr> <date> 1 Savage Remix (feat. Beyoncé) 2020-04-29 2 I AM…SASHA FIERCE 2008-11-14 3 Stuck with U 2020-05-08 4 Laugh Now Cry Later (feat. Lil Durk) 2020-08-14 ``` ] ] --- class: middle <img src="img/data_joins_spotify.png" width="90%" style="display: block; margin: auto;" /> --- class: middle ## Join Functions `something_join(x, y)` All the join functions in dplyr package are in this fashion where x represents the first data frame and y represents the second data frame. --- .panelset[ .panel[ .panel-name[x] ```r songs ``` ``` # A tibble: 5 × 4 name song_name album_name song_popularity <chr> <chr> <chr> <dbl> 1 Beyoncé Savage Remix (feat. Beyoncé) Savage Rem… 83 2 Taylor Swift cardigan folklore 85 3 Drake Laugh Now Cry Later (feat. Lil Durk) Laugh Now … 95 4 Beyoncé Halo I AM…SASHA… NA 5 Ariana Grande Stuck with U (with Justin Bieber) Stuck with… NA ``` ] .panel[ .panel-name[y] ```r albums ``` ``` # A tibble: 4 × 2 album_name album_release_date <chr> <date> 1 Savage Remix (feat. Beyoncé) 2020-04-29 2 I AM…SASHA FIERCE 2008-11-14 3 Stuck with U 2020-05-08 4 Laugh Now Cry Later (feat. Lil Durk) 2020-08-14 ``` ] .panel[ .panel-name[left_join()] ```r left_join(songs, albums, by = "album_name") ``` ``` # A tibble: 5 × 5 name song_name album_name song_popularity album_release_d… <chr> <chr> <chr> <dbl> <date> 1 Beyoncé Savage Remix (… Savage Remix (… 83 2020-04-29 2 Taylor Swift cardigan folklore 85 NA 3 Drake Laugh Now Cry … Laugh Now Cry … 95 2020-08-14 4 Beyoncé Halo I AM…SASHA FIE… NA 2008-11-14 5 Ariana Grande Stuck with U (… Stuck with U NA 2020-05-08 ``` ] ] `left_join()` includes all rows from x --- .panelset[ .panel[ .panel-name[x] ```r songs ``` ``` # A tibble: 5 × 4 name song_name album_name song_popularity <chr> <chr> <chr> <dbl> 1 Beyoncé Savage Remix (feat. Beyoncé) Savage Rem… 83 2 Taylor Swift cardigan folklore 85 3 Drake Laugh Now Cry Later (feat. Lil Durk) Laugh Now … 95 4 Beyoncé Halo I AM…SASHA… NA 5 Ariana Grande Stuck with U (with Justin Bieber) Stuck with… NA ``` ] .panel[ .panel-name[y] ```r albums ``` ``` # A tibble: 4 × 2 album_name album_release_date <chr> <date> 1 Savage Remix (feat. Beyoncé) 2020-04-29 2 I AM…SASHA FIERCE 2008-11-14 3 Stuck with U 2020-05-08 4 Laugh Now Cry Later (feat. Lil Durk) 2020-08-14 ``` ] .panel[ .panel-name[right_join()] ```r right_join(songs, albums, by = "album_name") ``` ``` # A tibble: 4 × 5 name song_name album_name song_popularity album_release_d… <chr> <chr> <chr> <dbl> <date> 1 Beyoncé Savage Remix (… Savage Remix (… 83 2020-04-29 2 Drake Laugh Now Cry … Laugh Now Cry … 95 2020-08-14 3 Beyoncé Halo I AM…SASHA FIE… NA 2008-11-14 4 Ariana Grande Stuck with U (… Stuck with U NA 2020-05-08 ``` ] ] `right_join()` includes all rows from `y` --- .panelset[ .panel[ .panel-name[x] ```r songs ``` ``` # A tibble: 5 × 4 name song_name album_name song_popularity <chr> <chr> <chr> <dbl> 1 Beyoncé Savage Remix (feat. Beyoncé) Savage Rem… 83 2 Taylor Swift cardigan folklore 85 3 Drake Laugh Now Cry Later (feat. Lil Durk) Laugh Now … 95 4 Beyoncé Halo I AM…SASHA… NA 5 Ariana Grande Stuck with U (with Justin Bieber) Stuck with… NA ``` ] .panel[ .panel-name[y] ```r albums ``` ``` # A tibble: 4 × 2 album_name album_release_date <chr> <date> 1 Savage Remix (feat. Beyoncé) 2020-04-29 2 I AM…SASHA FIERCE 2008-11-14 3 Stuck with U 2020-05-08 4 Laugh Now Cry Later (feat. Lil Durk) 2020-08-14 ``` ] .panel[ .panel-name[inner_join()] ```r inner_join(songs, albums, by = "album_name") ``` ``` # A tibble: 4 × 5 name song_name album_name song_popularity album_release_d… <chr> <chr> <chr> <dbl> <date> 1 Beyoncé Savage Remix (… Savage Remix (… 83 2020-04-29 2 Drake Laugh Now Cry … Laugh Now Cry … 95 2020-08-14 3 Beyoncé Halo I AM…SASHA FIE… NA 2008-11-14 4 Ariana Grande Stuck with U (… Stuck with U NA 2020-05-08 ``` ] ] `inner_join()` includes all rows that are in x **and** y --- .panelset[ .panel[ .panel-name[x] ```r songs ``` ``` # A tibble: 5 × 4 name song_name album_name song_popularity <chr> <chr> <chr> <dbl> 1 Beyoncé Savage Remix (feat. Beyoncé) Savage Rem… 83 2 Taylor Swift cardigan folklore 85 3 Drake Laugh Now Cry Later (feat. Lil Durk) Laugh Now … 95 4 Beyoncé Halo I AM…SASHA… NA 5 Ariana Grande Stuck with U (with Justin Bieber) Stuck with… NA ``` ] .panel[ .panel-name[y] ```r albums ``` ``` # A tibble: 4 × 2 album_name album_release_date <chr> <date> 1 Savage Remix (feat. Beyoncé) 2020-04-29 2 I AM…SASHA FIERCE 2008-11-14 3 Stuck with U 2020-05-08 4 Laugh Now Cry Later (feat. Lil Durk) 2020-08-14 ``` ] .panel[ .panel-name[full_join()] ```r full_join(songs, albums, by = "album_name") ``` ``` # A tibble: 5 × 5 name song_name album_name song_popularity album_release_d… <chr> <chr> <chr> <dbl> <date> 1 Beyoncé Savage Remix (… Savage Remix (… 83 2020-04-29 2 Taylor Swift cardigan folklore 85 NA 3 Drake Laugh Now Cry … Laugh Now Cry … 95 2020-08-14 4 Beyoncé Halo I AM…SASHA FIE… NA 2008-11-14 5 Ariana Grande Stuck with U (… Stuck with U NA 2020-05-08 ``` ] ] `full_join()` includes all rows that are in x **or** y --- .panelset[ .panel[ .panel-name[artists] ```r artists ``` ``` # A tibble: 4 × 2 name followers <chr> <dbl> 1 Beyoncé 24757958 2 Taylor Swift 33098116 3 Ariana Grande 51807131 4 Drake 50252529 ``` ] .panel[ .panel-name[songs] ```r songs ``` ``` # A tibble: 5 × 4 name song_name album_name song_popularity <chr> <chr> <chr> <dbl> 1 Beyoncé Savage Remix (feat. Beyoncé) Savage Rem… 83 2 Taylor Swift cardigan folklore 85 3 Drake Laugh Now Cry Later (feat. Lil Durk) Laugh Now … 95 4 Beyoncé Halo I AM…SASHA… NA 5 Ariana Grande Stuck with U (with Justin Bieber) Stuck with… NA ``` ] .panel[ .panel-name[albums] ```r albums ``` ``` # A tibble: 4 × 2 album_name album_release_date <chr> <date> 1 Savage Remix (feat. Beyoncé) 2020-04-29 2 I AM…SASHA FIERCE 2008-11-14 3 Stuck with U 2020-05-08 4 Laugh Now Cry Later (feat. Lil Durk) 2020-08-14 ``` ] .panel[ .panel-name[combined] ```r full_join(artists, songs, by = "name") %>% full_join(albums, by = "album_name") ``` ``` # A tibble: 5 × 6 name followers song_name album_name song_popularity album_release_d… <chr> <dbl> <chr> <chr> <dbl> <date> 1 Beyoncé 24757958 Savage Re… Savage Re… 83 2020-04-29 2 Beyoncé 24757958 Halo I AM…SASH… NA 2008-11-14 3 Taylor Swift 33098116 cardigan folklore 85 NA 4 Ariana Grande 51807131 Stuck wit… Stuck wit… NA 2020-05-08 5 Drake 50252529 Laugh Now… Laugh Now… 95 2020-08-14 ``` ] ] --- class: middle Complete the questions provided to you in the lecture notes.