A typical data frame has columns that each represent a variable and rows that each represent an observation.
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
tail(mtcars)
                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,~
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,~
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16~
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180~
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,~
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.~
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18~
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,~
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,~
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,~
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,~
Note that the summary() function is useful beyond data frames.
summary(mtcars)
      mpg             cyl             disp             hp
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
 Median :19.20   Median :6.000   Median :196.3   Median :123.0
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
      drat             wt             qsec             vs
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
       am              gear            carb
 Min.   :0.0000   Min.   :3.000   Min.   :1.000
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
 Median :0.0000   Median :4.000   Median :2.000
 Mean   :0.4062   Mean   :3.688   Mean   :2.812
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
 Max.   :1.0000   Max.   :5.000   Max.   :8.000
Data Frames vs. Tibbles
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
Tibbles
# A tibble: 32 x 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
# ... with 26 more rows
Vectors in R
SibSp
Number of Siblings/Spouses Aboard on Titanic
# A tibble: 891 x 2
  SibSp  Fare
  <int> <dbl>
1     1  7.25
2     1 71.3
3     0  7.92
4     1 53.1
5     0  8.05
6     0  8.46
# ... with 885 more rows
# A tibble: 891 x 2
  Name                                                Ticket
  <chr>                                               <chr>
1 Braund, Mr. Owen Harris                             A/5 21171
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) PC 17599
3 Heikkinen, Miss. Laina                              STON/O2. 3101282
4 Futrelle, Mrs. Jacques Heath (Lily May Peel)        113803
5 Allen, Mr. William Henry                            373450
6 Moran, Mr. James                                    330877
# ... with 885 more rows
Do you have any children under eighteen?
# A tibble: 1,040 x 1
  children_under_18
  <lgl>
1 NA
2 TRUE
3 FALSE
4 FALSE
5 FALSE
6 TRUE
# ... with 1,034 more rows
a.k.a. categorical variable in statistics
# A tibble: 891 x 1
  Embarked
  <chr>
1 Southampton
2 Cherbourg
3 Southampton
4 Southampton
5 Southampton
6 Queenstown
# ... with 885 more rows
Is it rude to recline your seat on a plane?
# A tibble: 1,040 x 1
  recline_rude
  <ord>
1 <NA>
2 Somewhat
3 No
4 No
5 No
6 No
# ... with 1,034 more rows
Missing values are represented with NA in R. NULL represents anything that is undefined. Absence of a vector is often represented by NULL.
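As a minimal sketch of the difference between NA and NULL, consider the following base R example. The objects x and y are made up for illustration and are not part of the data sets used in these slides.

```r
# x is a hypothetical vector with one missing value
x <- c(4, NA, 16)
length(x)               # NA still occupies a position, so the length is 3
is.na(x)                # flags which elements are missing
mean(x)                 # NA, because one of the values is unknown
mean(x, na.rm = TRUE)   # 10, after dropping the missing value

# y is a hypothetical object that holds nothing at all
y <- NULL
length(y)               # 0
is.null(y)              # TRUE
length(c(4, NULL, 16))  # NULL vanishes inside c(), so the length is 2
```

In short, NA is a placeholder inside a vector, while NULL means there is no vector to begin with.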
Augmented vectors are atomic vectors that have additional metadata.
A factor is an integer vector with levels.
An ordered factor is an integer vector with ordered levels.
A date is a numeric vector.
A date-time is a numeric vector.
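One way to see the underlying atomic vector is to strip the metadata with base R functions such as typeof(), unclass(), and as.integer(). The sketch below uses made-up values; none of these objects come from the slides' data sets.

```r
# A factor stores integer codes plus a "levels" attribute
f <- factor(c("low", "high", "low"))
typeof(f)       # "integer"
levels(f)       # "high" "low" (sorted alphabetically by default)
as.integer(f)   # 2 1 2

# An ordered factor also records the order of its levels
o <- factor(c("No", "Somewhat", "No"),
            levels = c("No", "Somewhat", "Very"),
            ordered = TRUE)
is.ordered(o)   # TRUE
o[1] < o[2]     # TRUE, because "No" comes before "Somewhat"

# A date is the number of days since 1970-01-01
d <- as.Date("1970-01-10")
typeof(d)       # "double"
unclass(d)      # 9

# A date-time (POSIXct) is the number of seconds since 1970-01-01 00:00:00 UTC
dt <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
typeof(dt)      # "double"
as.numeric(dt)  # 60
```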
78:83
[1] 78 79 80 81 82 83
3.4:8.5
[1] 3.4 4.4 5.4 6.4 7.4 8.4
# A numeric vector
c(5, 7, 8)
[1] 5 7 8
# A character vector
c("Hello", "World", "today")
[1] "Hello" "World" "today"
# A character vector
c("Hello", "World", 5)
[1] "Hello" "World" "5"
Note that an atomic vector has only one type. Even if we combine numeric and character values in an atomic vector, R coerces all of them to a single type (here, character).
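The coercion can be checked with typeof(). The following base R sketch uses made-up values and shows that mixing types always moves toward the more flexible type, from logical to integer to double to character.

```r
typeof(c(TRUE, 2L))      # "integer": TRUE becomes 1L
typeof(c(2L, 3.5))       # "double"
typeof(c(3.5, "a"))      # "character"

# All three kinds of values end up as character strings
mixed <- c(TRUE, 5, "a")
mixed                    # "TRUE" "5" "a"
```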
seq(from = 2, to = 4, by = 0.3)
[1] 2.0 2.3 2.6 2.9 3.2 3.5 3.8
rep(1, times = 20)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Note that in the R output, the number in brackets at the beginning of each line is the index of the first vector element printed on that line.
names <- c("Menglin", "James", "Gloria")
names[2]
[1] "James"
names[2:3]
[1] "James" "Gloria"
names[-2]
[1] "Menglin" "Gloria"
mtcars$mpg
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
mtcars[1,] # selects first row
          mpg cyl disp  hp drat   wt  qsec vs am gear carb
Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4
mtcars[1:3,] # selects first through third row
               mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
mtcars[,5] # selects fifth column
 [1] 3.90 3.90 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 3.92 3.07 3.07 3.07 2.93
[16] 3.00 3.23 4.08 4.93 4.22 3.70 2.76 3.15 3.73 3.08 4.08 4.43 3.77 4.22 3.62
[31] 3.54 4.11
mtcars[1:2,5] # selects fifth column of first and second row
[1] 3.9 3.9
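The same bracket notation also accepts row and column names or a logical condition in place of positions. The sketch below is an illustrative extension of the examples above; the object names efficient and drat_df are made up.

```r
# Select by row name and column name instead of position
mtcars["Mazda RX4", "mpg"]            # 21

# Keep only the rows that satisfy a condition, and two named columns
efficient <- mtcars[mtcars$mpg > 30, c("mpg", "cyl")]
nrow(efficient)                       # 4 cars have mpg above 30

# Selecting a single column returns a vector by default;
# drop = FALSE keeps the result as a data frame
class(mtcars[, 5])                    # "numeric"
drat_df <- mtcars[, 5, drop = FALSE]
class(drat_df)                        # "data.frame"
```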
Lists
my_list <- list("Gloria", 7, c(8,15.56,15, 16), c(TRUE, FALSE), mtcars)
my_list
[[1]]
[1] "Gloria"

[[2]]
[1] 7

[[3]]
[1]  8.00 15.56 15.00 16.00

[[4]]
[1]  TRUE FALSE

[[5]]
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
my_list[[5]]
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
my_list[[5]][2]
                    cyl
Mazda RX4             6
Mazda RX4 Wag         6
Datsun 710            4
Hornet 4 Drive        6
Hornet Sportabout     8
Valiant               6
Duster 360            8
Merc 240D             4
Merc 230              4
Merc 280              6
Merc 280C             6
Merc 450SE            8
Merc 450SL            8
Merc 450SLC           8
Cadillac Fleetwood    8
Lincoln Continental   8
Chrysler Imperial     8
Fiat 128              4
Honda Civic           4
Toyota Corolla        4
Toyota Corona         4
Dodge Challenger      8
AMC Javelin           8
Camaro Z28            8
Pontiac Firebird      8
Fiat X1-9             4
Porsche 914-2         4
Lotus Europa          4
Ford Pantera L        8
Ferrari Dino          6
Maserati Bora         8
Volvo 142E            4
my_list[[3]][3]
[1] 15
small_list <- list(3, c(5, 2, 5.6))
long_list <- list("STATS 295", small_list)
long_list
[[1]]
[1] "STATS 295"

[[2]]
[[2]][[1]]
[1] 3

[[2]][[2]]
[1] 5.0 2.0 5.6
str(my_list)
List of 5
 $ : chr "Gloria"
 $ : num 7
 $ : num [1:4] 8 15.6 15 16
 $ : logi [1:2] TRUE FALSE
 $ :'data.frame': 32 obs. of 11 variables:
  ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
  ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
  ..$ disp: num [1:32] 160 160 108 258 360 ...
  ..$ hp  : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
  ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
  ..$ wt  : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
  ..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
  ..$ vs  : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
  ..$ am  : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
  ..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
  ..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
str(long_list)
List of 2
 $ : chr "STATS 295"
 $ :List of 2
  ..$ : num 3
  ..$ : num [1:3] 5 2 5.6
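A detail worth checking when working with lists is the difference between single and double brackets. The sketch below uses a smaller, named variant of the list above; the element names (name, number, scores) and the object named_list are made up for illustration.

```r
# A hypothetical named list
named_list <- list(name = "Gloria", number = 7, scores = c(8, 15.56, 15, 16))

# Single brackets return a (shorter) list
class(named_list[3])          # "list"

# Double brackets return the element itself
class(named_list[[3]])        # "numeric"

# Named elements can also be extracted with $ or [["name"]]
named_list$scores[3]          # 15
named_list[["scores"]][3]     # 15
```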
The pipe operator
Three solutions to a single problem
What is the average of 4, 8, and 16, approximately?
Answering this question takes three steps:
1. Combine 4, 8, and 16.
2. Take the average.
3. Round the result.
Solution 1: Functions within Functions
c(4, 8, 16)
[1] 4 8 16
mean(c(4, 8, 16))
[1] 9.333333
round(mean(c(4, 8, 16)))
[1] 9
Problem with writing functions within functions
Things will get messy and more difficult to read and debug as we deal with more complex operations on data.
Solution 2: Creating Objects
numbers <- c(4, 8, 16)
numbers
[1] 4 8 16
avg_number <- mean(numbers)
avg_number
[1] 9.333333
round(avg_number)
[1] 9
Problem with creating many objects
We will end up with too many objects in the Environment.
Solution 3: The (forward) Pipe Operator %>%
Shortcut: Ctrl (Command) + Shift + M
c(4, 8, 16) %>% mean() %>% round()
[1] 9
Combine 4, 8, and 16 and then
Take the mean and then
Round the output
The output of a function becomes the first argument of the next function.
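To make the "first argument" rule concrete, the sketch below passes an extra argument to round(). The choice of digits = 1 and the object names piped and nested are illustrative assumptions; the piped value fills the first argument of round(), and digits is supplied as usual.

```r
library(magrittr)

# The mean is passed as the first argument of round();
# digits = 1 is supplied as an additional argument
piped <- c(4, 8, 16) %>% mean() %>% round(digits = 1)

# The equivalent nested call
nested <- round(mean(c(4, 8, 16)), digits = 1)

piped                      # 9.3
identical(piped, nested)   # TRUE
```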
Recall composite functions such as (f∘g)(x)?
Now we have (f∘g∘h)(x), or round(mean(c(4, 8, 16))), which can be written with pipes as
h(x) %>% g() %>% f()
c(4, 8, 16) %>% mean() %>% round()
library(magrittr)
Treachery of Images by René Magritte
Image for Treachery of Images is from University of Alabama website and used under fair use for educational purposes.
Changing Variable Names and Types
glimpse(lapd)
Rows: 14,824
Columns: 3
$ `Department Title` <chr> "Police (LAPD)", "Police (LAPD)", "Police (LAPD)", ~
$ `Base Pay`         <dbl> 119321.60, 113270.70, 148116.00, 78676.87, 109373.6~
$ `Employment Type`  <chr> "Full Time", "Full Time", "Full Time", "Full Time",~
clean_names() (from the janitor package) changes variable names to be consistent with the tidyverse style.
clean_names(lapd)
# A tibble: 14,824 x 3
  department_title base_pay employment_type
  <chr>               <dbl> <chr>
1 Police (LAPD)     119322. Full Time
2 Police (LAPD)     113271. Full Time
3 Police (LAPD)     148116  Full Time
4 Police (LAPD)      78677. Full Time
5 Police (LAPD)     109374. Full Time
6 Police (LAPD)      95002. Full Time
# ... with 14,818 more rows
clean_names(lapd) %>%
  rename(dept_title = department_title)
# A tibble: 14,824 x 3
  dept_title    base_pay employment_type
  <chr>            <dbl> <chr>
1 Police (LAPD)  119322. Full Time
2 Police (LAPD)  113271. Full Time
3 Police (LAPD)  148116  Full Time
4 Police (LAPD)   78677. Full Time
5 Police (LAPD)  109374. Full Time
6 Police (LAPD)   95002. Full Time
# ... with 14,818 more rows
More than one variable can be renamed within a single rename() call.
clean_names(lapd) %>%
  rename(dept_title = department_title,
         emp_type = employment_type)
# A tibble: 14,824 x 3
  dept_title    base_pay emp_type
  <chr>            <dbl> <chr>
1 Police (LAPD)  119322. Full Time
2 Police (LAPD)  113271. Full Time
3 Police (LAPD)  148116  Full Time
4 Police (LAPD)   78677. Full Time
5 Police (LAPD)  109374. Full Time
6 Police (LAPD)   95002. Full Time
# ... with 14,818 more rows
The mutate() function helps create new variables or make changes to existing ones.
clean_names(lapd) %>%
  rename(dept_title = department_title,
         emp_type = employment_type) %>%
  mutate(emp_type2 = as.factor(emp_type))
# A tibble: 14,824 x 4
  dept_title    base_pay emp_type  emp_type2
  <chr>            <dbl> <chr>     <fct>
1 Police (LAPD)  119322. Full Time Full Time
2 Police (LAPD)  113271. Full Time Full Time
3 Police (LAPD)  148116  Full Time Full Time
4 Police (LAPD)   78677. Full Time Full Time
5 Police (LAPD)  109374. Full Time Full Time
6 Police (LAPD)   95002. Full Time Full Time
# ... with 14,818 more rows
We would not normally call the new variable emp_type2; instead, we would call it emp_type to overwrite the older version.
clean_names(lapd) %>%
  rename(dept_title = department_title,
         emp_type = employment_type) %>%
  mutate(emp_type = as.factor(emp_type))
# A tibble: 14,824 x 3
  dept_title    base_pay emp_type
  <chr>            <dbl> <fct>
1 Police (LAPD)  119322. Full Time
2 Police (LAPD)  113271. Full Time
3 Police (LAPD)  148116  Full Time
4 Police (LAPD)   78677. Full Time
5 Police (LAPD)  109374. Full Time
6 Police (LAPD)   95002. Full Time
# ... with 14,818 more rows
Conversions to other vector types are also possible with the following functions:
as.numeric()
as.double()
as.integer()
as.character()
as.logical()
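As a rough illustration of how these conversion functions behave, the base R sketch below applies them to made-up values. Values that cannot be converted become NA, which is worth checking after any type change.

```r
as.numeric("3.5")             # 3.5
as.integer(3.9)               # 3, the decimal part is dropped (not rounded)
as.character(5)               # "5"
as.logical(c("TRUE", "no"))   # TRUE NA, "no" is not a recognized logical string

# A string that is not a number becomes NA (R also issues a warning,
# suppressed here to keep the example quiet)
not_a_number <- suppressWarnings(as.numeric("abc"))
is.na(not_a_number)           # TRUE
```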
Why does the lapd object not reflect any of the data cleaning that we have done?
lapd
# A tibble: 14,824 x 3
  `Department Title` `Base Pay` `Employment Type`
  <chr>                   <dbl> <chr>
1 Police (LAPD)         119322. Full Time
2 Police (LAPD)         113271. Full Time
3 Police (LAPD)         148116  Full Time
4 Police (LAPD)          78677. Full Time
5 Police (LAPD)         109374. Full Time
6 Police (LAPD)          95002. Full Time
# ... with 14,818 more rows
We can overwrite the old lapd object by assigning the cleaned version to lapd.
lapd <- clean_names(lapd) %>%
  rename(dept_title = department_title,
         emp_type = employment_type) %>%
  mutate(emp_type = as.factor(emp_type))
lapd
# A tibble: 14,824 x 3
  dept_title    base_pay emp_type
  <chr>            <dbl> <fct>
1 Police (LAPD)  119322. Full Time
2 Police (LAPD)  113271. Full Time
3 Police (LAPD)  148116  Full Time
4 Police (LAPD)   78677. Full Time
5 Police (LAPD)  109374. Full Time
6 Police (LAPD)   95002. Full Time
# ... with 14,818 more rows
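The same behavior can be reproduced on a tiny example. The sketch below uses a made-up two-row data frame called pay (not the lapd data) and assumes the dplyr package is installed; it shows that rename() returns a modified copy and leaves the original untouched until the result is assigned.

```r
library(dplyr)

# A hypothetical data frame with a non-syntactic column name
pay <- data.frame(`Base Pay` = c(100, 200), check.names = FALSE)

# rename() returns a new data frame; pay itself is unchanged
renamed <- rename(pay, base_pay = `Base Pay`)
names(renamed)   # "base_pay"
names(pay)       # still "Base Pay"

# Assigning the result back to pay overwrites the old version
pay <- rename(pay, base_pay = `Base Pay`)
names(pay)       # "base_pay"
```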
Summarizing Numeric Variables
mean()
median()
sd()
var()
min()
max()
quantile()
summarize(lapd, mean(base_pay))
# A tibble: 1 x 1
  `mean(base_pay)`
             <dbl>
1           85149.
mean(lapd$base_pay)
[1] 85149.05
We can get multiple summaries with a single summarize() call.
summarize(lapd, mean(base_pay), median(base_pay))
# A tibble: 1 x 2
  `mean(base_pay)` `median(base_pay)`
             <dbl>              <dbl>
1           85149.             97601.
Note that the variable names in this table are not easy to read.
To display the variable names more legibly in the output, we can assign names to the numerical summaries (e.g. mean_base_pay).
summarize(lapd,
          mean_base_pay = mean(base_pay),
          med_base_pay = median(base_pay))
# A tibble: 1 x 2
  mean_base_pay med_base_pay
          <dbl>        <dbl>
1        85149.       97601.
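The other summary functions listed earlier can be combined in the same way. Since the lapd data may not be at hand, the sketch below applies them to the mpg column of the built-in mtcars data instead; the summary names (mean_mpg, med_mpg, and so on) are made up, and the dplyr package is assumed to be installed.

```r
library(dplyr)

mpg_summary <- summarize(
  mtcars,
  mean_mpg = mean(mpg),
  med_mpg  = median(mpg),
  sd_mpg   = sd(mpg),
  min_mpg  = min(mpg),
  max_mpg  = max(mpg),
  # quantile() returns a named value, so the name is dropped with unname()
  q1_mpg   = unname(quantile(mpg, 0.25))
)

mpg_summary
```

The result is a one-row data frame with one column per named summary, matching the values reported by summary(mtcars) for mpg.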