Working with Data in R

<br>
<br>
.right-panel[

# Working with Data in R
## Dr. Mine Dogucu
]

---

## Goals

- Data Frames (and Tibbles)
- Vectors (and lists)
- The pipe operator
- Changing variable names & types
- Summarizing variables

---

## Review

---

## Data Frames

A typical data frame has

`columns`, each of which represents a variable.  
`rows`, each of which represents an observation.
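
As a quick check of this layout, here is a minimal sketch that counts the rows and columns of the built-in `mtcars` data with base R functions:

```r
# mtcars ships with R: one row per car, one column per variable
n_obs  <- nrow(mtcars)  # number of observations (rows)
n_vars <- ncol(mtcars)  # number of variables (columns)

n_obs          # 32
n_vars         # 11
dim(mtcars)    # rows first, then columns: 32 11
names(mtcars)  # the variable names
```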

---

## Functions for Data Frames

```r
head(mtcars)
```

```
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

---

## Functions for Data Frames

```r
tail(mtcars)
```

```
                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
```

---

## Functions for Data Frames

```r
glimpse(mtcars)
```

```
Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,~
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,~
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16~
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180~
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,~
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.~
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18~
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,~
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,~
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,~
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,~
```

---

## Functions for Data Frames

Note that the `summary()` function is useful beyond data frames.

```r
summary(mtcars)
```

```
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  
```
---

### `Data Frames` vs. Tibbles

```
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
```

---

### Data Frames vs. `Tibbles`

```
# A tibble: 32 x 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
# ... with 26 more rows
```

---

## Numeric Vectors (integer and double)

`SibSp`: number of siblings/spouses aboard the Titanic

```
# A tibble: 891 x 2
  SibSp  Fare
  <int> <dbl>
1     1  7.25
2     1 71.3 
3     0  7.92
4     1 53.1 
5     0  8.05
6     0  8.46
# ... with 885 more rows
```
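
The `<int>` and `<dbl>` labels correspond to R's two numeric types. A small sketch, using illustrative values rather than the actual Titanic data, shows one way to check which type a vector has:

```r
# An L suffix creates an integer; a number without it is a double
sib_sp <- c(1L, 1L, 0L)        # illustrative integer values
fare   <- c(7.25, 71.3, 7.92)  # illustrative double values

typeof(sib_sp)      # "integer"
typeof(fare)        # "double"
is.numeric(sib_sp)  # TRUE: integers count as numeric
is.numeric(fare)    # TRUE: doubles count as numeric
typeof(5)           # "double", because there is no L suffix
```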

---

## Character Vectors

`Name` and `Ticket` of passengers aboard the Titanic

```
# A tibble: 891 x 2
  Name                                                Ticket          
  <chr>                                               <chr>           
1 Braund, Mr. Owen Harris                             A/5 21171       
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) PC 17599        
3 Heikkinen, Miss. Laina                              STON/O2. 3101282
4 Futrelle, Mrs. Jacques Heath (Lily May Peel)        113803          
5 Allen, Mr. William Henry                            373450          
6 Moran, Mr. James                                    330877          
# ... with 885 more rows
```

---

## Logical

Do you have any children under eighteen?

```
# A tibble: 1,040 x 1
  children_under_18
  <lgl>            
1 NA               
2 TRUE             
3 FALSE            
4 FALSE            
5 FALSE            
6 TRUE             
# ... with 1,034 more rows
```

---

## Factor

a.k.a. categorical variable in statistics

```
# A tibble: 891 x 1
  Embarked   
  <chr>      
1 Southampton
2 Cherbourg  
3 Southampton
4 Southampton
5 Southampton
6 Queenstown 
# ... with 885 more rows
```

---

## Ordered factor

Is it rude to recline your seat on a plane?

```
# A tibble: 1,040 x 1
  recline_rude
  <ord>       
1 <NA>        
2 Somewhat    
3 No          
4 No          
5 No          
6 No          
# ... with 1,034 more rows
```
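
An ordered factor can be built with `factor(..., ordered = TRUE)`. The sketch below is only an illustration: the level names and their order are assumptions, and the real survey scale may differ.

```r
# Hypothetical answer scale, from least to most rude
answers <- factor(
  c("Somewhat", "No", "No", NA),
  levels  = c("No", "Somewhat", "Very"),
  ordered = TRUE
)

is.ordered(answers)      # TRUE
levels(answers)          # "No" "Somewhat" "Very"
answers[2] < answers[1]  # TRUE: "No" ranks below "Somewhat"
```

Because the levels are ordered, comparisons such as `<` are meaningful, which is not the case for an unordered factor.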

---

## Vector Types in R

.footnote[Missing values are represented by `NA` in R. `NULL` represents an undefined object, and the absence of a vector is often represented by `NULL`.]
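
To make the difference between `NA` and `NULL` concrete, here is a short sketch in base R. The `ages` values are made up for illustration.

```r
ages <- c(34, NA, 51)  # hypothetical values; the second one is missing

length(ages)              # 3: NA still occupies a position
is.na(ages)               # FALSE TRUE FALSE
mean(ages)                # NA: the missing value propagates
mean(ages, na.rm = TRUE)  # 42.5

length(NULL)              # 0: NULL has no elements at all
is.null(NULL)             # TRUE
combined <- c(1, NULL, 3) # NULL disappears when combined
combined                  # 1 3
```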
---

## Augmented Vectors

Augmented vectors are atomic vectors that have additional metadata.

`factor` is an integer vector with levels.

`ordered factor` is an integer vector with ordered levels.

`date` is a numeric vector that counts days since 1970-01-01.

`date-time` is a numeric vector that counts seconds since 1970-01-01 00:00:00 UTC.
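
One way to see the underlying vector is to inspect it with `typeof()` and strip the metadata with `unclass()`. This is a minimal sketch in base R; the port names and dates are example values, and the date-time is assumed to be of class `POSIXct`.

```r
# A factor stores integer codes plus a "levels" attribute
port <- factor(c("Southampton", "Cherbourg", "Southampton"))
typeof(port)      # "integer"
levels(port)      # "Cherbourg" "Southampton"
as.integer(port)  # 2 1 2

# A Date stores a number of days
day <- as.Date("1970-01-10")
typeof(day)       # "double"
unclass(day)      # 9

# A POSIXct date-time stores a number of seconds
moment <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
as.numeric(moment)  # 60
```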

---

### Creating Vectors with Multiple Elements

```r
78:83
```

```
[1] 78 79 80 81 82 83
```

```r
3.4:8.5
```

```
[1] 3.4 4.4 5.4 6.4 7.4 8.4
```

---

### Creating Vectors with Multiple Elements

```r
# A numeric vector
c(5, 7, 8) 
```

```
[1] 5 7 8
```

```r
# A character vector
c("Hello", "World", "today") 
```

```
[1] "Hello" "World" "today"
```

```r
# A character vector
c("Hello", "World", 5) 
```

```
[1] "Hello" "World" "5"    
```

Note that an atomic vector has only one type. When numeric and character values are combined in the same vector, the numeric values are converted to character.
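
A minimal sketch of how this single-type rule plays out when types are mixed. In base R, `c()` converts every element to the most flexible type present, in the order logical, integer, double, character.

```r
typeof(c(TRUE, 2L))   # "integer": TRUE becomes 1L
typeof(c(2L, 3.5))    # "double"
typeof(c(3.5, "a"))   # "character"

mixed <- c(TRUE, 2L, 3.5, "a")
mixed                 # "TRUE" "2" "3.5" "a"

# Converting back to numbers has to be requested explicitly
back <- as.numeric(c("5", "7.5"))
back                  # 5.0 7.5
```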
---

### Creating Vectors with Multiple Elements

```r
seq(from = 2, to = 4, by = 0.3)
```

```
[1] 2.0 2.3 2.6 2.9 3.2 3.5 3.8
```

```r
rep(1, times = 20)
```

```
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
```

.footnote[Note that in the R output, the bracketed number at the beginning of each line, such as `[1]`, is the index of the first element printed on that line.]
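
Both functions accept other arguments as well. As a brief sketch, `seq()` can take a target length instead of a step size, and `rep()` can repeat a whole vector or each element in turn:

```r
# Five evenly spaced values between 2 and 4
evenly_spaced <- seq(from = 2, to = 4, length.out = 5)
evenly_spaced  # 2.0 2.5 3.0 3.5 4.0

# times repeats the whole vector; each repeats every element
whole_vector <- rep(c("a", "b"), times = 2)
whole_vector   # "a" "b" "a" "b"

each_element <- rep(c("a", "b"), each = 2)
each_element   # "a" "a" "b" "b"
```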

---

### Selecting elements of a vector

```r
names <- c("Menglin", "James", "Gloria")

names[2] 
```

```
[1] "James"
```

```r
names[2:3]
```

```
[1] "James"  "Gloria"
```

```r
names[-2]
```

```
[1] "Menglin" "Gloria" 
```
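
Beyond single positions, ranges, and negative indices, elements can also be selected with a vector of positions or with a logical vector. The sketch below reuses the same `names` vector:

```r
names <- c("Menglin", "James", "Gloria")

by_position <- names[c(1, 3)]             # positions 1 and 3
by_position                               # "Menglin" "Gloria"

by_logical <- names[c(TRUE, FALSE, TRUE)] # keep elements where TRUE
by_logical                                # "Menglin" "Gloria"

by_condition <- names[nchar(names) > 5]   # names longer than 5 letters
by_condition                              # "Menglin" "Gloria"

names[4]                                  # NA: there is no fourth element
```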

---