Data Wrangling

We are going to look at more transportation data, this time from the city of Chicago. We want to see information on the daily ridership of CTA stations.

url <- "https://data.cityofchicago.org/api/views/5neh-572f/rows.csv?accessType=DOWNLOAD&bom=true&format=true"

all_stations_raw <- read_csv(url)

## Rows: 1136863 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): stationname, date, daytype
## dbl (1): station_id
## num (1): rides
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Column Actions

Taking a look at the data, there are a few columns we need to figure out how to deal with. First, we can convert date from text date to date-time data. Then we can drop the station ID, then we can turn the daytype variable into a factor with three levels (“weekday”, “saturday”, “sunday/holiday”).

all_stations <- all_stations_raw %>% 
  # Step 2: get columns and rename stationname
  select(station = stationname, date, rides, day_type = daytype) %>% 
  # Step 3: Convert the character date field to a date encoding.
  # This uses the "lubridate" package
  # Also, put the data in units of 1K rides
  # Convert day_type to a factor and rename the levels
  mutate(
    date = mdy(date), 
    rides = rides / 1000,
    day_type = recode(
      as.factor(day_type), 
      "W" = "weekday", "A" = "saturday", "U" = "sunday/holiday"
      )
    )

Column actions can also help us select all columns whose names have a certain pattern, e.g., starts_with. Here we select all variables of character type:

all_stations %>%
  select(where(is_character))

## # A tibble: 1,136,863 × 1
##    station                 
##    <chr>                   
##  1 Jefferson Park          
##  2 Cermak-Chinatown        
##  3 Central-Lake            
##  4 Dempster-Skokie         
##  5 Dempster                
##  6 Lake/State              
##  7 Oak Park-Forest Park    
##  8 Kedzie-Homan-Forest Park
##  9 35th/Archer             
## 10 Addison-North Main      
## # … with 1,136,853 more rows

Let’s talk about piping!

The pipe operator (either %>% or |>) is nice way to nest operations in a readable way. The pipe operator takes what is on the left of the pipe, and uses (pipes) it to the first argument of what is on the right.

For instance, the following two pieces of code are equivalent:

# with pipe
all_stations %>%
  mutate(month = month(date)) %>%
  glimpse()

## Rows: 1,136,863
## Columns: 5
## $ station  <chr> "Jefferson Park", "Cermak-Chinatown", "Central-Lake", "Dempst…
## $ date     <date> 2017-12-22, 2017-12-18, 2017-12-02, 2017-12-19, 2017-12-03, …
## $ rides    <dbl> 6.104, 3.636, 1.270, 1.759, 0.499, 8.615, 0.442, 1.353, 3.353…
## $ day_type <fct> weekday, weekday, saturday, weekday, sunday/holiday, saturday…
## $ month    <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1…

#no pipe
glimpse( mutate(all_stations, month = month(date)) )

## Rows: 1,136,863
## Columns: 5
## $ station  <chr> "Jefferson Park", "Cermak-Chinatown", "Central-Lake", "Dempst…
## $ date     <date> 2017-12-22, 2017-12-18, 2017-12-02, 2017-12-19, 2017-12-03, …
## $ rides    <dbl> 6.104, 3.636, 1.270, 1.759, 0.499, 8.615, 0.442, 1.353, 3.353…
## $ day_type <fct> weekday, weekday, saturday, weekday, sunday/holiday, saturday…
## $ month    <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1…

If you look at the help file for glimpse and mutate, both take a data frame as the first argument.

Finally, pipes interact nicely with some summary functions like this helpful skim function from the skimr package.

library(skimr)
all_stations %>%
  skim()

Data summary
Name	Piped data
Number of rows	1136863
Number of columns	4
_______________________
Column type frequency:
character	1
Date	1
factor	1
numeric	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
station	0	1	4	24	0	148	0

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	0	1	2001-01-01	2022-10-31	2011-12-21	7974

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
day_type	0	1	FALSE	3	wee: 794105, sun: 180929, sat: 161829

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
rides	0	1	3.07	3.12	0	0.97	2.02	4.04	36.32	▇▁▁▁▁

Row Actions

Row actions enable us to subset the rows. The most common function here si the filter function.

all_stations %>%
  filter(day_type == "weekday")

## # A tibble: 794,105 × 4
##    station                 date        rides day_type
##    <chr>                   <date>      <dbl> <fct>   
##  1 Jefferson Park          2017-12-22  6.10  weekday 
##  2 Cermak-Chinatown        2017-12-18  3.64  weekday 
##  3 Dempster-Skokie         2017-12-19  1.76  weekday 
##  4 35th/Archer             2017-12-07  3.35  weekday 
##  5 Addison-North Main      2017-12-19  6.03  weekday 
##  6 Chicago/State           2017-12-27  9.64  weekday 
##  7 Wellington              2017-12-07  3.21  weekday 
##  8 East 63rd-Cottage Grove 2017-12-26  0.613 weekday 
##  9 Grand/State             2017-12-21 10.7   weekday 
## 10 Wilson                  2017-12-22  4.86  weekday 
## # … with 794,095 more rows

To filter on more than one condition, we can use & (and) or | (or). An & statement will be true only if both sides of the statement are true, and an | statement is true if either side is true:

print(TRUE & FALSE)

## [1] FALSE

print(TRUE & TRUE)

## [1] TRUE

print(FALSE | TRUE)

## [1] TRUE

all_stations %>%
  filter(day_type == "weekday" & year(date) > 2020)

## # A tibble: 66,781 × 4
##    station      date       rides day_type
##    <chr>        <date>     <dbl> <fct>   
##  1 Harlem-Lake  2022-06-27 1.70  weekday 
##  2 Harlem-Lake  2022-06-28 1.81  weekday 
##  3 Harlem-Lake  2022-06-29 1.62  weekday 
##  4 Harlem-Lake  2022-06-30 1.81  weekday 
##  5 Pulaski-Lake 2022-06-01 0.728 weekday 
##  6 Pulaski-Lake 2022-06-02 0.667 weekday 
##  7 Pulaski-Lake 2022-06-03 0.794 weekday 
##  8 Pulaski-Lake 2022-06-06 0.672 weekday 
##  9 Pulaski-Lake 2022-06-07 0.716 weekday 
## 10 Pulaski-Lake 2022-06-08 0.714 weekday 
## # … with 66,771 more rows

Another common action is the arrange function. Use this to find the most recent date in the data.

all_stations %>%
  arrange(desc(date)) %>%
  head()

## # A tibble: 6 × 4
##   station            date       rides day_type
##   <chr>              <date>     <dbl> <fct>   
## 1 Austin-Forest Park 2022-10-31 0.729 weekday 
## 2 Harlem-Lake        2022-10-31 1.6   weekday 
## 3 Pulaski-Lake       2022-10-31 0.588 weekday 
## 4 Quincy/Wells       2022-10-31 3.03  weekday 
## 5 Davis              2022-10-31 1.33  weekday 
## 6 Belmont-O'Hare     2022-10-31 2.69  weekday

Operations on groups

Grouping is very powerful! Let’s say I am not so interested in the daily ridership on the CTA. All the ride data is spread across different stations! But if we group by date, then we can call summary functions that compute statistics within each group.

all_stations %>%
  group_by(date) %>%
  #compute two new vars: total_rides to count up all the rides in each group, and num to see how big each group is
  summarize(total_rides = sum(rides), num = n())

## # A tibble: 7,974 × 3
##    date       total_rides   num
##    <date>           <dbl> <int>
##  1 2001-01-01        106.   141
##  2 2001-01-02        419.   141
##  3 2001-01-03        448.   141
##  4 2001-01-04        459.   141
##  5 2001-01-05        466.   141
##  6 2001-01-06        213.   141
##  7 2001-01-07        142.   141
##  8 2001-01-08        493.   141
##  9 2001-01-09        501.   141
## 10 2001-01-10        503.   141
## # … with 7,964 more rows

Read the book to find other helpful functions to use within summarize, e.g. to calculate the mean, standard deviation, etc.

Saving the result of the pipe

Notice that all of these actions didn’t change the data frame! If we want to save the results, we have to assign the result to a new variable using <-.

Putting it all together

Let’s see if we can plot ridership over time. For simplicity, let’s look only at rides since 2018. We can even use the pipe with ggplot, but once you call ggplot, make sure to switch to + when you add layers!

all_stations %>% 
  filter(year(date) > 2018) %>%
  group_by(date) %>% 
  summarize(rides = sum(rides)) %>%
  ggplot(aes(x=date, y=rides)) +
  geom_line()

This isn’t very nice to look at! That’s because there are \(4 \times 365\) days to plot on here. Let’s see if we can use our skills to get information on weekly ridership. Here’s an extra variable to get you started, which gives the week number of the year.

all_stations <- all_stations %>%
  mutate(week = isoweek(date))

What do you notice on your plot? How do you explain the patterns you are seeing. Are there any deficencies in this method for finding weekly ridership, and can you think of any better ways to go? Talk to your group, and let me know what you think!