We are going to look at more transportation data, this time from the city of Chicago. We want to see information on the daily ridership of CTA stations.
<- "https://data.cityofchicago.org/api/views/5neh-572f/rows.csv?accessType=DOWNLOAD&bom=true&format=true"
url
<- read_csv(url) all_stations_raw
## Rows: 1136863 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): stationname, date, daytype
## dbl (1): station_id
## num (1): rides
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Column Actions
Taking a look at the data, there are a few columns we need to figure
out how to deal with. First, we can convert date
from text
date to date-time data. Then we can drop the station ID, then we can
turn the daytype
variable into a factor with three levels
(“weekday”, “saturday”, “sunday/holiday”).
<- all_stations_raw %>%
all_stations # Step 2: get columns and rename stationname
select(station = stationname, date, rides, day_type = daytype) %>%
# Step 3: Convert the character date field to a date encoding.
# This uses the "lubridate" package
# Also, put the data in units of 1K rides
# Convert day_type to a factor and rename the levels
mutate(
date = mdy(date),
rides = rides / 1000,
day_type = recode(
as.factor(day_type),
"W" = "weekday", "A" = "saturday", "U" = "sunday/holiday"
) )
Column actions can also help us select all columns whose names have a
certain pattern, e.g., starts_with
. Here we select all
variables of character type:
%>%
all_stations select(where(is_character))
## # A tibble: 1,136,863 × 1
## station
## <chr>
## 1 Jefferson Park
## 2 Cermak-Chinatown
## 3 Central-Lake
## 4 Dempster-Skokie
## 5 Dempster
## 6 Lake/State
## 7 Oak Park-Forest Park
## 8 Kedzie-Homan-Forest Park
## 9 35th/Archer
## 10 Addison-North Main
## # … with 1,136,853 more rows
Let’s talk about piping!
The pipe operator (either %>%
or |>
)
is nice way to nest operations in a readable way. The pipe operator
takes what is on the left of the pipe, and uses (pipes) it to
the first argument of what is on the right.
For instance, the following two pieces of code are equivalent:
# with pipe
%>%
all_stations mutate(month = month(date)) %>%
glimpse()
## Rows: 1,136,863
## Columns: 5
## $ station <chr> "Jefferson Park", "Cermak-Chinatown", "Central-Lake", "Dempst…
## $ date <date> 2017-12-22, 2017-12-18, 2017-12-02, 2017-12-19, 2017-12-03, …
## $ rides <dbl> 6.104, 3.636, 1.270, 1.759, 0.499, 8.615, 0.442, 1.353, 3.353…
## $ day_type <fct> weekday, weekday, saturday, weekday, sunday/holiday, saturday…
## $ month <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1…
#no pipe
glimpse( mutate(all_stations, month = month(date)) )
## Rows: 1,136,863
## Columns: 5
## $ station <chr> "Jefferson Park", "Cermak-Chinatown", "Central-Lake", "Dempst…
## $ date <date> 2017-12-22, 2017-12-18, 2017-12-02, 2017-12-19, 2017-12-03, …
## $ rides <dbl> 6.104, 3.636, 1.270, 1.759, 0.499, 8.615, 0.442, 1.353, 3.353…
## $ day_type <fct> weekday, weekday, saturday, weekday, sunday/holiday, saturday…
## $ month <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1…
If you look at the help file for glimpse
and
mutate
, both take a data frame as the first argument.
Finally, pipes interact nicely with some summary functions like this
helpful skim
function from the skimr
package.
library(skimr)
%>%
all_stations skim()
Name | Piped data |
Number of rows | 1136863 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
character | 1 |
Date | 1 |
factor | 1 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
station | 0 | 1 | 4 | 24 | 0 | 148 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date | 0 | 1 | 2001-01-01 | 2022-10-31 | 2011-12-21 | 7974 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
day_type | 0 | 1 | FALSE | 3 | wee: 794105, sun: 180929, sat: 161829 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
rides | 0 | 1 | 3.07 | 3.12 | 0 | 0.97 | 2.02 | 4.04 | 36.32 | ▇▁▁▁▁ |
Row Actions
Row actions enable us to subset the rows. The most common function here si the filter function.
%>%
all_stations filter(day_type == "weekday")
## # A tibble: 794,105 × 4
## station date rides day_type
## <chr> <date> <dbl> <fct>
## 1 Jefferson Park 2017-12-22 6.10 weekday
## 2 Cermak-Chinatown 2017-12-18 3.64 weekday
## 3 Dempster-Skokie 2017-12-19 1.76 weekday
## 4 35th/Archer 2017-12-07 3.35 weekday
## 5 Addison-North Main 2017-12-19 6.03 weekday
## 6 Chicago/State 2017-12-27 9.64 weekday
## 7 Wellington 2017-12-07 3.21 weekday
## 8 East 63rd-Cottage Grove 2017-12-26 0.613 weekday
## 9 Grand/State 2017-12-21 10.7 weekday
## 10 Wilson 2017-12-22 4.86 weekday
## # … with 794,095 more rows
To filter on more than one condition, we can use &
(and) or |
(or). An &
statement will be
true only if both sides of the statement are true, and an |
statement is true if either side is true:
print(TRUE & FALSE)
## [1] FALSE
print(TRUE & TRUE)
## [1] TRUE
print(FALSE | TRUE)
## [1] TRUE
%>%
all_stations filter(day_type == "weekday" & year(date) > 2020)
## # A tibble: 66,781 × 4
## station date rides day_type
## <chr> <date> <dbl> <fct>
## 1 Harlem-Lake 2022-06-27 1.70 weekday
## 2 Harlem-Lake 2022-06-28 1.81 weekday
## 3 Harlem-Lake 2022-06-29 1.62 weekday
## 4 Harlem-Lake 2022-06-30 1.81 weekday
## 5 Pulaski-Lake 2022-06-01 0.728 weekday
## 6 Pulaski-Lake 2022-06-02 0.667 weekday
## 7 Pulaski-Lake 2022-06-03 0.794 weekday
## 8 Pulaski-Lake 2022-06-06 0.672 weekday
## 9 Pulaski-Lake 2022-06-07 0.716 weekday
## 10 Pulaski-Lake 2022-06-08 0.714 weekday
## # … with 66,771 more rows
Another common action is the arrange
function. Use this
to find the most recent date in the data.
%>%
all_stations arrange(desc(date)) %>%
head()
## # A tibble: 6 × 4
## station date rides day_type
## <chr> <date> <dbl> <fct>
## 1 Austin-Forest Park 2022-10-31 0.729 weekday
## 2 Harlem-Lake 2022-10-31 1.6 weekday
## 3 Pulaski-Lake 2022-10-31 0.588 weekday
## 4 Quincy/Wells 2022-10-31 3.03 weekday
## 5 Davis 2022-10-31 1.33 weekday
## 6 Belmont-O'Hare 2022-10-31 2.69 weekday
Operations on groups
Grouping is very powerful! Let’s say I am not so interested in the
daily ridership on the CTA. All the ride data is spread across different
stations! But if we group by date
, then we can call summary
functions that compute statistics within each group.
%>%
all_stations group_by(date) %>%
#compute two new vars: total_rides to count up all the rides in each group, and num to see how big each group is
summarize(total_rides = sum(rides), num = n())
## # A tibble: 7,974 × 3
## date total_rides num
## <date> <dbl> <int>
## 1 2001-01-01 106. 141
## 2 2001-01-02 419. 141
## 3 2001-01-03 448. 141
## 4 2001-01-04 459. 141
## 5 2001-01-05 466. 141
## 6 2001-01-06 213. 141
## 7 2001-01-07 142. 141
## 8 2001-01-08 493. 141
## 9 2001-01-09 501. 141
## 10 2001-01-10 503. 141
## # … with 7,964 more rows
Read the book to find other helpful functions to use within summarize, e.g. to calculate the mean, standard deviation, etc.
Saving the result of the pipe
Notice that all of these actions didn’t change the data frame! If we
want to save the results, we have to assign the result to a new variable
using <-
.
Putting it all together
Let’s see if we can plot ridership over time. For simplicity, let’s
look only at rides since 2018. We can even use the pipe with
ggplot
, but once you call ggplot
, make sure to
switch to +
when you add layers!
%>%
all_stations filter(year(date) > 2018) %>%
group_by(date) %>%
summarize(rides = sum(rides)) %>%
ggplot(aes(x=date, y=rides)) +
geom_line()
This isn’t very nice to look at! That’s because there are \(4 \times 365\) days to plot on here. Let’s see if we can use our skills to get information on weekly ridership. Here’s an extra variable to get you started, which gives the week number of the year.
<- all_stations %>%
all_stations mutate(week = isoweek(date))
What do you notice on your plot? How do you explain the patterns you are seeing. Are there any deficencies in this method for finding weekly ridership, and can you think of any better ways to go? Talk to your group, and let me know what you think!