Getting Data
Data come in all shapes, sizes, and formats. Let’s say we are interested in understanding trends in how Americans get to work. The American Community Survey is a yearly demographics survey program conducted by the U.S. Census Bureau, gathering information on ancestry, citizenship, educational attainment, income, language proficiency, migration, disability, employment, and housing characteristics, (quoted from wikipedia) and contains just such information.
We’re going to import data from this webpage.
Make a new markdown document to keep track of your work. Title it “Data Import Activity” and list your name as the author. You don’t have to list out your answers to each question, but the questions are numbered because the steps build on each other.
Think about where you should save this markdown file. If you followed the instructions from the first activity and created a sensible file structure, this will be obvious. If not, go back and look at the first activity.
Download the data from the two tables titled “GCT0802” and “GCT0804” as
.csv
files. Make sure you put the file in the same folder as your markdown document, otherwise, you will get an error at the next step.Import each file using
read_csv
, and save the resulting data frames into the variablesalone_2021
andtransit_2021
, respectively. What problems do you encounter?What type of data does
R
think is in each column? Does it correspond to what you think should be there?
Renaming the columns
The variable we are interested in in each data frame is the
ESTIMATE
, which contains an estimate of the percentage of
the population in each region with the characteristic (either “commutes
by car/truck/van alone” or “public transit (excluding taxicabs)”). Let’s
rename these columns to have more information.
Install and load the
janitor
library. Use it to clean the variable names from each file; see the “Practical Advice” section from the R4DS book.Use the mutate function to rename the two
estimate
columns tocommute_alone_pct
orcommute_transit_pct
, respectively.Once you figure that out, rename the two
mg_error
columns tocommute_alone_mg_error
orcommute_transit_mg_error
, respectively, and convert the characters to positive numbers. You can find a demo on how to do this in the same book section.
“Wrangling” and joining
Since we aren’t interested in most of the columns, let’s use a couple
new functions to select
just the variables we want.
- Use the
select
function to pick out just the columns withgeoid
,geoname
, and the percentage and margin of error columns.
Past Data
- Now use the following webpage
to get the same data from 2019, and import them as
alone_2019
andtransit_2019
. Repeat the process of cleaning the names and parsing the numbers.
<- read_csv("GCT0802.csv", comment="TITLE:") %>%
alone_2021 clean_names() %>%
mutate(
commute_alone_pct = estimate,
commute_alone_mg_error = -parse_number(mg_error)
%>%
) select(geoid, geoname, commute_alone_pct, commute_alone_mg_error)
## Parsed with column specification:
## cols(
## TBLID = col_character(),
## GEOID = col_character(),
## GEONAME = col_character(),
## PROFLN = col_double(),
## ESTIMATE = col_double(),
## MG_ERROR = col_character()
## )
<- read_csv("GCT0804.csv", comment="TITLE:") %>%
transit_2021 clean_names() %>%
mutate(
commute_transit_pct = estimate,
commute_transit_mg_error = -parse_number(mg_error)
%>%
) select(geoid, geoname, commute_transit_pct, commute_transit_mg_error)
## Parsed with column specification:
## cols(
## TBLID = col_character(),
## GEOID = col_character(),
## GEONAME = col_character(),
## PROFLN = col_double(),
## ESTIMATE = col_double(),
## MG_ERROR = col_character()
## )
Putting it all together
Finally, we want to get all this information together. The first lines of this code join all the information from each year together. The second set of lines gives two ways to put all this information into one data frame.
- What are the advantages or disadvantages of storing the information in this way? Do either adhere to the “tidy” data principles?
<- alone_2019 %>%
commute_2019 inner_join(transit_2019, by=c("geoid", "geoname")) %>%
mutate(year = "2019", .after=geoname)
<- alone_2021 %>%
commute_2021 inner_join(transit_2021, by=c("geoid", "geoname")) %>%
mutate(year = "2021", .after=geoname)
# Way 1
<- commute_2021 %>% bind_rows(commute_2019)
commute_1 # Way 2
<- commute_2019 %>%
commute_2 select(-year) %>%
left_join(
%>% select(-year),
commute_2021 by=c("geoname"),
suffix = c("_2019", "_2020")
%>%
) select(-contains("geoid"))
We’ll use this dataframe for our next activity, too. So go ahead and
save it using the command
write_csv(commute_1, "commute.csv")
.