Activity: Data Import

2023-01-17

Getting Data

Data come in all shapes, sizes, and formats. Let’s say we are interested in understanding trends in how Americans get to work. The American Community Survey is a yearly demographics survey program conducted by the U.S. Census Bureau, “gathering information on ancestry, citizenship, educational attainment, income, language proficiency, migration, disability, employment, and housing characteristics” (quoted from Wikipedia), and it contains just such information.

We’re going to import data from this webpage.

  1. Make a new markdown document to keep track of your work. Title it “Data Import Activity” and list your name as the author. You don’t have to list out your answers to each question, but the questions are numbered because the steps build on each other.

  2. Think about where you should save this markdown file. If you followed the instructions from the first activity and created a sensible file structure, this will be obvious. If not, go back and look at the first activity.

  3. Download the data from the two tables titled “GCT0802” and “GCT0804” as .csv files. Make sure you put the files in the same folder as your markdown document; otherwise, you will get an error at the next step.

  4. Import each file using read_csv, and save the resulting data frames into the variables alone_2021 and transit_2021, respectively. What problems do you encounter?

  5. What type of data does R think is in each column? Does it correspond to what you think should be there?
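The import and type-checking steps can be sketched on a miniature inline CSV. The two rows below are made up, but the column names match the activity’s files, and they reproduce the key surprise: ESTIMATE imports as a number, while MG_ERROR imports as character because of its “+/-” prefix.

```r
# Minimal sketch of steps 4-5, using inline literal data in place of
# the downloaded file (values are hypothetical; column names are real).
library(readr)

demo <- read_csv(I(
"TBLID,GEOID,GEONAME,PROFLN,ESTIMATE,MG_ERROR
GCT0802,0400000US01,Alabama,1,84.7,+/-0.3
GCT0802,0400000US02,Alaska,1,67.3,+/-1.2"
))

# spec() reports the type readr guessed for each column
spec(demo)
```

Compare what spec() reports against what you expect each column to contain; any mismatch is a clue for the renaming steps below.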

Renaming the columns

In each data frame, the variable we are interested in is ESTIMATE, which contains an estimate of the percentage of the population in each region with the characteristic (either “commutes by car/truck/van alone” or “public transit (excluding taxicabs)”). Let’s rename these columns to carry more information.

  1. Install and load the janitor library. Use it to clean the variable names from each file; see the “Practical Advice” section from the R4DS book.

  2. Use the mutate function to rename the two estimate columns to commute_alone_pct and commute_transit_pct, respectively.

  3. Once you figure that out, rename the two mg_error columns to commute_alone_mg_error and commute_transit_mg_error, respectively, and convert the characters to positive numbers. You can find a demo of how to do this in the same book section.
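A sketch of the cleaning and renaming steps on the same kind of hypothetical mini-table. The functions are the ones named above (janitor::clean_names() and readr::parse_number()); the detail worth noticing is that parse_number() reads a string like "+/-0.3" as -0.3, so negating the result recovers a positive margin of error.

```r
# Sketch of steps 1-3, assuming the tidyverse and janitor are installed.
library(readr)
library(dplyr)
library(janitor)

demo <- read_csv(I(
"TBLID,GEOID,GEONAME,PROFLN,ESTIMATE,MG_ERROR
GCT0802,0400000US01,Alabama,1,84.7,+/-0.3"
))

demo %>%
  clean_names() %>%   # TBLID -> tblid, MG_ERROR -> mg_error, etc.
  mutate(
    commute_alone_pct      = estimate,
    # parse_number("+/-0.3") keeps only the number it finds, -0.3,
    # so negating it yields the positive margin, 0.3
    commute_alone_mg_error = -parse_number(mg_error)
  )
```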

“Wrangling” and joining

Since we aren’t interested in most of the columns, let’s use a couple of new functions to select just the variables we want.

  1. Use the select function to pick out just the columns with geoid, geoname, and the percentage and margin of error columns.
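The select() call can either name the columns explicitly or use a tidyselect helper; this sketch runs both on a hypothetical one-row frame that already has the cleaned column names.

```r
library(dplyr)

# Hypothetical stand-in for the cleaned data frame
demo <- tibble(
  tblid = "GCT0802", geoid = "0400000US01", geoname = "Alabama",
  profln = 1, commute_alone_pct = 84.7, commute_alone_mg_error = 0.3
)

# Either spell the columns out...
demo %>% select(geoid, geoname, commute_alone_pct, commute_alone_mg_error)

# ...or grab every column whose name starts with "commute"
demo %>% select(geoid, geoname, starts_with("commute"))
```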

Past Data

  1. Now use the following webpage to get the same data from 2019, and import them as alone_2019 and transit_2019. Repeat the process of cleaning the names and parsing the numbers. For reference, here is the code we used for the 2021 files:
alone_2021 <- read_csv("GCT0802.csv", comment="TITLE:") %>% 
  clean_names() %>%
  mutate(
    commute_alone_pct = estimate,
    commute_alone_mg_error = -parse_number(mg_error)
  ) %>%
  select(geoid, geoname, commute_alone_pct, commute_alone_mg_error)
## Parsed with column specification:
## cols(
##   TBLID = col_character(),
##   GEOID = col_character(),
##   GEONAME = col_character(),
##   PROFLN = col_double(),
##   ESTIMATE = col_double(),
##   MG_ERROR = col_character()
## )
transit_2021 <- read_csv("GCT0804.csv", comment="TITLE:") %>% 
  clean_names() %>%
  mutate(
    commute_transit_pct = estimate,
    commute_transit_mg_error = -parse_number(mg_error) 
  ) %>%
  select(geoid, geoname, commute_transit_pct, commute_transit_mg_error)
## Parsed with column specification:
## cols(
##   TBLID = col_character(),
##   GEOID = col_character(),
##   GEONAME = col_character(),
##   PROFLN = col_double(),
##   ESTIMATE = col_double(),
##   MG_ERROR = col_character()
## )

Putting it all together

Finally, we want to bring all this information together. The first set of lines in the code below joins the two tables from each year; the second set gives two ways to combine everything into one data frame.

  1. What are the advantages or disadvantages of storing the information in each way? Does either adhere to the “tidy” data principles?
commute_2019 <- alone_2019 %>% 
  inner_join(transit_2019, by=c("geoid", "geoname")) %>% 
  mutate(year = "2019", .after=geoname)
commute_2021 <- alone_2021 %>% 
  inner_join(transit_2021, by=c("geoid", "geoname")) %>% 
  mutate(year = "2021", .after=geoname)

# Way 1
commute_1 <- commute_2021 %>% bind_rows(commute_2019)
# Way 2
commute_2 <- commute_2019 %>% 
  select(-year) %>% 
  left_join(
    commute_2021 %>% select(-year), 
    by=c("geoname"), 
    suffix = c("_2019", "_2021")
  ) %>%
  select(-contains("geoid"))

We’ll use this data frame in our next activity, too, so go ahead and save it using the command write_csv(commute_1, "commute.csv").
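The save step can be sanity-checked by reading the file straight back. This sketch uses a tiny made-up stand-in for commute_1; one gotcha worth noting is that year, stored as a character, will come back as a number unless you supply col_types when re-importing.

```r
# Round-trip sketch: write a toy frame, read it back, compare.
library(readr)
library(dplyr)

# Hypothetical stand-in for commute_1 (real column names, made-up values)
commute_demo <- tibble(
  geoid = "0400000US01", geoname = "Alabama", year = "2021",
  commute_alone_pct = 84.7, commute_transit_pct = 0.8
)

write_csv(commute_demo, "commute_demo.csv")

# The numeric columns round-trip exactly; year is re-guessed as a double
read_csv("commute_demo.csv")
```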