# default_exp acsDownload

This Coding Notebook is the first in a series.

An interactive version can be found here: Open In Colab.

This Colab and more can be found on our webpage.

  • Content covered in previous tutorials will be used in later tutorials.

  • New code and/or information will have explanations and/or descriptions attached.

  • Concepts or code covered in previous tutorials will be used without being explained in their entirety.

  • The Dataplay Handbook is developed using techniques covered in the Datalabs Guidebook.

  • If content cannot be found in the current tutorial and is not covered in previous tutorials, please let me know.

  • This notebook has been optimized for Google Colab run in a Chrome browser.

  • Statements found on the index page regarding views expressed, responsibility, errors and omissions, use at your own risk, and licensing extend throughout this tutorial.


About this Tutorial:

What's inside?

The Tutorial

In this notebook, the basics of Colab are introduced.

  • We will explore ACS data catalogs to locate data we like
  • We will programmatically download data from the American Community Survey (ACS)
    • Examples will use ACS table B19001, Baltimore City 2017 estimates
  • We will rework the datasets to be human-friendly

Objectives

By the end of this tutorial users should have an understanding of:

  • Google Colabs
  • Census Data
  • Exploring, retrieving, and cleaning ACS data programmatically
  • The retrieve_acs_data() function, and how to use it in future projects

Background

Using Colabs:

Instructions: Read all text and execute all code in order.

How to execute code:

  • Locate labels taking the form: 'Run: (A Short Description)'
  • Left of this text you will see an open bracket [ ], possibly with a number inside it.
    • Hovering over the brackets will reveal a play button.
    • Click the button to execute code.
  • Alternately: Click on a box with code and hit 'Shift' + 'Enter'

If you would like to see the code you are executing, double click the label 'Run: '. Code is accompanied by brief inline descriptions.

Try It! Go ahead and try running the cell below. The result is a flow chart of how this tutorial may be used.

About The Census Data

Census or ACS Data?

Census data comes in 2 flavors:


  1. American Community Survey (ACS)
  • First released to the public in 2006
  • Derived using 5-Year Estimates (the past 5 years of data)
  • Delivered annually
  • Delivered at the Tract Level (READ -> 'Geographic Granularity')
  • Margins of error generally prohibit a more granular view
  • ACS Data is accessible programmatically (READ -> 'ACS Programmatic Retrieval') (READ -> 'Geographic Reference Code')
  2. Decennial Census
  • Estimates using 10 years of ACS data
  • Decennial Census data are created, in part, by ACS data.
  • Delivered once every 10 years.
  • Delivered at the Block Level. Most accurate.

Geographic Granularity

Census data can come in a variety of levels.

These levels define the specificity of the data.

I.e. whether a dataset reports on individual communities or on entire cities is contingent on the data's granularity.

The data we will be downloading in this tutorial, ACS Data, can be found at the Tract level and no closer.

Aggregating Tracts is the way BNIA calculates some of their yearly community indicators!

Each of the bolded terms in the list below is a level identifiable through a 'Geographic Reference Code'.

For more information on Geographic Reference Codes, refer to the table of contents for the section on that matter.

  • A census block is the smallest unit of measurement used by the Census
  • Information by census block is only available decennially (i.e. not in ACS data)
  • Block groups are the next smallest unit of measurement used by the census and are composed of aggregate census blocks
  • Census tracts are composed of block groups and are the next largest unit of measurement used by the ACS
  • County, city and census designated places are composed of Tracts

Run the following code to see how these different levels nest into each other!
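
Since the hidden Colab cell itself is not reproduced on this page, here is a minimal illustrative sketch (not the original flow-chart cell) of how the levels described above nest, from largest to smallest:

nesting = [
    'State',
    'County / City / Census Designated Place',
    'Census Tract',
    'Block Group',
    'Census Block',
]
for depth, level in enumerate(nesting):
    # Indent each level a little further to show containment
    print('  ' * depth + level)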

Geographic Reference Codes

State, County, and Tract IDs are called Geographic Reference Codes.

This information is crucial to know when accessing data.

In order to successfully pull data, Census State and County Codes must be provided.

The code herein is configured by default to pull data on Baltimore City, MD and its constituent Tracts (an example of these codes appears after the steps below).

In order to find your State and County code:


Either

A) Here, where, upon entering a unique address, you can locate state and county codes under the associated values 'Counties' and 'State'

OR

B) Alternatively, click here

  • The Geographies mainpage contains a lot of data assets.
  • The link to these tallies was located by accessing the geographical references subdirectory of the Geographies mainpage and then filtering for publications made in the year 2010 (we are using the 2010 census boundaries)
  • Once clicked, simply scroll down until you find the header 'Tallies of Geographic Entities By State'
  • Enter your state and press enter.
  • You will be redirected to a plain text file containing all the information on your state and its counties.
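
For illustration, these are the Geographic Reference Codes that this tutorial's default configuration (Baltimore City, MD) corresponds to; verify them against the tallies described above before reusing them:

state  = '24'   # Maryland state FIPS code
county = '510'  # Baltimore City county FIPS code (an independent city treated as a county)
tract  = '*'    # wildcard: all tracts within the county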

Working with the ACS Data

Searching for a dataset is the first step in the data processing pipeline.

In this tutorial we plan on processing ACS data in a programmatic fashion.

This tutorial will not just allow you to search and explore ACS tables and inspect their contents (attributes), but also to download, format, and clean them!

Search Advice

Although a table explorer section is provided below, it is suggested that you instead explore the available data tables and retrieve their IDs using the dedicated websites described below:

American Fact Finder may assist you in your data locating and download needs:

Fact Finder provides a nice interface to explore available datasets. From Fact Finder you can grab a table's ID and continue the tutorial. Alternately, from Fact Finder you can download the data for your community directly via an interface. From there, you may continue the tutorial by loading the downloaded dataset as an external resource; instructions on how to do this are provided further below in this tutorial.

Update : 12/18/2019 " American FactFinder (AFF) will remain as an "archive" system for accessing historical data until spring 2020. " - American Fact Finder Website

The New American Fact Finder

This new website is provided by the Census Bureau. Within its 'Advanced Search' feature exist all the filtering abilities of the older, deprecated (soon to be discontinued) American Fact Finder website. It is still a bit buggy to date and may not apply all filters. Filters include years (you can only pick one year at a time), geography (state, county, tract), topic, surveys, and Table ID. The filters you apply are shown at the bottom of the query, and submitting the search will yield data tables ready for download as well as Table IDs that you may snag for use in this tutorial.

ACS Programmatic Retrieval

Developer Resources

Notes on the Census API

Tutorial Notes:

  • Detail and Subject tables are derived using the 5-year ACS data.

    • As a reminder, ACS 5-year estimates arrive at the tract level
  • These tables are created by the census and are pre-compiled views of the data.

  • The Detail Tables contain all possible ACS Data.

  • The Subject Tables contain ACS data in convenient groups

  • BNIA creates its data mostly using the Detail Tables, but sometimes pulling the data from a Subject Table is more convenient (the data would otherwise be spread across multiple Detail Tables).

ACS Website Notes:

  • Detailed Tables contain the most detailed cross-tabulations, many of which are published down to block groups. The data are population counts. There are over 20,000 variables in this dataset.

  • Subject Tables provide an overview of the estimates available in a particular topic. The data are presented as population counts and percentages. There are over 18,000 variables in this dataset.

For more Information (via API) Please Visit This Link
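
To make the notes above concrete, these are the ACS 5-year API endpoints this tutorial works against (the same URLs appear in the code further below); which base URL is used depends on the first letter of the Table ID ('B' for Detail, 'S' for Subject):

year = '2017'

# Data endpoints
detail_base  = f'https://api.census.gov/data/{year}/acs/acs5?'          # Detail (B...) tables
subject_base = f'https://api.census.gov/data/{year}/acs/acs5/subject?'  # Subject (S...) tables

# Metadata (table directory) endpoints used in the exploration cells below
detail_groups = f'https://api.census.gov/data/{year}/acs/acs5/groups/'
subject_vars  = f'https://api.census.gov/data/{year}/acs/acs5/subject/variables.json'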

Guided Walkthrough

SETUP

Install these libraries onto the virtual environment.

! pip install  -U -q ipywidgets 
! pip install geopandas
# hide
# @title Run: Install Modules
# Install the Widgets Module.
# Colabs does not locally provide this Python Library
# The '!' is a special prefix used in Colabs when talking to the terminal
! pip install -U -q ipywidgets
! pip install geopandas

# export
# @title Run: Import Modules
# Once installed we need to
# import and configure the Widgets
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# About importing data
import urllib.request as urllib
from urllib.parse import urlencode

# This Prevents Timeouts when Importing
import socket
socket.setdefaulttimeout(10.0)

# Pandas Data Manipulation Libraries
import pandas as pd
# Working with Json Data
import json
# Data Processing
import numpy as np
# Reading Json Data into Pandas
from pandas.io.json import json_normalize
# Export data as CSV
import csv

# Show entire column widths
pd.set_option('display.max_colwidth', -1)
pd.set_option('max_colwidth', 20)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.precision', 2)

# Geo-Formatting
# Postgres-Conversion
import geopandas as gpd
from geopandas import GeoDataFrame
import psycopg2, pandas, numpy
from shapely import wkb
from shapely.wkt import loads
import os
import sys

# In case file is KML
# enable KML support; disabled by default
import fiona
fiona.drvsupport.supported_drivers['kml'] = 'rw'
fiona.drvsupport.supported_drivers['KML'] = 'rw'

# load libraries
# from pandas import ExcelWriter
# from pandas import ExcelFile
import matplotlib.pyplot as plt
import glob
import imageio

# hide
%matplotlib inline
!jupyter nbextension enable --py widgetsnbextension

Explore Table Directories

Please Note: The following section details a programmatic way to access and explore the census data catalogs. It is advised that, rather than use this portion of the tutorial, you read the section 'Search Advice' above, which provides links to dedicated websites hosted by the Census Bureau explicitly for your data exploration needs!

Explore the Detailed Table Directory

Retrieve and search available ACS datasets through the ACS's table directory.

The table directory contains TableIds and Descriptions for each data table the ACS provides.

By running the next cell, an interactive searchbox will filter the directory for keywords within the description.

Be sure to grab the TableId once you find a table with a description of interest.

response = urllib.urlopen('https://api.census.gov/data/2017/acs/acs5/groups/')
metaDataTable = json_normalize( json.loads(response.read())['groups'] )
metaDataTable.set_index('name', drop=True, inplace=True)
description = input("Search ACS Table Directory by Keyword: ")
metaDataTable[ metaDataTable['description'].str.contains(description.upper()) ]
#hide
#@title Run: Import Dataset Directory
pd.set_option('display.max_columns', None)

url = 'https://api.census.gov/data/2017/acs/acs5/groups/'
response = urllib.urlopen(url)
data = json.loads(response.read())
data = data['groups']
metaDataTable = json_normalize(data)
metaDataTable.set_index('name', drop=True, inplace=True)

#--------------------
# SEARCH BOX 1: This reliably produces a search box.
# The cell must be rerun for every query.
#--------------------
description = input("Search ACS Table Directory by Keyword: ")
metaDataTable[ metaDataTable['description'].str.contains(description.upper()) ]

#--------------------
# SEARCH BOX 2: FOR CHROME USERS:
# Commenting out the code above and running the code
# below will update the searchbox in real time.
#--------------------
# @interact
# def tableExplorer(description='family'):
#     return metaDataTable[ metaDataTable['description'].str.contains(description.upper()) ]

Once a table has been picked from the explorer, you can inspect its column names in the next part.

This will help ensure it has the data you need!

tableId = input("Please enter a Table ID to inspect: ")
url = f'https://api.census.gov/data/2017/acs/acs5/groups/{tableId}.json'
metaDataTable = pd.read_json(url)
metaDataTable.reset_index(inplace=True, drop=False)
metaDataTable = pd.merge(
    json_normalize(data=metaDataTable['variables']),
    metaDataTable['index'], left_index=True, right_index=True )
metaDataTable = metaDataTable[['index', 'concept']].dropna(subset=['concept'])

#hide
#@title Run: Interactive Table Lookup
import json
import pandas as pd
from pandas.io.json import json_normalize
pd.set_option('display.max_columns', None)

#--------------------
# SEARCH BOX 1: This reliably produces a search box.
# The cell must be rerun for every query.
#--------------------
tableId = input("Please enter a Table ID to inspect: ")
url = f'https://api.census.gov/data/2017/acs/acs5/groups/{tableId}.json'
metaDataTable = pd.read_json(url)
metaDataTable.reset_index(inplace=True, drop=False)
metaDataTable = pd.merge(json_normalize(data=metaDataTable['variables']),
                         metaDataTable['index'], left_index=True, right_index=True)
metaDataTable = metaDataTable[['index', 'concept']]
metaDataTable = metaDataTable.dropna(subset=['concept'])
metaDataTable.head()

Explore the Subject Table Directory

The data structure we receive is different from the prior table.

Intake and processing is different as a result.

Now let's explore what we've got, just like before.

The only difference is that the column names are automatically included in this query.

url = 'https://api.census.gov/data/2017/acs/acs5/subject/variables.json'
data = json.loads(urllib.urlopen(url).read())['variables']
objArr = []
for key, value in data.items():
  value['name'] = key
  objArr.append(value)
metaDataTable = json_normalize(objArr)
metaDataTable.set_index('name', drop=True, inplace=True)
metaDataTable = metaDataTable[ ['attributes', 'concept', 'group', 'label', 'limit', 'predicateType' ] ]
concept = input("Search ACS Subject Table Directory by Keyword")
metaDataTable[ metaDataTable['concept'].str.contains(concept.upper(), na=False) ]
#hide
#@title Run: Interactive Dataset Directory
# Note the json representation
url = 'https://api.census.gov/data/2017/acs/acs5/subject/variables.json'
response = urllib.urlopen(url)
# Decode the url response as json
# https://docs.python.org/3/library/json.html
data = json.loads(response.read())
# the json object contains all its information within attribute 'variables'
data = data['variables']

# Process by flattening the raw json data
objArr = []
for key, value in data.items():
    value['name'] = key
    objArr.append(value)

# Normalize semi-structured JSON data into a flat table.
metaDataTable = json_normalize(objArr)
# Set the column 'name' as an index.
metaDataTable.set_index('name', drop=True, inplace=True)
# Reduce the directory to only contain these attributes
metaDataTable = metaDataTable[ ['attributes', 'concept', 'group', 'label', 'limit', 'predicateType'] ]

#--------------------
# SEARCH BOX 1: This reliably produces a search box.
# The cell must be rerun for every query.
#--------------------
concept = input("Search ACS Subject Table Directory by Keyword: ")
metaDataTable[ metaDataTable['concept'].str.contains(concept.upper(), na=False) ]

#--------------------
# SEARCH BOX 2: FOR CHROME USERS:
# Commenting out the code above and running the code
# below will update the searchbox in real time.
#--------------------
#@interact
#def subjectExplorer(concept='transport'):
#    return metaDataTable[ metaDataTable['concept'].str.contains(concept.upper(), na=False) ]

Get Table Data

Intro

Hopefully, by now you know which datatable you would like to download!

The following Python function will do that for you.

  • It can be imported and used in future projects or stand alone.
#export
# @title Run: Create retrieve_acs_data()
# File: retrieveAcsData.py
# Author: Charles Karpati
# Date: 1/9/19
# Section: Bnia
# Email: karpati1@umbc.edu
# Description:
#   This file returns ACS data given an ID and Year.
#   The county total is given a tract of '010000'
#
# def retrieve_acs_data():
# purpose: Retrieves ACS data from the web
# input:
#   state (required)
#   county (required)
#   tract (required)
#   tableId (required)
#   year (required)
#   includeCountyAgg (True)(todo)
#   replaceColumnNames (False)(todo)
#   save (required)
# output:
#   ACS Data.
#   Prints to ../../data/2_cleaned/acs/

def retrieve_acs_data(state, county, tract, tableId, year, save):
    dictionary = ''
    keys = []
    vals = []
    header = []
    keys1=keys2=keys3=keys4=keys5=keys6=keys7=keys8=''
    keyCount = 0

    # Called in addKeys(). Will create the final URL for readIn()
    # These are parameters used in the API URL Query
    # This query will retrieve the census tracts
    def getParams(keys):
        return {
            'get': 'NAME' + keys,
            'for': 'tract:' + tract,
            'in': 'state:' + state + ' county:' + county,
            'key': '829bf6f2e037372acbba32ba5731647c5127fdb0'
        }

    # Aggregate City data is best retrieved separately rather than as an aggregate of its constituent tracts
    def getCityParams(keys):
        return {
            'get': 'NAME' + keys,
            'for': 'county:' + county,
            'in': 'state:' + state,
            'key': '829bf6f2e037372acbba32ba5731647c5127fdb0'
        }

    # Called in addKeys(). Requests data by url and preformats it.
    def readIn(url):
        tbl = pd.read_json(url, orient='records')
        tbl.columns = tbl.iloc[0]
        return tbl

    # Called by retrieve_acs_data().
    # Creates a url and retrieves the data.
    # Then appends the city values as tract '010000'.
    # Finally it merges and returns the tract and city totals.
    def addKeys(table, params):
        # Get Tract and City Records For Specific Columns
        table2 = readIn(base + urlencode(getParams(params)))
        table3 = readIn(base + urlencode(getCityParams(params)))
        table3['tract'] = '010000'
        # Concatenate the Records
        table2 = pd.concat([table2, table3], ignore_index=True)
        # Merge to Master Table
        table = pd.merge(table, table2, how='left',
                         left_on=["NAME", "state", "county", "tract"],
                         right_on=["NAME", "state", "county", "tract"])
        return table

    #~~~~~~~~~~~~~~~
    # Step 1)
    # Retrieve a Meta Data Table Describing the Content of the Table
    #~~~~~~~~~~~~~~~
    url = 'https://api.census.gov/data/20'+year+'/acs/acs5/groups/'+tableId+'.json'
    metaDataTable = pd.read_json(url, orient='records')

    #~~~~~~~~~~~~~~~
    # Step 2)
    # Create a Dictionary using the Meta Data Table
    #~~~~~~~~~~~~~~~
    # Multiple Queries may be Required.
    # Max columns returned from any given query is 50.
    # For that reason, bin the Columns into groups of ~40.
    for key in metaDataTable['variables'].keys():
        if key[-1:] == 'E':
            keyCount = keyCount + 1
            if keyCount < 40:    keys1 = keys1+','+key
            elif keyCount < 80:  keys2 = keys2+','+key
            elif keyCount < 120: keys3 = keys3+','+key
            elif keyCount < 160: keys4 = keys4+','+key
            elif keyCount < 200: keys5 = keys5+','+key
            elif keyCount < 240: keys6 = keys6+','+key
            elif keyCount < 280: keys7 = keys7+','+key
            elif keyCount < 320: keys8 = keys8+','+key
            keys.append(key)
            val = metaDataTable['variables'][key]['label']
            # Column name formatting
            val = key+'_'+val.replace('Estimate!!', '').replace('!!', '_').replace(' ', '_')
            vals.append(val)
    dictionary = dict(zip(keys, vals))

    #~~~~~~~~~~~~~~~
    # Step 3)
    # Get the actual Table with the data we want
    #~~~~~~~~~~~~~~~
    # The URL we call is contingent on whether the Table we want is a Detailed or Subject table
    url1 = 'https://api.census.gov/data/20'+year+'/acs/acs5?'
    url2 = 'https://api.census.gov/data/20'+year+'/acs/acs5/subject?'
    base = ''
    if tableId[:1] == 'B': base = url1
    if tableId[:1] == 'S': base = url2

    # The addKeys function only works after the first set of columns has been downloaded

    # Download First set of Tract columns
    url = base + urlencode(getParams(keys1))
    table = pd.read_json(url, orient='records')
    table.columns = table.iloc[0]
    table = table.iloc[1:]

    # Download First set of Aggregate City data
    url = base + urlencode(getCityParams(keys1))
    table2 = pd.read_json(url, orient='records')
    table2.columns = table2.iloc[0]
    table2 = table2[1:]
    table2['tract'] = '010000'

    # Merge them
    table = pd.concat([table, table2], ignore_index=True)

    # Now we can repeatedly use addKeys() to add as many columns as there are keys listed from the meta data table
    if keys2 != '': table = addKeys(table, keys2)
    if keys3 != '': table = addKeys(table, keys3)
    if keys4 != '': table = addKeys(table, keys4)
    if keys5 != '': table = addKeys(table, keys5)
    if keys6 != '': table = addKeys(table, keys6)
    if keys7 != '': table = addKeys(table, keys7)
    if keys8 != '': table = addKeys(table, keys8)

    #~~~~~~~~~~~~~~~
    # Step 4)
    # Prepare Column Names using the meta data table. The raw data has column names in the first row, as well.
    # Replace column IDs with labels from the dictionary where applicable (should be always)
    #~~~~~~~~~~~~~~~
    print('Number of Columns', len(dictionary))
    header = []
    for column in table.columns:
        if column in keys:
            header.append(dictionary[column])
        else:
            header.append(column)
    table.columns = header

    # Prettify Names. Only happens with Baltimore...
    table['NAME'] = table['NAME'].str.replace(', Baltimore city, Maryland', '')
    table['NAME'][table['NAME'] == 'Baltimore city, Maryland'] = 'Baltimore City'

    # Convert Columns from Strings to numeric types where Applicable
    table = table.apply(pd.to_numeric, errors='ignore')

    # Set the 'NAME' Column as the index, dropping the default increment
    table.set_index("NAME", inplace=True)

    if save:
        # Save the raw data as 'STATECOUNTY_TABLEID_5yYEAR_est_Original.csv'
        table.to_csv('./'+state+county+'_'+tableId+'_5y'+year+'_est_Original.csv', quoting=csv.QUOTE_ALL)
        # Remove the id in the column names & Save the data as 'STATECOUNTY_TABLEID_5yYEAR_est.csv'
        saveThis = table.rename(columns=lambda x: (str(x)[:] if str(x) in ["NAME", "state", "county", "tract"] else str(x)[12:]))
        saveThis.to_csv('./'+state+county+'_'+tableId+'_5y'+year+'_est.csv', quoting=csv.QUOTE_ALL)

    return table

Function Explanation

Description: This function returns ACS data given appropriate params.

Purpose: Retrieves ACS data from the web

Services

  • Download an ACS dataset from a Subject (S) table
  • Download an ACS dataset from a Details (B) table

Input:

  • state
  • county
  • tract
  • tableId
  • year
  • saveAcs

Output:

  • ACS Data.
  • Prints to ../../data/2_cleaned/acs/
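
For orientation, a call using the Baltimore City codes shown earlier in this tutorial might look like the sketch below (illustrative only; the runnable cell appears under 'Function Examples' further down):

tract = '*'        # all tracts in the county
county = '510'     # Baltimore City FIPS county code
state = '24'       # Maryland FIPS state code
tableId = 'B19001' # the example Detail table named at the top of this tutorial
year = '17'        # 2017 5-year estimates
saveAcs = True     # also write the two CSV files described above

df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)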

How it works

  • Before our program retrieves the actual data, it will want the table's metadata.

    • This metadata will be used as a crosswalk to replace the awkward column names (see the short sketch after this list).
    • If this is not done, each column would be denoted only by a column ID, which is not human readable.
  • The function changes the URL it requests data from depending on whether the user has requested an S (Subject) or B (Detail) type table.

  • Multiple calls for data must be made, as a single table may have several hundred columns in it.

    • Constructing a table requires merging the data from multiple responses.
  • Our program pulls not just tract-level data but also the aggregate for the county.

    • County totals are included automatically as 'tract 010000'.
      • The county total is not the sum of all other tracts but a separate, independent, and unique query.
    • The tract and county datatables must be merged to form a single dataset.
  • Finally, we will download the data in two different formats if desired.

  • If we choose to save the data, we save it once with the Table IDs + column names, and once without the Table IDs.
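
As a minimal sketch of the crosswalk step described above (assuming table B19001 and year 2017; the full logic lives inside retrieve_acs_data()):

import json
import urllib.request

# Pull the table's metadata (the same groups endpoint the function uses)
meta_url = 'https://api.census.gov/data/2017/acs/acs5/groups/B19001.json'
variables = json.loads(urllib.request.urlopen(meta_url).read())['variables']

# Keep only Estimate columns (variable IDs ending in 'E') and build an
# ID -> human-readable-label crosswalk, formatted the same way as in retrieve_acs_data()
crosswalk = {
    key: key + '_' + val['label'].replace('Estimate!!', '').replace('!!', '_').replace(' ', '_')
    for key, val in variables.items() if key.endswith('E')
}

# A downloaded table whose columns are raw variable IDs could then be renamed:
# table = table.rename(columns=crosswalk)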

Function Diagrams

Function Examples

Now use this function to Download the Data!

# Our download function will use Baltimore City's tract, county and state as internal parameters
# Changing the values in the cell below to different geographic reference codes will change those parameters
tract = '*'
county = '153' # '059' # 153 '510'
state = '51'

# Specify the download parameters the function will receive here
tableId = 'B19049' # 'B19001'
year = '17'
saveAcs = True

# state, county, tract, tableId, year, save
df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)
df.head()
