---
title: From ACS Download to Gif
keywords: fastai
sidebar: home_sidebar
summary: "In this tutorial, we run the full gamut, from downloading ACS data to creating map GIFs across several years of data."
description: "In this tutorial, we run the full gamut, from downloading ACS data to creating map GIFs across several years of data."
nb_path: "notebooks/05_ACS_From_Download_To_Gif.ipynb"
---
This Coding Notebook is the fifth in a series.
An interactive version can be found here.
This colab and more can be found on our webpage.
Content covered in previous tutorials will be used in later tutorials.
New code and information will have explanations or descriptions attached.
Concepts or code covered in previous tutorials will be used without being explained in their entirety.
The Dataplay Handbook is created using development techniques covered in the Datalabs Guidebook.
If content cannot be found in the current tutorial and is not covered in previous tutorials, please let me know.
This notebook has been optimized for Google Colab run in a Chrome browser.
Statements found on the index page regarding views expressed, responsibility, errors and omissions, use at your own risk, and licensing extend throughout this tutorial.
In this colab
Median Household Income is just one of the many calculations that may be performed on the data using publicly available code found in our GitHub codebase.
This colab and more can be found at https://gist.github.com/bniajfi
Developers Resource: https://www.census.gov/developers/
ACS API: https://www.census.gov/data/developers/data-sets.html
*Please note: you will need to run this next cell first in order for any of the code after it to work.
%%capture
! pip install -U -q PyDrive
! pip install geopy
! pip install geopandas
! pip install geoplot
! pip install dexplot
! pip install dataplay
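Throughout this notebook we use pandas, geopandas, matplotlib, imageio, and the dataplay helpers (retrieve_acs_data, readInGeometryData, mergeDatasets). A minimal import cell is sketched below; the dataplay module paths are assumed to match the layout used in the earlier tutorials, so adjust them if your installed version differs.
# Standard libraries used later for saving files and walking the file system
import csv
import os
import sys
# Data handling and plotting
import pandas as pd
import geopandas as gpd
from geopandas import GeoDataFrame
from shapely.wkt import loads
import matplotlib.pyplot as plt
import imageio
# dataplay helpers (module paths assumed; adjust if your dataplay version differs)
from dataplay.acsDownload import retrieve_acs_data
from dataplay.geoms import readInGeometryData
from dataplay.merge import mergeDatasets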
Now use this function to Download the Data!
# Change these values below; using different geographic reference codes will change which data is downloaded
tract = '*'
county = '510'
state = '24'
# Specify the download parameters the function will receive here
tableId = 'B19001'
year = '17'
saveAcs = True
df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)
df.head()
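Before crosswalking, it can help to confirm what came back: retrieve_acs_data returns one row per tract and names each column with the ACS variable code plus its label (for example B19001_001E_Total), which is how the indicator functions below locate columns by substring. A quick check:
print(df.shape)
print(list(df.columns[:5]))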
pd.set_option('max_colwidth', 20)
mapUrl = "https://opendata.arcgis.com/datasets/b738a8587b6d479a8824d937892701d8_0.geojson"
# mapUrl = "https://docs.google.com/spreadsheets/d/e/2PACX-1vQ8xXdUaT17jkdK0MWTJpg3GOy6jMWeaXTlguXNjCSb8Vr_FanSZQRaTU-m811fQz4kyMFK5wcahMNY/pub?gid=886223646&single=true&output=csv"
csaMap = readInGeometryData(url=mapUrl, porg='g', geom='geometry', lat=False, lng=False, revgeocode=False, save=False, in_crs=2248, out_crs=2248)
csaMap.head()
csaMap.plot()
csaMap = pd.DataFrame(csaMap)
# Instructions to do this yourself: https://support.google.com/docs/answer/183965?co=GENIE.Platform%3DDesktop&hl=en
# Primary Table
left_ds = df
left_col = 'tract'
# Crosswalk Table
# Table: Crosswalk Census Communities
# 'TRACT2010', 'GEOID2010', 'CSA2010'
crosswalk_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vREwwa_s8Ix39OYGnnS_wA8flOoEkU7reIV4o3ZhlwYhLXhpNEvnOia_uHUDBvnFptkLLHHlaQNvsQE/pub?output=csv'
# Secondary Table
# Table: Baltimore Boundaries
# 'TRACTCE10', 'GEOID10', 'CSA', 'NAME10', 'Tract', 'geometry'
merge_how = 'geometry'
interactive = True
merge_how = 'outer'
merged = mergeDatasets( left_ds=df, left_col='tract',
use_crosswalk=True, crosswalk_ds=crosswalk_ds,
crosswalk_left_col = 'TRACT2010', crosswalk_right_col = 'CSA2010',
right_ds=csaMap, right_col='CSA2010',
merge_how=merge_how, interactive = False )
merged.columns = merged.columns.str.replace("$", "", regex=False)  # strip literal dollar signs from column names
merged.head()
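A quick sanity check on the merge, assuming the crosswalked community column is named CSA2010 as above: with an outer merge, any tract that failed to match will show a missing value in that column or in geometry.
print(merged['CSA2010'].isna().sum(), 'rows without a community label')
print(merged['geometry'].isna().sum(), 'rows without a geometry')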
gdf = gpd.GeoDataFrame(merged, geometry='geometry')
gdf.plot()
gdf.head()
gdf['tempsCount'] = 1
temp = gdf.groupby('CSA2010').sum(numeric_only=True)
temp = temp.reset_index()
temp
# mergedGeom = readInGeometryData(url=merged, porg='g', geom='geometry', lat=False, lng=False, revgeocode=False, save=False, in_crs=2248, out_crs=2248)
Geographic data may be crosswalked as well. More information in the 'Maps' section.
Awesome!
By this point you should be able to download a dataset and crosswalk new columns onto it by matching on 'tract'.
What we are going to do now is perform calculations using these newly created datasets.
Run the next few cells to create our calculation functions.
# These functions right here are used in the calculations below.
# Finds the first column whose name matches a substring
def getColName (df, col): return df.columns[df.columns.str.contains(pat = col)][0]
def getColByName (df, col): return df[getColName(df, col)]
# Pulls a column from one dataset into a new dataset.
# This is not a crosswalk. Calls getColName() and getColByName()
def addKey(df, fi, col):
key = getColName(df, col)
val = getColByName(df, col)
fi[key] = val
return fi
# Returns the sum of two specified columns, or 0 if that sum is 0.
def nullIfEqual(df, c1, c2):
return df.apply(lambda x:
x[getColName(df, c1)]+x[getColName(df, c2)] if x[getColName(df, c1)]+x[getColName(df, c2)] != 0 else 0, axis=1)
# I'm thinking this doesn't need to be a function..
def sumInts(df): return df.sum(numeric_only=True)
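To see what addKey does, here is a tiny self-contained example on a toy frame (not ACS data): getColName finds the first column whose name contains the substring, and addKey copies that column, under its full name, into the target frame.
toy = pd.DataFrame({'B19001_001E_Total': [10, 20], 'tract': ['000100', '000200']})
out = pd.DataFrame()
out = addKey(toy, out, 'B19001_001E')
print(out.columns.tolist())   # ['B19001_001E_Total']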
#File: mhhi.py
#Author: Charles Karpati
#Date: 1/24/19
#Section: Bnia
#Email: karpati1@umbc.edu
#Description:
# Uses ACS Table B19001 - HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2016 INFLATION-ADJUSTED DOLLARS)
# Universe: Households
# Table Creates: hh25 hh40 hh60 hh75 hhm75, mhhi
#purpose: Produce Median Household Income (mhhi) Indicator
#input:
#output:
import pandas as pd
import glob
def mhhi( df, columnsToInclude = [] ):
#~~~~~~~~~~~~~~~
# Step 2)
# Prepare the columns
#~~~~~~~~~~~~~~~
info = pd.DataFrame(
[
['B19001_002E', 0, 10000],
['B19001_003E', 10000, 4999 ],
['B19001_004E', 15000, 4999 ],
['B19001_005E', 20000, 4999 ],
['B19001_006E', 25000, 4999 ],
['B19001_007E', 30000, 4999],
['B19001_008E', 35000, 4999 ],
['B19001_009E', 40000, 4999 ],
['B19001_010E', 45000, 4999 ],
['B19001_011E', 50000, 9999 ],
['B19001_012E', 60000, 14999],
['B19001_013E', 75000, 24999 ],
['B19001_014E', 100000, 24999 ],
['B19001_015E', 125000, 24999 ],
['B19001_016E', 150000, 49000 ],
['B19001_017E', 200000, 1000000000000000000000000 ],
],
columns=['variable', 'lower', 'range']
)
# Final Dataframe
data_table = pd.DataFrame()
for index, row in info.iterrows():
data_table = addKey(df, data_table, row['variable'])
# Accumulate totals across the columns.
# Midpoint: divide column index 16 (the last column) of the cumulative totals by 2
temp_table = data_table.cumsum(axis=1)
temp_table['midpoint'] = (temp_table.iloc[ : , -1 :] /2) # V3
temp_table['midpoint_index'] = False
temp_table['midpoint_index_value'] = False # Z3
temp_table['midpoint_index_lower'] = False # W3
temp_table['midpoint_index_range'] = False # X3
temp_table['midpoint_index_minus_one_cumulative_sum'] = False #Y3
# step 3 - csa_agg3: get the midpoint index by "when midpoint > agg[1] and midpoint <= agg[2] then 2"
# Get CSA Midpoint Index using the breakpoints in our info table.
for index, row in temp_table.iterrows():
# Get the index of the first column where our midpoint is greater than the columns value.
midpoint = row['midpoint']
midpoint_index = 0
# For each column (except the 6 columns we just created)
# The tract's midpoint was less than its value in the first column, 'B19001_002E_Total_Less_than_$10,000'
if( midpoint < int(row[0]) or row[-6] == False ):
temp_table.loc[ index, 'midpoint_index' ] = 0
else:
for column in row.iloc[:-6]:
# set midpoint index to the column with the highest value possible that is under midpoint
if( midpoint >= int(column) ):
if midpoint==False: print (str(column) + ' - ' + str(midpoint))
temp_table.loc[ index, 'midpoint_index' ] = midpoint_index +1
midpoint_index += 1
# temp_table = temp_table.drop('Unassigned--Jail')
for index, row in temp_table.iterrows():
temp_table.loc[ index, 'midpoint_index_value' ] = data_table.loc[ index, data_table.columns[row['midpoint_index']] ]
temp_table.loc[ index, 'midpoint_index_lower' ] = info.loc[ row['midpoint_index'] ]['lower']
temp_table.loc[ index, 'midpoint_index_range' ] = info.loc[ row['midpoint_index'] ]['range']
temp_table.loc[ index, 'midpoint_index_minus_one_cumulative_sum'] = row[ row['midpoint_index']-1 ]
# This is our denominator, which can't be zero (set to 1 if unset to avoid division by zero).
for index, row in temp_table.iterrows():
if row['midpoint_index_value']==False:
temp_table.at[index, 'midpoint_index_value']=1;
#~~~~~~~~~~~~~~~
# Step 3)
# Run the Calculation
# Calculation = midpoint_lower::numeric + ( midpoint_range::numeric * ( (midpoint - midpoint_upto_agg) / nullif(midpoint_total, 0) ) )
# Calculation = W3+X3*((V3-Y3)/Z3)
# v3 -> 1 - midpoint of households == sum / 2
# w3 -> 2 - lower limit of the income range containing the midpoint of the housing total == row[lower]
# x3 -> width of the interval containing the median == row[range]
# z3 -> number of hhs within the interval containing the median == row[total]
# y3 -> 4 - cumulative frequency up to, but NOT including, the median interval
#~~~~~~~~~~~~~~~
def finalCalc(x):
return ( x['midpoint_index_lower']+ x['midpoint_index_range']*(
( x['midpoint']-x['midpoint_index_minus_one_cumulative_sum'])/ x['midpoint_index_value'] )
)
temp_table['final'] = temp_table.apply(lambda x: finalCalc(x), axis=1)
columnsToInclude.append('tract')
print ('INCLUDING COLUMN(s):' + str(columnsToInclude))
temp_table[columnsToInclude] = df[columnsToInclude]
#~~~~~~~~~~~~~~~
# Step 4)
# Add Special Baltimore City Data
#~~~~~~~~~~~~~~~
# url = 'https://api.census.gov/data/20'+str(year)+'/acs/acs5/subject?get=NAME,S1901_C01_012E&for=county%3A510&in=state%3A24&key=829bf6f2e037372acbba32ba5731647c5127fdb0'
# table = pd.read_json(url, orient='records')
# temp_table['final']['Baltimore City'] = float(table.loc[1, table.columns[1]])
return temp_table
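The interpolation in finalCalc is the standard grouped-median formula: median = lower + range * ((midpoint - cumulative count below the median bin) / count inside the median bin). A hand-checked toy example with made-up counts (not ACS output) shows the arithmetic:
# 10 households: 4 earn under $10,000, 4 earn $10,000-$14,999, 2 earn more.
lower, width = 10000, 4999      # bin containing the midpoint household
midpoint = 10 / 2               # half the total number of households
below, in_bin = 4, 4            # cumulative count below the bin, count inside it
print(lower + width * ((midpoint - below) / in_bin))   # 11249.75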
#File: trav45.py
#Author: Charles Karpati
#Date: 1/17/19
#Section: Bnia
#Email: karpati1@umbc.edu
#Description:
# Uses ACS Table B08303 - TRAVEL TIME TO WORK,
# (Universe: Workers 16 years and over who did not work at home)
# Table Creates: trav14, trav29, trav44, trav45
#purpose: Produce Sustainability - Percent of Employed Population with Travel Time to Work of 45 Minutes and Over Indicator
#input:
#output:
import pandas as pd
import glob
def trav45(df, columnsToInclude = [] ):
#~~~~~~~~~~~~~~~
# Step 2)
# Prepare the columns
#~~~~~~~~~~~~~~~
# Final Dataframe
fi = pd.DataFrame()
columns = ['B08303_011E','B08303_012E','B08303_013E','B08303_001E', 'tract']
columns.extend(columnsToInclude)
for col in columns:
fi = addKey(df, fi, col)
# Numerators
numerators = pd.DataFrame()
columns = ['B08303_011E','B08303_012E','B08303_013E']
for col in columns:
numerators = addKey(df, numerators, col)
# Denominators
denominators = pd.DataFrame()
columns = ['B08303_001E']
for col in columns:
denominators = addKey(df, denominators, col)
# construct the denominator, returns 0 iff the other two rows are equal.
#~~~~~~~~~~~~~~~
# Step 3)
# Run the Calculation
# ( (value[1] + value[2] + value[3] ) / nullif(value[4],0) )*100
#~~~~~~~~~~~~~~~
fi['numerator'] = numerators.sum(axis=1)
fi['denominator'] = denominators.sum(axis=1)
fi = fi[fi['denominator'] != 0] # Delete Rows where the 'denominator' column is 0
fi['final'] = (fi['numerator'] / fi['denominator'] ) * 100
return fi
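Each indicator function can also be sanity-checked on its own before wiring it into createIndicator below. A sketch, assuming table B08303 is downloaded with the same geography settings used earlier:
trav_df = retrieve_acs_data(state, county, tract, 'B08303', year, False)
trav45(trav_df).head()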
#File: trav44.py
#Author: Charles Karpati
#Date: 1/17/19
#Section: Bnia
#Email: karpati1@umbc.edu
#Description:
# Uses ACS Table B08303 - TRAVEL TIME TO WORK,
# (Universe: Workers 16 years and over who did not work at home)
# Table Creates: trav14, trav29, trav44, trav45
#purpose: Produce Sustainability - Percent of Employed Population with Travel Time to Work of 30-44 Minutes Indicator
#input:
#output:
import pandas as pd
import glob
def trav44( df, columnsToInclude = [] ):
#~~~~~~~~~~~~~~~
# Step 2)
# Prepare the columns
#~~~~~~~~~~~~~~~
# Final Dataframe
fi = pd.DataFrame()
columns = ['B08303_008E','B08303_009E','B08303_010E','B08303_001E', 'tract']
columns.extend(columnsToInclude)
for col in columns:
fi = addKey(df, fi, col)
# Numerators
numerators = pd.DataFrame()
columns = ['B08303_008E','B08303_009E','B08303_010E']
for col in columns:
numerators = addKey(df, numerators, col)
# Denominators
denominators = pd.DataFrame()
columns = ['B08303_001E']
for col in columns:
denominators = addKey(df, denominators, col)
# construct the denominator, returns 0 iff the other two rows are equal.
#~~~~~~~~~~~~~~~
# Step 3)
# Run the Calculation
# ( (value[1] + value[2] + value[3] ) / nullif(value[4],0) )*100
#~~~~~~~~~~~~~~~
fi['numerator'] = numerators.sum(axis=1)
fi['denominator'] = denominators.sum(axis=1)
fi = fi[fi['denominator'] != 0] # Delete Rows where the 'denominator' column is 0
fi['final'] = (fi['numerator'] / fi['denominator'] ) * 100
return fi
#File: affordr.py
#Author: Charles Karpati
#Date: 1/17/19
#Section: Bnia
#Email: karpati1@umbc.edu
#Description:
# Uses ACS Table B25070 - GROSS RENT AS A PERCENTAGE OF HOUSEHOLD INCOME IN THE PAST 12 MONTHS
# Universe: Renter-occupied housing units
#purpose: Produce Housing and Community Development - Affordability Index - Rent Indicator
#input:
#output:
import pandas as pd
import glob
def affordr( df, columnsToInclude ):
#~~~~~~~~~~~~~~~
# Step 2)
# Prepare the columns
#~~~~~~~~~~~~~~~
# Final Dataframe
fi = pd.DataFrame()
columns = ['B25070_007E','B25070_008E','B25070_009E','B25070_010E','B25070_001E', 'tract']
columns.extend(columnsToInclude)
for col in columns:
fi = addKey(df, fi, col)
# Numerators
numerators = pd.DataFrame()
columns = ['B25070_007E','B25070_008E','B25070_009E','B25070_010E']
for col in columns:
numerators = addKey(df, numerators, col)
# Denominators
denominators = pd.DataFrame()
columns = ['B25070_001E']
for col in columns:
denominators = addKey(df, denominators, col)
# construct the denominator, returns 0 iff the other two rows are equal.
#~~~~~~~~~~~~~~~
# Step 3)
# Run the Calculation
# ( (value[1]+value[2]+value[3]+value[4]) / nullif(value[5],0) )*100
#~~~~~~~~~~~~~~~
fi['numerator'] = numerators.sum(axis=1)
fi['denominator'] = denominators.sum(axis=1)
fi = fi[fi['denominator'] != 0] # Delete Rows where the 'denominator' column is 0
fi['final'] = (fi['numerator'] / fi['denominator'] ) * 100
return fi
#File: affordm.py
#Author: Charles Karpati
#Date: 1/25/19
#Section: Bnia
#Email: karpati1@umbc.edu
#Description:
# Uses ACS Table B25091 - MORTGAGE STATUS BY SELECTED MONTHLY OWNER COSTS AS A PERCENTAGE OF HOUSEHOLD INCOME IN THE PAST 12 MONTHS
# Universe: Owner-occupied housing units
# Table Creates:
#purpose: Produce Housing and Community Development - Affordability Index - Mortgage Indicator
#input:
#output:
import pandas as pd
import glob
def affordm( df, columnsToInclude ):
#~~~~~~~~~~~~~~~
# Step 1)
# Prepare the columns
#~~~~~~~~~~~~~~~
# Final Dataframe
fi = pd.DataFrame()
columns = ['B25091_008E','B25091_009E','B25091_010E','B25091_011E','B25091_002E', 'tract']
columns.extend(columnsToInclude)
for col in columns:
fi = addKey(df, fi, col)
# Numerators
numerators = pd.DataFrame()
columns = ['B25091_008E','B25091_009E','B25091_010E','B25091_011E']
for col in columns:
numerators = addKey(df, numerators, col)
# Denominators
denominators = pd.DataFrame()
columns = ['B25091_002E']
for col in columns:
denominators = addKey(df, denominators, col)
# construct the denominator, returns 0 iff the other two rows are equal.
#~~~~~~~~~~~~~~~
# Step 3)
# Run the Calculation
# ( (value[1]+value[2]+value[3]+value[4]) / nullif(value[5],0) )*100
#~~~~~~~~~~~~~~~
fi['numerator'] = numerators.sum(axis=1)
fi['denominator'] = denominators.sum(axis=1)
fi = fi[fi['denominator'] != 0] # Delete Rows where the 'denominator' column is 0
fi['final'] = (fi['numerator'] / fi['denominator'] ) * 100
return fi
#File: age5.py
#Author: Charles Karpati
#Date: 4/16/19
#Section: Bnia
#Email: karpati1@umbc.edu
#Description:
# Uses ACS Table B01001 - SEX BY AGE
# Universe: Total population
# Table Creates: tpop, female, male, age5 age18 age24 age64 age65
#purpose:
#input: #output:
import pandas as pd
import glob
def age5( df, columnsToInclude ):
#~~~~~~~~~~~~~~~
# Step 1)
# Prepare the columns
#~~~~~~~~~~~~~~~
# Final Dataframe
fi = pd.DataFrame()
columns = ['B01001_027E_Total_Female_Under_5_years',
'B01001_003E_Total_Male_Under_5_years',
'B01001_001E_Total' , 'tract']
columns.extend(columnsToInclude)
for col in columns:
fi = addKey(df, fi, col)
# Under 5
fi['final'] = ( df[ 'B01001_003E_Total_Male_Under_5_years' ]
+ df[ 'B01001_027E_Total_Female_Under_5_years' ]
) / df['B01001_001E_Total'] * 100
return fi
Now that our calculation functions have been created, let's create a final function that will download our data, optionally crosswalk it and aggregate it to the community level, and then run and return the appropriate calculation.
def createIndicator(state, county, tract, year, tableId, saveAcs, cwUrl,
local_match_col, foreign_match_col, foreign_wanted_col,
saveCrosswalked, saveCrosswalkedFileName, groupBy,
aggMethod, method, columnsToInclude, finalFileName):
# Pull the data
df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)
print('Table: ' + tableId + ', Year: ' + year + ' imported.')
# Get the crosswalk
if cwUrl:
crosswalk = pd.read_csv( cwUrl )
print('Crosswalk file imported')
# Merge crosswalk with the data
df = mergeDatasets( df, crosswalk, local_match_col, foreign_match_col, foreign_wanted_col, saveCrosswalked, saveCrosswalkedFileName )
print('Table merged')
# Group and Aggregate
if groupBy:
df = df.groupby(groupBy)
print('Grouped')
if aggMethod == 'sum':
df = sumInts(df)
else:
df = sumInts(df)
print('Aggregated')
print('Creating Indicator')
# Create the indicator
resp = method( df, columnsToInclude)
print('Indicator Created')
resp.to_csv(finalFileName, quoting=csv.QUOTE_ALL)
print('File Saved')
return resp
# Change these values below; using different geographic reference codes will change which data is downloaded
tract = '*'
county = '510'
state = '24'
# Specify the download parameters the acs download function will receive here
year = '17'
saveAcs = True
# Specify the crosswalk parameters
baltimore = '2PACX-1vREwwa_s8Ix39OYGnnS_wA8flOoEkU7reIV4o3ZhlwYhLXhpNEvnOia_uHUDBvnFptkLLHHlaQNvsQE'
cwUrl = 'https://docs.google.com/spreadsheets/d/e/' + baltimore + '/pub?output=csv'
local_match_col = 'tract'
foreign_match_col= 'TRACT2010'
foreign_wanted_col= 'CSA2010'
saveCrosswalked = True
crosswalkedFileName = False
groupBy = 'CSA2010'
aggMethod = 'sum'
columnsToInclude = []
# Alternatively
# groupBy = False
#columnsToInclude = ['CSA2010']
tableId = 'B19001'
finalFileName = './mhhi_12July2019_yes.csv'
# Group By Crosswalked column. Included automatically in final result
# groupBy = 'CSA2010'
# columnsToInclude = []
# Do Not Group, Include the Crosswalked Column in the final result
groupBy = False
columnsToInclude = ['CSA2010']
method = mhhi
ind1 = createIndicator(state, county, tract, year, tableId, saveAcs, cwUrl,
local_match_col, foreign_match_col, foreign_wanted_col, saveCrosswalked,
crosswalkedFileName, groupBy, aggMethod, method, columnsToInclude, finalFileName)
ind1.head()
tableId = 'B08303'
finalFileName = './trav45_20'+year+'_tracts_26July2019.csv'
method = trav45
ind2 = createIndicator(state, county, tract, year, tableId, saveAcs, cwUrl,
local_match_col, foreign_match_col, foreign_wanted_col, saveCrosswalked,
crosswalkedFileName, groupBy, aggMethod, method, columnsToInclude, finalFileName)
tableId = 'B08303'
finalFileName = './trav44_20'+year+'_tracts_26July2019.csv'
method = trav44
ind3 = createIndicator(state, county, tract, year, tableId, saveAcs, cwUrl,
local_match_col, foreign_match_col, foreign_wanted_col, saveCrosswalked,
crosswalkedFileName, groupBy, aggMethod, method, columnsToInclude, finalFileName)
tableId = 'B25070'
finalFileName = './affordr_20'+year+'_tracts_26July2019.csv'
method = affordr
ind4 = createIndicator(state, county, tract, year, tableId, saveAcs, cwUrl,
local_match_col, foreign_match_col, foreign_wanted_col, saveCrosswalked,
crosswalkedFileName, groupBy, aggMethod, method, columnsToInclude, finalFileName)
tableId = 'B25091'
finalFileName = './affordm_20'+year+'_tracts_26July2019.csv'
method = affordm
ind5 = createIndicator(state, county, tract, year, tableId, saveAcs, cwUrl,
local_match_col, foreign_match_col, foreign_wanted_col, saveCrosswalked,
crosswalkedFileName, groupBy, aggMethod, method, columnsToInclude, finalFileName)
tableId = 'B01001'
finalFileName = './age5_20'+year+'_communities_9Sept2019.csv'
method = age5
groupBy = 'CSA2010'
columnsToInclude = []
ind5 = createIndicator(state, county, tract, year, tableId, saveAcs, cwUrl,
local_match_col, foreign_match_col, foreign_wanted_col, saveCrosswalked,
crosswalkedFileName, groupBy, aggMethod, method, columnsToInclude, finalFileName)
ind5.head()
Census Geographic Data:
# pd.set_option('display.precision', 2)
# pd.reset_option('max_colwidth')
pd.set_option('max_colwidth', 20)
# pd.reset_option('max_colwidth')
Importing Point Data:
tableId = 'B19001'
finalFileName = './mhhi_12July2019_yes.csv'
baltimore = '2PACX-1vREwwa_s8Ix39OYGnnS_wA8flOoEkU7reIV4o3ZhlwYhLXhpNEvnOia_uHUDBvnFptkLLHHlaQNvsQE'
cwUrl = 'https://docs.google.com/spreadsheets/d/e/' + baltimore + '/pub?output=csv'
method = mhhi
local_match_col = 'tract'
foreign_match_col= 'TRACT2010'
foreign_wanted_col= 'CSA2010'
aggMethod = 'sum'
groupBy = False
columnsToInclude = ['CSA2010']
df = createIndicator(state, county, tract, year, tableId, saveAcs, cwUrl,
local_match_col, foreign_match_col, foreign_wanted_col,
saveCrosswalked, crosswalkedFileName, groupBy, aggMethod,
method, columnsToInclude, finalFileName)
df.head()
Importing Geom Data
string = 'boundaries-baltimore-tracts-NoWater-2010' # select one boundary dataset; the alternatives below stay commented out
# string = 'boundaries-baltimore-communities-NoWater-2010'
# string = 'boundaries-baltimore-neighborhoods'
# string = 'boundaries-moco-census-tracts-2010'
# string = 'boundaries-maryland_census-tracts-2010'
# string = 'boundaries-maryland-counties'
# string = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vRWBYggh3LGJ3quU-PhGXT2NvAtUb3aiXdZVKAO5VWCreUWZpAGz1uTbLvq6rF1TrJNiE81o6R5AP8F/pub?output=csv' # GoogleSpreadsheet Baltimore Tracts
# string = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSIJpLSmdQkvqJ3Wk6ONJmj_qHBgG_1naDxd0KcNyrT2LoJhqhoRSMtY1kyjy__xZ4Y8UN-tMNNAUa-/pub?output=csv' # GoogleSpreadsheet Evanston Tracts
# The file extension is used to determine the appropriate import method.
ext = '.csv'
# ext = '.geojson'
# ext = '.json'
# ext = '.kml'
# ext = 'googleSpreadSheet'
# This is where the findFile function comes in handy
url = findFile('../', string+ext)
geomData = ''
if ext == '.geojson' or ext == '.kml' or ext == '.json':
print(url)
geomData = gpd.read_file(url)
if ext == '.csv' or ext == 'googleSpreadSheet':
if ext == 'googleSpreadSheet':
url = string
geomData = pd.read_csv(url)
geomData['geometry'] = geomData['geometry'].apply(lambda x: loads( str(x) ))
geomData = GeoDataFrame(geomData, geometry='geometry')
geomData.plot()
geomData.head()
Save the geometry dataset in GeoJSON and CSV form if it is not already available in those formats.
if ext == '.csv':
print('saving csv as geojson')
geomData.to_file(string+".geojson", driver='GeoJSON')
if ext == '.geojson':
print('saving geojson as csv')
geomData.to_csv(string+'.csv', encoding="utf-8", index=False, quoting=csv.QUOTE_ALL)
if ext == '.json':
print('saving json as csv and geojson')
geomData.to_csv(string+'.csv', encoding="utf-8", index=False, quoting=csv.QUOTE_ALL)
geomData.to_file(string+".geojson", driver='GeoJSON')
if ext == '.kml':
print('saving kml as csv and geojson')
geomData.to_csv(string+'.csv', encoding="utf-8", index=False, quoting=csv.QUOTE_ALL)
geomData.to_file(string+".geojson", driver='GeoJSON')
Crosswalking Boundaries:
We can use the Geometry Data as a crosswalk and pass over the geometries to our principal dataset.
The crosswalkIt function must have been previously run for this option to work.
local_match_col = 'tract'
foreign_match_col = 'TRACTCE10'
#local_match_col = 'CSA2010'
#foreign_match_col = 'CSA'
foreign_wanted_col = 'geometry'
gdf = crosswalkIt( geomData, df, local_match_col, foreign_match_col, foreign_wanted_col, True, False )
# After crosswalking our boundaries, we need to make sure the geometry column is interpreted as actual geometries.
gdf = GeoDataFrame(gdf, geometry=foreign_wanted_col)
gdf.crs
type(gdf)
gdf.columns
gdf.head()
gdf.plot()
# Ensure your aggregation is correctly set for the crosswalk to work.
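One quick way to check that the boundary crosswalk worked: count how many rows came back without a geometry. A large number usually means the match columns ('tract' and 'TRACTCE10' here) are not aligned.
print(gdf['geometry'].isna().sum(), 'rows without a matched geometry')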
Reading in the Names Crosswalk with a URL.
We will Read in our Boundaries Crosswalk next.
evanston = '2PACX-1vSIJpLSmdQkvqJ3Wk6ONJmj_qHBgG_1naDxd0KcNyrT2LoJhqhoRSMtY1kyjy__xZ4Y8UN-tMNNAUa-'
baltComm = '2PACX-1vREwwa_s8Ix39OYGnnS_wA8flOoEkU7reIV4o3ZhlwYhLXhpNEvnOia_uHUDBvnFptkLLHHlaQNvsQE'
url = 'https://docs.google.com/spreadsheets/d/e/' + evanston + '/pub?output=csv'
namesCw = pd.read_csv( url )
namesCw.columns
namesCw.head()
Reading in the Boundary Data. We specify the geometry on import this time (a findFile-based file-path alternative is shown commented out).
evanston = '2PACX-1vSIJpLSmdQkvqJ3Wk6ONJmj_qHBgG_1naDxd0KcNyrT2LoJhqhoRSMtY1kyjy__xZ4Y8UN-tMNNAUa-'
url = 'https://docs.google.com/spreadsheets/d/e/' + evanston + '/pub?output=csv'
boundsCw = pd.read_csv(url)
boundsCw['geometry'] = boundsCw['geometry'].apply(lambda x: loads( str(x) ))
boundsCw = GeoDataFrame(boundsCw, geometry='geometry')
# file = findFile('../', 'boundaries-evanston-tracts-2010.geojson')
#boundsCw = gpd.read_file(file, geometry='geometry')
boundsCw.columns
boundsCw.plot()
Settings for the data we will pull
state = '17'
county = '031'
tract = '*'
tableId = 'B08303'
years = ['17','16', '15']
saveAcs = True
numer = ['001', '002', '003']
denom = ['001', '002', '003', '004', '005', '006', '007', '008', '009', '010', '011' ]
saveAcs = True
saveCw1 = True
cwlk1FileName = 'Example_NamedCrosswalk.csv'
saveCw2 = True
cwlk2FileName = 'Example_BoundaryCrosswalk.csv'
groupBy = False
#groupBy = 'CSA2010'
aggMethod = 'sum'
namesCw = namesCw
names_local_match_col = 'tract'
names_foreign_match_col = 'Tract10Num'
names_foreign_wanted_col = 'INTPTLAT10'
boundsCw = boundsCw
bounds_local_match_col = 'tract'
bounds_foreign_match_col = 'Tract10Num'
bounds_foreign_wanted_col = 'geometry'
def getAcsYears(state, county, tract, years,
tableId, numer, denom,
namesCw, lcw1Match, fcw1Match, fcw1Want, groupBy, aggMethod,
boundsCw, lcw2Match, fcw2Match, fcw2Want,
saveAcs, saveCw1, cwlk1FileName, saveCw2, cwlk2FileName):
# Get the data
count = 0
final = ''
for year in years:
df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)
if numer and denom:
# Numerators
numerators = pd.DataFrame()
for colSubString in numer:
colName = list(filter(lambda x: colSubString in x, df.columns))[0]
numerators = addKey(df, numerators, colName)
# Denominators
denominators = pd.DataFrame()
for colSubString in denom:
colName = list(filter(lambda x: colSubString in x, df.columns))[0]
denominators = addKey(df, denominators, colName)
# Run the Calculation
fi = pd.DataFrame()
df['numerator'] = numerators.sum(axis=1)
df['denominator'] = denominators.sum(axis=1)
df = df[df['denominator'] != 0] # Delete Rows where the 'denominator' column is 0
df['final'] = (df['numerator'] / df['denominator'] ) * 100
df = df.add_prefix('20'+year+'_')
newTract = '20'+year+'_tract'
df = df.rename(columns={newTract: 'tract'} )
if count == 0: count = count+1; final = df
else: final = final.merge(df, on='tract')
print('Table: ' + tableId + ', Year: ' + year + ' received.')
print('Download Complete')
# Merge crosswalk with the Names data
if isinstance(namesCw, pd.DataFrame):
print('local names crosswalk matching on '+lcw1Match)
print('foreign names crosswalk matching on '+fcw1Match)
print('Pulling from foreign names crosswalk '+fcw1Want)
final = crosswalkIt( namesCw, final, lcw1Match, fcw1Match, fcw1Want, saveCw1, cwlk1FileName )
# Group and Aggregate
if groupBy:
df = df.groupby(groupBy)
print('Grouped')
if aggMethod == 'sum':
df = sumInts(df)
else:
df = sumInts(df)
print('Aggregated')
# Merge crosswalk with the Geom data
print('local boundary crosswalk matching on '+lcw2Match)
print('foreign boundary crosswalk matching on '+fcw2Match)
print('Pulling from foreign boundary crosswalk '+fcw2Want+' with Columns')
print(boundsCw.columns)
final = crosswalkIt( boundsCw, final, lcw2Match, fcw2Match, fcw2Want, saveCw2, cwlk2FileName )
final = GeoDataFrame(final, geometry=fcw2Want)
return final
fnl = getAcsYears(state, county, tract, years, tableId, numer, denom,
namesCw, names_local_match_col, names_foreign_match_col, names_foreign_wanted_col, groupBy, aggMethod,
boundsCw, bounds_local_match_col, bounds_foreign_match_col, bounds_foreign_wanted_col,
saveAcs, saveCw1, cwlk1FileName, saveCw2, cwlk2FileName)
fnl.plot()
fnl.columns
fnl.head()
Data was successfully merged across all years and geometry.
Now we want the tract name, the geometry, and the specific column we want to make a GIF from.
td = fnl.filter(regex="final|tract|geometry")
td = td.reindex(sorted(td.columns), axis=1)
td.head()
# Get Min Max
mins = []
maxs = []
for col in td.columns:
if col in ['NAME', 'state', 'county', 'tract', 'geometry'] :
pass
else:
mins.append(td[col].min())
maxs.append(td[col].max())
print(mins, maxs)
# set the min and max range for the choropleth map
vmin, vmax = min(mins), max(maxs)
merged = td
fileNames = []
annotation = 'Source: Baltimore Neighborhood Indicators Alliance - Jacob France Institute, 2019'
saveGifAs = './TESTGIF.gif'
labelBounds = True
specialLabelCol = False
# For each column
for indx, col in enumerate(merged.columns):
if col in ['NAME', 'state', 'county', 'tract', 'geometry'] :
pass
else:
print('Col Index: ', indx)
print('Col Name: '+str(col) )
# create map, UPDATE: added plt.Normalize to keep the legend range the same for all maps
fig = merged.plot(column=col, cmap='Blues', figsize=(10,10),
linewidth=0.8, edgecolor='0.8', vmin=vmin, vmax=vmax,
legend=True, norm=plt.Normalize(vmin=vmin, vmax=vmax)
)
print('Fig Created')
# https://stackoverflow.com/questions/38899190/geopandas-label-polygons
if labelBounds:
labelColumn = col
if specialLabelCol: labelColumn = specialLabelCol
merged.apply(lambda x: fig.annotate(text=x[labelColumn], xy=x.geometry.centroid.coords[0], ha='center'),axis=1);  # 'text=' replaces the 's=' keyword removed in newer matplotlib
# remove axis of chart and set title
fig.axis('off')
col = col.replace("_final", "").replace("_", " ")
# str(col.replace("_", " ")[12:])
fig.set_title(str(col), fontdict={'fontsize': '10', 'fontweight' : '3'})
print('Fig Titled')
# create an annotation for the data source
fig.annotate(annotation,
xy=(0.1, .08), xycoords='figure fraction',
horizontalalignment='left', verticalalignment='top',
fontsize=10, color='#555555')
print('Fig annotated')
# this will save the figure as a high-res image (here a .jpg) in the output path. You can also save as png or svg if you prefer.
# filepath = os.path.join(output_path, image_name)
image_name = 'test_'+col+'.jpg'
fileNames.append(image_name)
chart = fig.get_figure()
# fig.savefig("map_export.png", dpi=300)
chart.savefig(image_name, dpi=300)
plt.close(chart)
images = []
for filename in fileNames:
images.append(imageio.imread(filename))
imageio.mimsave(saveGifAs, images, fps=.5)
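To preview the finished GIF inline in Colab, you can embed the file in an HTML image tag; a small sketch:
from IPython.display import HTML
import base64
gif_bytes = open(saveGifAs, 'rb').read()
HTML('<img src="data:image/gif;base64,' + base64.b64encode(gif_bytes).decode() + '">')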
Access Google Drive directories:
from google.colab import drive
drive.mount("/content/drive")
You can also upload files directly into the runtime's temporary folder:
from google.colab import files
# Just uncomment the line below and run the cell
# uploaded = files.upload()
Now let's explore the file system using the built-in terminal commands:
By default you are positioned in the /content/ folder.
cd ./drive/My Drive/colabs/DATA
ls
for file in os.listdir("./"):
if file.endswith(".geojson"):
print(file)
The next two functions will help navigate the file directories:
# Find Relative Path to Files
def findFile(root, file):
for d, subD, f in os.walk(root):
if file in f:
return "{1}/{0}".format(file, d)
break
# To 'import' a script you wrote, add the folder containing it to sys.path
def addPath(root, file): sys.path.append(os.path.dirname(os.path.abspath( findFile( root, file) )))
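For example, to make one of the saved indicator scripts importable from elsewhere on your Drive (the filename here is hypothetical):
addPath('../', 'mhhi.py')   # adds the folder containing mhhi.py to sys.path
import mhhi                 # the script can now be imported as a module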