# default_exp merge
This Coding Notebook is the second in a series. An interactive version can be found here. This colab and more can be found on our webpage.
- Content covered in previous tutorials will be used in later tutorials.
- New code and/or information should have explanations and/or descriptions attached.
- Concepts or code covered in previous tutorials will be used without being explained in their entirety.
- The Dataplay Handbook uses development techniques covered in the Datalabs Guidebook.
- If content cannot be found in the current tutorial and is not covered in a previous tutorial, please let me know.
- This notebook has been optimized for Google Colab run in a Chrome browser.
- Statements found in the index page on views expressed, responsibility, errors and omissions, use at risk, and licensing extend throughout the tutorial.
About this Tutorial:
What's Inside?
The Tutorial
In this notebook, the basics of how to perform a merge are introduced.
- We will merge two datasets
- We will merge two datasets using a crosswalk
Objectives
By the end of this tutorial users should have an understanding of:
- How dataset merges are performed
- The different types of join a merge can take
- The mergeDatasets() function, and how to use it in the future
Guided Walkthrough
SETUP
Install these libraries onto the virtual environment.
```python
!pip install geopandas
!pip install VitalSigns
```

```python
#hide
t = """
!pip install nbdev
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/'Software Development Documents'/dataplay/
"""
```

```python
# !pip install dataplay
```

```python
#export
from dataplay.intaker import Intake
from VitalSigns.acsDownload import retrieve_acs_data
```

```python
%load_ext autoreload
%autoreload 2
```

    The autoreload extension is already loaded. To reload it, use:
      %reload_ext autoreload

```python
ls
```

    build/           dataplay/  docs/    Makefile     notebooks/  settings.ini
    CONTRIBUTING.md  dist/      LICENSE  MANIFEST.in  README.md   setup.py
```python
!nbdev_update_lib -h
```

    usage: nbdev_update_lib [-h] [--fname FNAME] [--silent SILENT]

    Propagates any change in the modules matching `fname` to the notebooks that created them

    optional arguments:
      -h, --help       show this help message and exit
      --fname FNAME    A python filename or glob to convert
      --silent SILENT  Don't print results (default: False)

```python
ls
```

    build/           dataplay/  docs/    Makefile     notebooks/  settings.ini
    CONTRIBUTING.md  dist/      LICENSE  MANIFEST.in  README.md   setup.py
```python
!nbdev_build_lib
```

    Converted 01_Download_and_Load.ipynb.
    Converted 02_Merge_Data.ipynb.
    Converted 03_Map_Basics_Intake_and_Operations.ipynb.
    Converted 04_nb_2_html.ipynb.
    Converted 05_Map_Correlation_Networks.ipynb.
    Converted 06_Timelapse_Data_Gifs.ipynb.
    Converted index.ipynb.

```python
# individuals not working
!nbdev_update_lib
```

    Converted intaker.py.
    Converted merge.py.
    Converted geoms.py.
    Converted gifmap.py.

```python
# !nbdev_build_lib --fname './notebooks/01_Download_and_Load.ipynb'
# !nbdev_build_lib --fname './notebooks/02_Merge_Data.ipynb'
```

    Converted 01_Download_and_Load.ipynb.
```python
#export
# @title Run: Import Modules
# These imports will handle everything
import numpy as np
import pandas as pd
```

```python
# hide
pd.set_option('display.max_colwidth', -1)
pd.set_option('max_colwidth', 20)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.precision', 2)
```

Retrieve Datasets
Our example will merge two simple datasets: pulling CSA names using tract IDs.
The first dataset will be obtained from the Census' ACS 5-year surveys.
Functions used to obtain this data come from Tutorial 0) ACS: Explore and Download.
The second dataset is from a publicly accessible link.
Get the Principal dataset.
We will use the function we created in our last tutorial to download the data!
```python
# Our download function will use Baltimore City's tract, county and state as internal parameters
# Changing these values in the cell below to different geographic reference codes will change those parameters
tract = '*'
county = '510'
state = '24'
# Specify the download parameters the function will receive here
tableId = 'B19001'
year = '17'
saveAcs = False
```

```python
df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)
df.head()
```

    Number of Columns 17
| NAME | B19001_001E_Total | B19001_002E_Total_Less_than_$10,000 | B19001_003E_Total_$10,000_to_$14,999 | B19001_004E_Total_$15,000_to_$19,999 | B19001_005E_Total_$20,000_to_$24,999 | B19001_006E_Total_$25,000_to_$29,999 | B19001_007E_Total_$30,000_to_$34,999 | B19001_008E_Total_$35,000_to_$39,999 | B19001_009E_Total_$40,000_to_$44,999 | B19001_010E_Total_$45,000_to_$49,999 | B19001_011E_Total_$50,000_to_$59,999 | B19001_012E_Total_$60,000_to_$74,999 | B19001_013E_Total_$75,000_to_$99,999 | B19001_014E_Total_$100,000_to_$124,999 | B19001_015E_Total_$125,000_to_$149,999 | B19001_016E_Total_$150,000_to_$199,999 | B19001_017E_Total_$200,000_or_more | state | county | tract |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Census Tract 2710.02 | 1510 | 209 | 73 | 94 | 97 | 110 | 119 | 97 | 65 | 36 | 149 | 168 | 106 | 66 | 44 | 50 | 27 | 24 | 510 | 271002 |
| Census Tract 2604.02 | 1134 | 146 | 29 | 73 | 80 | 41 | 91 | 49 | 75 | 81 | 170 | 57 | 162 | 63 | 11 | 6 | 0 | 24 | 510 | 260402 |
| Census Tract 2712 | 2276 | 69 | 43 | 41 | 22 | 46 | 67 | 0 | 30 | 30 | 80 | 146 | 321 | 216 | 139 | 261 | 765 | 24 | 510 | 271200 |
| Census Tract 2804.04 | 961 | 111 | 108 | 61 | 42 | 56 | 37 | 73 | 30 | 31 | 106 | 119 | 74 | 23 | 27 | 24 | 39 | 24 | 510 | 280404 |
| Census Tract 901 | 1669 | 158 | 124 | 72 | 48 | 108 | 68 | 121 | 137 | 99 | 109 | 191 | 160 | 141 | 28 | 88 | 17 | 24 | 510 | 90100 |
Get the Secondary Dataset
Spatial data can be obtained using the 2010 Census Tract Shapefile Picking Tool, or by searching the Census website for TIGER/Line Shapefiles.
"The core TIGER/Line Files and Shapefiles do not include demographic data, but they do contain geographic entity codes (GEOIDs) that can be linked to the Census Bureau’s demographic data, available on data.census.gov." (census.gov)
For this example, we will simply pull a local dataset containing columns labeling tracts within Baltimore City and their corresponding CSA (Community Statistical Area). Typically, we use this dataset internally as a "crosswalk": after a successful merge on the tract column, the result can be merged with a third dataset along its CSA column.
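To make the crosswalk idea concrete, here is a minimal sketch with made-up values: a tract-level record picks up its CSA label through the intermediate table, after which a CSA-level dataset could be joined on.

```python
import pandas as pd

# Hypothetical, tiny stand-ins for the real datasets used below.
acs = pd.DataFrame({'tract': [10100, 10500], 'households': [1510, 1134]})
crosswalk = pd.DataFrame({'TRACTCE10': [10100, 10500],
                          'CSA2010': ['Canton', 'Fells Point']})

# Merging on the tract columns attaches each record's CSA label.
labeled = acs.merge(crosswalk, left_on='tract', right_on='TRACTCE10', how='left')
print(labeled)
```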
```python
# !wget https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv
# or, alternately:
# !curl https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv > CSA-to-Tract-2010.csv
print('Boundaries Example:CSA-to-Tract-2010.csv')
```

    Boundaries Example:CSA-to-Tract-2010.csv
```python
# Get the Second dataset.
# Our Example dataset contains Polygon Geometry information.
# We want to merge this over to our principal dataset.
# We will grab it by matching on either CSA or Tract.
# The url listed below is public.
print('Tract 2 CSA Crosswalk : CSA-to-Tract-2010.csv')
from dataplay.intaker import Intake
crosswalk = Intake.getData( 'https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv' )
crosswalk.head()
```

    Tract 2 CSA Crosswalk : CSA-to-Tract-2010.csv
| | TRACTCE10 | GEOID10 | CSA2010 |
|---|---|---|---|
| 0 | 10100 | 24510010100 | Canton |
| 1 | 10200 | 24510010200 | Patterson Park N... |
| 2 | 10300 | 24510010300 | Canton |
| 3 | 10400 | 24510010400 | Canton |
| 4 | 10500 | 24510010500 | Fells Point |

```python
crosswalk.columns
```

    Index(['TRACTCE10', 'GEOID10', 'CSA2010'], dtype='object')

Perform Merge & Save
The following picture does nothing important but serves as a friendly reminder of the 4 basic join types (a toy sketch of all four follows below):
- Left - returns all left records; only includes the right record if it has a match
- Right - returns all right records; only includes the left record if it has a match
- Full (outer) - returns all records regardless of keys matching
- Inner - returns only records where the keys match
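As a quick illustration with toy data (not part of this tutorial's datasets), the four join types behave as follows; note that pandas spells the full join 'outer':

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'left_val': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'right_val': [3, 4]})

print(left.merge(right, on='key', how='left'))   # keys a, b
print(left.merge(right, on='key', how='right'))  # keys b, c
print(left.merge(right, on='key', how='outer'))  # keys a, b, c (the "full" join)
print(left.merge(right, on='key', how='inner'))  # key b only
```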
Get Columns from both datasets to match on
You can get these values from the column values above.
Our Examples will work with the prompted values
```python
print( 'Principal Columns ' + str(df.columns) + '')
left_on = input("Left on crosswalk column: ('tract') \n" ) or "tract"
print(' \n ');
print( 'Crosswalk Columns ' + str(crosswalk.columns) + '')
right_on = input("Right on crosswalk column: ('TRACTCE10') \n" ) or "TRACTCE10"
```

    Principal Columns Index(['B19001_001E_Total', 'B19001_002E_Total_Less_than_$10,000',
           'B19001_003E_Total_$10,000_to_$14,999',
           'B19001_004E_Total_$15,000_to_$19,999',
           'B19001_005E_Total_$20,000_to_$24,999',
           'B19001_006E_Total_$25,000_to_$29,999',
           'B19001_007E_Total_$30,000_to_$34,999',
           'B19001_008E_Total_$35,000_to_$39,999',
           'B19001_009E_Total_$40,000_to_$44,999',
           'B19001_010E_Total_$45,000_to_$49,999',
           'B19001_011E_Total_$50,000_to_$59,999',
           'B19001_012E_Total_$60,000_to_$74,999',
           'B19001_013E_Total_$75,000_to_$99,999',
           'B19001_014E_Total_$100,000_to_$124,999',
           'B19001_015E_Total_$125,000_to_$149,999',
           'B19001_016E_Total_$150,000_to_$199,999',
           'B19001_017E_Total_$200,000_or_more', 'state', 'county', 'tract'],
          dtype='object')

    Left on crosswalk column: ('tract')

    Crosswalk Columns Index(['TRACTCE10', 'GEOID10', 'CSA2010'], dtype='object')
    Right on crosswalk column: ('TRACTCE10')
Specify how the merge will be performed
We will perform a left merge in this example (note that pressing Enter at the prompt falls back to the code's default, 'outer').
It will return our Principal dataset with columns from the second dataset appended to records where their specified columns match.

```python
how = input("How: (‘left’, ‘right’, ‘outer’, ‘inner’) " ) or 'outer'
```

    How: (‘left’, ‘right’, ‘outer’, ‘inner’)
Actually perform the merge

```python
merged_df = pd.merge(df, crosswalk, left_on=left_on, right_on=right_on, how=how)
merged_df = merged_df.drop(left_on, axis=1)
merged_df.head()
```
| | B19001_001E_Total | B19001_002E_Total_Less_than_$10,000 | B19001_003E_Total_$10,000_to_$14,999 | B19001_004E_Total_$15,000_to_$19,999 | B19001_005E_Total_$20,000_to_$24,999 | B19001_006E_Total_$25,000_to_$29,999 | B19001_007E_Total_$30,000_to_$34,999 | B19001_008E_Total_$35,000_to_$39,999 | B19001_009E_Total_$40,000_to_$44,999 | B19001_010E_Total_$45,000_to_$49,999 | B19001_011E_Total_$50,000_to_$59,999 | B19001_012E_Total_$60,000_to_$74,999 | B19001_013E_Total_$75,000_to_$99,999 | B19001_014E_Total_$100,000_to_$124,999 | B19001_015E_Total_$125,000_to_$149,999 | B19001_016E_Total_$150,000_to_$199,999 | B19001_017E_Total_$200,000_or_more | state | county | TRACTCE10 | GEOID10 | CSA2010 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 568 | 128 | 24 | 44 | 41 | 4 | 25 | 20 | 0 | 21 | 18 | 31 | 129 | 32 | 37 | 9 | 5 | 24 | 510 | 130805.0 | 2.45e+10 | Mount Washington... |
| 1 | 967 | 99 | 40 | 25 | 81 | 77 | 21 | 42 | 47 | 5 | 41 | 100 | 132 | 68 | 79 | 82 | 28 | 24 | 510 | 210100.0 | 2.45e+10 | Washington Villa... |
| 2 | 1139 | 191 | 93 | 81 | 67 | 63 | 41 | 51 | 76 | 68 | 50 | 120 | 114 | 22 | 0 | 33 | 69 | 24 | 510 | 270701.0 | 2.45e+10 | Harford/Echodale |
| 3 | 808 | 195 | 114 | 80 | 49 | 76 | 81 | 26 | 70 | 0 | 33 | 19 | 31 | 13 | 0 | 15 | 6 | 24 | 510 | 190100.0 | 2.45e+10 | Southwest Baltimore |
| 4 | 698 | 58 | 69 | 131 | 32 | 39 | 26 | 19 | 24 | 32 | 60 | 74 | 55 | 5 | 34 | 21 | 19 | 24 | 510 | 190200.0 | 2.45e+10 | Southwest Baltimore |
As you can see, our Census data will now have a CSA appended to it.
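Before saving, it can be worth checking how many records failed to match. A minimal sketch, assuming the default 'outer' merge and the column names shown above:

```python
# After an outer merge, principal records with no crosswalk match have NaN in CSA2010.
unmatched = merged_df[merged_df['CSA2010'].isna()]
print(len(unmatched), 'records did not match a CSA')
```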
```python
# Save Data to User Specified File
# outFile = input("Please enter the new Filename to save the data to ('acs_csa_merge_test'): " )
# merged_df.to_csv(outFile+'.csv', quoting=csv.QUOTE_ALL)
```

Final Result
```python
import csv  # needed for csv.QUOTE_ALL below

flag = input("Enter a URL? If not ACS data will be used. (Y/N): " ) or "N"
if (flag == 'y' or flag == 'Y'):
    left_df = Intake.getData( input("Please enter the location of your Left file: " ) )
else:
    tract = input("Please enter tract id (*): " ) or "*"
    county = input("Please enter county id (510): " ) or "510"
    state = input("Please enter state id (24): " ) or "24"
    tableId = input("Please enter acs table id (B19001): " ) or "B19001"
    year = input("Please enter acs year (18): " ) or "18"
    saveAcs = input("Save ACS? (Y/N): " ) or "N"
    left_df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)

print('right_df Example: CSA-to-Tract-2010.csv')
right_df = Intake.getData( input("Please enter the location of your right_df file: " ) or 'https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv' )

print( 'Left Columns ' + str(left_df.columns))
print( '\n ')
print( 'right_df Columns ' + str(right_df.columns) + '\n')
left_on = input("Left on: " ) or 'tract'
right_on = input("Right on: " ) or 'TRACTCE10'
how = input("How: (‘left’, ‘right’, ‘outer’, ‘inner’) " ) or 'outer'

merged_df = pd.merge(left_df, right_df, left_on=left_on, right_on=right_on, how=how)
merged_df = merged_df.drop(left_on, axis=1)

# Save the data
saveFile = input("Save File ('Y' or 'N'): ") or 'N'
if saveFile == 'Y' or saveFile == 'y':
    outFile = input("Saved Filename (Do not include the file extension ): ")
    merged_df.to_csv(outFile+'.csv', quoting=csv.QUOTE_ALL);
```

    Enter a URL? If not ACS data will be used. (Y/N):
    Please enter tract id (*):
    Please enter county id (510):
    Please enter state id (24):
    Please enter acs table id (B19001):
    Please enter acs year (18):
    Save ACS? (Y/N):
    Number of Columns 17
    right_df Example: CSA-to-Tract-2010.csv
    Please enter the location of your right_df file:
    Left Columns Index(['B19001_001E_Total', 'B19001_002E_Total_Less_than_$10,000',
           'B19001_003E_Total_$10,000_to_$14,999',
           'B19001_004E_Total_$15,000_to_$19,999',
           'B19001_005E_Total_$20,000_to_$24,999',
           'B19001_006E_Total_$25,000_to_$29,999',
           'B19001_007E_Total_$30,000_to_$34,999',
           'B19001_008E_Total_$35,000_to_$39,999',
           'B19001_009E_Total_$40,000_to_$44,999',
           'B19001_010E_Total_$45,000_to_$49,999',
           'B19001_011E_Total_$50,000_to_$59,999',
           'B19001_012E_Total_$60,000_to_$74,999',
           'B19001_013E_Total_$75,000_to_$99,999',
           'B19001_014E_Total_$100,000_to_$124,999',
           'B19001_015E_Total_$125,000_to_$149,999',
           'B19001_016E_Total_$150,000_to_$199,999',
           'B19001_017E_Total_$200,000_or_more', 'state', 'county', 'tract'],
          dtype='object')

    right_df Columns Index(['TRACTCE10', 'GEOID10', 'CSA2010'], dtype='object')

    Left on:
    Right on:
    How: (‘left’, ‘right’, ‘outer’, ‘inner’)
    Save File ('Y' or 'N'):
```python
merged_df.head(1)
```

| | B19001_001E_Total | B19001_002E_Total_Less_than_$10,000 | B19001_003E_Total_$10,000_to_$14,999 | B19001_004E_Total_$15,000_to_$19,999 | B19001_005E_Total_$20,000_to_$24,999 | B19001_006E_Total_$25,000_to_$29,999 | B19001_007E_Total_$30,000_to_$34,999 | B19001_008E_Total_$35,000_to_$39,999 | B19001_009E_Total_$40,000_to_$44,999 | B19001_010E_Total_$45,000_to_$49,999 | B19001_011E_Total_$50,000_to_$59,999 | B19001_012E_Total_$60,000_to_$74,999 | B19001_013E_Total_$75,000_to_$99,999 | B19001_014E_Total_$100,000_to_$124,999 | B19001_015E_Total_$125,000_to_$149,999 | B19001_016E_Total_$150,000_to_$199,999 | B19001_017E_Total_$200,000_or_more | state | county | TRACTCE10 | GEOID10 | CSA2010 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 568 | 128 | 24 | 44 | 41 | 4 | 25 | 20 | 0 | 21 | 18 | 31 | 129 | 32 | 37 | 9 | 5 | 24 | 510 | 130805.0 | 2.45e+10 | Mount Washington... |
Advanced
For this next example to work, we will need to import hypothetical csv files.
Intro
The following Python function is a fleshed-out version of the previous notes.
- It contains everything from the tutorial, plus more.
- It can be imported and used in future projects or stand-alone.
Description: adds columns of data from a foreign dataset into a primary dataset along set parameters.
Purpose: makes merging datasets simple.
Services
- Merge two datasets without a crosswalk
- Merge two datasets with a crosswalk
```python
#export
# @title Run: Create mergeDatasets()
# Worried about infinite interactive-loops. Not an issue atm.
# Crosswalk needs to have exact same column names as left/right datasets
def mergeDatasets(left_ds=False, right_ds=False, crosswalk_ds=False,
                  left_col=False, right_col=False,
                  crosswalk_left_col=False, crosswalk_right_col=False,
                  merge_how=False,  # left, right, outer, inner, or a column name to retrieve
                  interactive=True):
    # Interactive will ask if use_crosswalk unless crosswalk_ds == 'no'

    # 1. Used on the Right Dataset in case merge_how is a column to pull. Returns False or the column.
    def checkMergeHow(ds, how, interactive):
        inList = how in ['left', 'right', 'outer', 'inner']
        inDf = Intake.checkColumn(ds, how)
        if (inList or inDf): return how
        elif (not interactive): return False
        else:
            try:
                print('\n Invalid merge column given. \n Please select a value from either list');
                print("\n 1) Pull A single column from the right dataset: ", ds.columns)
                print("OR \n 2) Specify a type of join operation: (‘left’, ‘right’, ‘outer’, ‘inner’, columnName) ")
                return checkMergeHow(ds, input("Column Name: "), interactive);
            except: return False  # User probably trying to escape interactivity

    # 2i. Load data via url. Coerce dtypes as needed for the merge.
    def coerceForMerge(msg, first_ds, second_ds, first_col, second_col, interactive):
        if (interactive):
            print(f'\n---Casting Datatypes from-to: {msg} Datasets---');
            print('Before Casting: ');
            print('-> Column One: ', first_col, first_ds[first_col].dtype)
            print('-> Column Two: ', second_col, second_ds[second_col].dtype)
        second_ds, second_col = Intake.getAndCheck(second_ds, second_col, interactive)
        first_ds, second_ds, status = Intake.coerce(first_ds, second_ds, first_col, second_col, interactive);
        if (not status and interactive): print('\n There was a problem!');
        if (interactive):
            print('\n After Casting: ');
            print('-> Column One: ', first_col, first_ds[first_col].dtype)
            print('-> Column Two: ', second_col, second_ds[second_col].dtypes)
        return first_ds, second_ds, second_col, status

    # 2ii. Merge (or pull), then filter out unmatched records.
    def mergeAndFilter(msg, first_ds, second_ds, first_col, second_col, how, interactive):
        if interactive:
            print(f'---PERFORMING MERGE : {msg}---');
            print('Column One : ', first_col, first_ds[first_col].dtype)
            print('How: ', how)
            print('Column Two : ', second_col, second_ds[second_col].dtype)
        first_ds = mergeOrPull(first_ds, second_ds, first_col, second_col, how)
        return filterEmpties(first_ds, second_ds, first_col, second_col, how, interactive)

    # Decide whether to perform a merge or commit a pull
    def mergeOrPull(df, cw, left_on, right_on, how):
        def merge(df, cw, left_on, right_on, how):
            df = pd.merge(df, cw, left_on=left_on, right_on=right_on, how=how)
            # df.drop(left_on, axis=1)
            df[right_on] = df[right_on].fillna(value='empty')
            return df

        def pull(df, cw, left_on, right_on, how):
            crswlk = dict(zip(cw[right_on], cw[how]))
            dtype = df[left_on].dtype
            if dtype == 'object':
                df[how] = df.apply(lambda row: crswlk.get(str(row[left_on]), "empty"), axis=1)
            elif dtype == 'int64':
                df[how] = df.apply(lambda row: crswlk.get(int(row[left_on]), "empty"), axis=1)
            return df

        mergeType = how in ['left', 'right', 'outer', 'inner']
        if mergeType: return merge(df, cw, left_on, right_on, how)
        else: return pull(df, cw, left_on, right_on, how)

    # 2iii. Filter between matched records and not.
    def filterEmpties(df, cw, left_on, right_on, how, interactive):
        if how in ['left', 'right', 'outer', 'inner']: how = right_on
        nomatch = df.loc[df[how] == 'empty']
        nomatch = nomatch.sort_values(by=left_on, ascending=True)
        if nomatch.shape[0] > 0:
            # Do the same thing with our foreign tracts
            if (interactive):
                print('\n Local Column Values Not Matched ')
                print(nomatch[left_on].unique())
                print(len(nomatch[left_on]))
                print('\n Crosswalk Unique Column Values')
                print(cw[right_on].unique())
        # Drop records whose key was never matched to a crosswalk value
        df[how].replace('empty', np.nan, inplace=True)
        df.dropna(subset=[how], inplace=True)
        # cw = cw.sort_values(by=how, ascending=True)
        return df

    # 0. Retrieve the left and right datasets.
    if (interactive): print('---Handling Left Dataset Options---');
    left_ds, left_col = Intake.getAndCheck(left_ds, left_col, interactive)
    if (interactive): print('Left column:', left_col)
    if (interactive): print('\n---Handling Right Dataset Options---');
    right_ds, right_col = Intake.getAndCheck(right_ds, right_col, interactive)
    if (interactive): print('Right column:', right_col)
    if (interactive): print(f"\n---Ensuring Compatability Between merge_how (val: '{merge_how}') and the Right Dataset---");
    merge_how = checkMergeHow(right_ds, merge_how, interactive)
    if (interactive): print("Column or ['inner','left','right','outer'] value: ", merge_how)

    # 1. Retrieve the crosswalk dataset: check left-cw, right-cw. Try coercing.
    if (interactive): print(f'\n---Checking Crosswalk Dataset Options---')
    # If it is not already a dataframe
    if (not Intake.isPandas(crosswalk_ds)):
        default = str(crosswalk_ds).lower() == 'false'
        # If the user left the default crosswalk value 'False', ask them if they want to use one.
        if (default and interactive): crosswalk_ds = input("\nProvide a Crosswalk? ( URL/ PATH or 'NO'/ / 'FALSE' ) ") or False
        # Check if user opted to not use a crosswalk
        use_crosswalk = not (str(crosswalk_ds).lower() in ["no", '', 'none', 'false'])
        if (use_crosswalk):
            crosswalk_ds, crosswalk_left_col = Intake.getAndCheck(crosswalk_ds, crosswalk_left_col, interactive)
            crosswalk_ds, crosswalk_right_col = Intake.getAndCheck(crosswalk_ds, crosswalk_right_col, interactive)

    # 3. Coerce all datasets for the merge.
    if (Intake.isPandas(crosswalk_ds)):
        print('crosswalk_left_col', crosswalk_left_col)
        left_ds, crosswalk_ds, crosswalk_left_col, status = coerceForMerge('Left->Crosswalk', left_ds, crosswalk_ds, left_col, crosswalk_left_col, interactive)
        right_ds, crosswalk_ds, crosswalk_right_col, status = coerceForMerge('Right->Crosswalk', right_ds, crosswalk_ds, right_col, crosswalk_right_col, interactive)
    else:
        left_ds, right_ds, right_col, status = coerceForMerge('Left->Right', left_ds, right_ds, left_col, right_col, interactive)
    if (interactive): print('\n---All checks complete. Status: ', status, '---\n');

    if (not status):
        if (interactive): print('Merge Incomplete. Thank you!');
        return False;
    else:
        if (Intake.isPandas(crosswalk_ds)):
            left_ds = mergeAndFilter('LEFT->CROSSWALK', left_ds, crosswalk_ds, left_col, crosswalk_left_col, crosswalk_right_col, interactive)
            left_col = crosswalk_right_col
        left_ds = mergeAndFilter('LEFT->RIGHT', left_ds, right_ds, left_col, right_col, merge_how, interactive)
        return left_ds
```

Function Explanation
Input(s):
- Dataset url
- Crosswalk Url
- Right On
- Left On
- How
- New Filename
Output: File
How it works (a toy sketch of the two modes follows below):
- Read in datasets
- Perform merge
- If the 'how' parameter is one of ['left', 'right', 'outer', 'inner'], then a merge will be performed.
- If a column name is provided in the 'how' parameter, then that single column will be pulled from the right dataset as a new column in the left_ds.
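The difference between the two modes can be seen with toy data; a minimal sketch, with hypothetical tract and CSA values, mirroring the merge() and pull() helpers above:

```python
import pandas as pd

left = pd.DataFrame({'tract': [101, 102], 'pop': [1500, 1100]})
cw = pd.DataFrame({'tract': [101, 102], 'CSA': ['Canton', 'Fells Point']})

# how is a join type: a full pd.merge, and every crosswalk column comes along.
merged = pd.merge(left, cw, left_on='tract', right_on='tract', how='left')

# how is a column name: only that column is pulled over, dictionary-style,
# as in the pull() helper inside mergeDatasets().
lookup = dict(zip(cw['tract'], cw['CSA']))
left['CSA'] = left['tract'].map(lambda t: lookup.get(t, 'empty'))
```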
Function Diagrams
- Diagram the mergeDatasets()
- mergeDatasets Flow Chart
- Gantt Chart mergeDatasets()
- Sequence Diagram mergeDatasets()
Function Examples

```python
# from VitalSigns.acsDownload import retrieve_acs_data
# from dataplay.geoms import readInGeometryData
```

Interactive Example 1. Merge Esri Data

```python
# Table: Household Childhood Poverty
# Hhchpov = Intake.getData("https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson", interactive=True)
# Hhchpov = Hhchpov[['CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17', 'hhchpov18']]
left_ds = "https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson"
left_col = 'CSA2010'

# Table: Household Poverty
# Hhpov = Intake.getData("https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson", interactive=True)
# Hhpov = Hhpov[['CSA2010', 'hhpov15', 'hhpov16', 'hhpov17', 'hhpov18']]
right_ds = "https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson"
right_col = 'CSA2010'

merge_how = 'outer'
interactive = False
merged_df = mergeDatasets(left_ds=left_ds, right_ds=right_ds, crosswalk_ds='no',
                          left_col=left_col, right_col=right_col,
                          crosswalk_left_col=False, crosswalk_right_col=False,
                          merge_how=merge_how,  # left, right, or a column name to retrieve
                          interactive=interactive)
merged_df.head()
```
| | OBJECTID_x | CSA2010 | hhchpov15 | hhchpov16 | hhchpov17 | hhchpov18 | hhchpov19 | Shape__Area_x | Shape__Length_x | geometry_x | OBJECTID_y | hhpov15 | hhpov16 | hhpov17 | hhpov18 | hhpov19 | Shape__Area_y | Shape__Length_y | geometry_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Allendale/Irving... | 38.93 | 34.73 | 32.77 | 35.27 | 32.60 | 6.38e+07 | 38770.17 | POLYGON ((-76.65... | 1 | 24.15 | 21.28 | 20.70 | 23.00 | 19.18 | 6.38e+07 | 38770.17 | POLYGON ((-76.65... |
| 1 | 2 | Beechfield/Ten H... | 19.42 | 21.22 | 23.92 | 21.90 | 15.38 | 4.79e+07 | 37524.95 | POLYGON ((-76.69... | 2 | 11.17 | 11.59 | 10.47 | 10.90 | 8.82 | 4.79e+07 | 37524.95 | POLYGON ((-76.69... |
| 2 | 3 | Belair-Edison | 36.88 | 36.13 | 34.56 | 39.74 | 41.04 | 4.50e+07 | 31307.31 | POLYGON ((-76.56... | 3 | 18.61 | 19.59 | 20.27 | 22.83 | 22.53 | 4.50e+07 | 31307.31 | POLYGON ((-76.56... |
| 3 | 4 | Brooklyn/Curtis ... | 45.01 | 46.45 | 46.41 | 39.89 | 41.39 | 1.76e+08 | 150987.70 | MULTIPOLYGON (((... | 4 | 28.36 | 26.33 | 24.21 | 21.54 | 24.60 | 1.76e+08 | 150987.70 | MULTIPOLYGON (((... |
| 4 | 5 | Canton | 5.49 | 2.99 | 4.02 | 4.61 | 4.83 | 1.54e+07 | 23338.61 | POLYGON ((-76.57... | 5 | 3.00 | 2.26 | 3.66 | 2.05 | 2.22 | 1.54e+07 | 23338.61 | POLYGON ((-76.57... |
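Because both Esri layers carry the same bookkeeping columns (OBJECTID, Shape__Area, Shape__Length, geometry), the merged result duplicates them with _x/_y suffixes. The commented-out selections in the example above hint at the fix: trim each frame to the columns you need before merging. A sketch under that assumption:

```python
# Hypothetical trim-before-merge, mirroring the commented-out selections above.
hhchpov = Intake.getData(left_ds)[['CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17', 'hhchpov18']]
hhpov = Intake.getData(right_ds)[['CSA2010', 'hhpov15', 'hhpov16', 'hhpov17', 'hhpov18']]
tidy = hhchpov.merge(hhpov, on='CSA2010', how='outer')  # no _x/_y suffixes
tidy.head()
```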
Example 2 ) Get CSA and Geometry with a Crosswalk using 3 links

```python
# Our download function will use Baltimore City's tract, county and state as internal parameters
# Changing these values in the cell below to different geographic reference codes will change those parameters
tract = '*'
county = '510'  # '059' # 153 '510'
state = '24'  # 51
# Specify the download parameters the function will receive here
tableId = 'B19049'  # 'B19001'
year = '17'
saveAcs = False
```

```python
import pandas as pd
import IPython
# from IPython.core.display import HTML
IPython.core.display.HTML("")
```

```python
# state, county, tract, tableId, year, saveOriginal, save
left_df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)
left_df.head(1)
```

    Number of Columns 5
| NAME | B19049_001E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Total | B19049_002E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_under_25_years | B19049_003E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_25_to_44_years | B19049_004E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_45_to_64_years | B19049_005E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_65_years_and_over | state | county | tract |
|---|---|---|---|---|---|---|---|---|
| Census Tract 2710.02 | 38358 | -666666666 | 34219 | 40972 | 37143 | 24 | 510 | 271002 |
```python
# Table: FDIC Baltimore Banks
# Columns: Bank Name, Address(es), Census Tract
left_ds = left_df
left_col = 'tract'

# Table: Crosswalk Census Communities
# 'TRACTCE10', 'GEOID10', 'CSA2010'
crosswalk_ds = 'https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv'
crosswalk_left_col = 'TRACTCE10'
crosswalk_right_col = 'CSA2010'

# Table: Household Poverty
right_ds = 'https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson'
right_col = 'CSA2010'

interactive = False
merge_how = 'outer'
merged_df_geom = mergeDatasets(left_ds=left_ds, right_ds=right_ds, crosswalk_ds=crosswalk_ds,
                               left_col=left_col, right_col=right_col,
                               crosswalk_left_col=crosswalk_left_col, crosswalk_right_col=crosswalk_right_col,
                               merge_how=merge_how,  # left, right, or a column name to retrieve
                               interactive=interactive)
merged_df_geom.head()
```

    crosswalk_left_col TRACTCE10
| | B19049_001E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Total | B19049_002E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_under_25_years | B19049_003E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_25_to_44_years | B19049_004E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_45_to_64_years | B19049_005E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_65_years_and_over | state | county | tract | CSA2010 | OBJECTID | hhpov15 | hhpov16 | hhpov17 | hhpov18 | hhpov19 | Shape__Area | Shape__Length | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38358 | -666666666 | 34219 | 40972 | 37143 | 24 | 510 | 271002 | Greater Govans | 20.0 | 21.32 | 19.27 | 19.53 | 17.99 | 20.50 | 2.27e+07 | 22982.13 | POLYGON ((-76.59... |
| 1 | 44904 | -666666666 | 51324 | 42083 | 37269 | 24 | 510 | 90100 | Greater Govans | 20.0 | 21.32 | 19.27 | 19.53 | 17.99 | 20.50 | 2.27e+07 | 22982.13 | POLYGON ((-76.59... |
| 2 | 35707 | 2499 | 42292 | 37361 | 29191 | 24 | 510 | 271001 | Greater Govans | 20.0 | 21.32 | 19.27 | 19.53 | 17.99 | 20.50 | 2.27e+07 | 22982.13 | POLYGON ((-76.59... |
| 3 | 42231 | -666666666 | 46467 | 45484 | 18750 | 24 | 510 | 260402 | Claremont/Armistead | 9.0 | 21.27 | 23.59 | 24.00 | 24.64 | 23.88 | 6.12e+07 | 40104.42 | POLYGON ((-76.52... |
| 4 | 31657 | 20800 | 26074 | 60959 | 37396 | 24 | 510 | 260403 | Claremont/Armistead | 9.0 | 21.27 | 23.59 | 24.00 | 24.64 | 23.88 | 6.12e+07 | 40104.42 | POLYGON ((-76.52... |
Here we can save the data so that it may be used in later tutorials.

```python
# string = 'test_save_data_with_geom_and_csa'
# merged_df.to_csv(string+'.csv', encoding="utf-8", index=False, quoting=csv.QUOTE_ALL)
```

Example 3: Ran Alone
```python
mergeDatasets( ).head(1)
```

    ---Handling Left Dataset Options---
    Getting Data From: False
    Error: Try Again? ( URL/ PATH or 'NO'/ ) https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
    Getting Data From: https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
    Invalid column given: False
    Index(['OBJECTID', 'CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17',
           'hhchpov18', 'hhchpov19', 'Shape__Area', 'Shape__Length', 'geometry'],
          dtype='object')
    Please enter a new column fom the list above.
    Column Name: CSA2010
    Left column: CSA2010

    ---Handling Right Dataset Options---
    Getting Data From: False
    Error: Try Again? ( URL/ PATH or 'NO'/ ) https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
    Getting Data From: https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
    Invalid column given: False
    Index(['OBJECTID', 'CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17',
           'hhchpov18', 'hhchpov19', 'Shape__Area', 'Shape__Length', 'geometry'],
          dtype='object')
    Please enter a new column fom the list above.
    Column Name: CSA2010
    Right column: CSA2010

    ---Ensuring Compatability Between merge_how (val: 'False') and the Right Dataset---
    Invalid merge column given.
    Please select a value from either list

    1) Pull A single column from the right dataset: Index(['OBJECTID', 'CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17',
           'hhchpov18', 'hhchpov19', 'Shape__Area', 'Shape__Length', 'geometry'],
          dtype='object')
    OR
    2) Specify a type of join operation: (‘left’, ‘right’, ‘outer’, ‘inner’, columnName)
    Column Name: inner
    Column or ['inner','left','right','outer'] value: inner

    ---Checking Crosswalk Dataset Options---
    Provide a Crosswalk? ( URL/ PATH or 'NO'/ / 'FALSE' )

    ---Casting Datatypes from-to: Left->Right Datasets---
    Before Casting:
    -> Column One: CSA2010 object
    -> Column Two: CSA2010 object

    After Casting:
    -> Column One: CSA2010 object
    -> Column Two: CSA2010 object

    ---All checks complete. Status: True ---

    ---PERFORMING MERGE : LEFT->RIGHT---
    Column One : CSA2010 object
    How: inner
    Column Two : CSA2010 object
| | OBJECTID_x | CSA2010 | hhchpov15_x | hhchpov16_x | hhchpov17_x | hhchpov18_x | hhchpov19_x | Shape__Area_x | Shape__Length_x | geometry_x | OBJECTID_y | hhchpov15_y | hhchpov16_y | hhchpov17_y | hhchpov18_y | hhchpov19_y | Shape__Area_y | Shape__Length_y | geometry_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Allendale/Irving... | 38.93 | 34.73 | 32.77 | 35.27 | 32.6 | 6.38e+07 | 38770.17 | POLYGON ((-76.65... | 1 | 38.93 | 34.73 | 32.77 | 35.27 | 32.6 | 6.38e+07 | 38770.17 | POLYGON ((-76.65... |
```python
mergeDatasets().head(1)
```

    ---Handling Left Dataset Options---
    Getting Data From: False
    Error: Try Again? ( URL/ PATH or 'NO'/ ) https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
    Getting Data From: https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
    Invalid column given: False
    Index(['OBJECTID', 'CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17',
           'hhchpov18', 'hhchpov19', 'Shape__Area', 'Shape__Length', 'geometry'],
          dtype='object')
    Please enter a new column fom the list above.
    Column Name: CSA2010
    Left column: CSA2010

    ---Handling Right Dataset Options---
    Getting Data From: False
    Error: Try Again? ( URL/ PATH or 'NO'/ ) https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
    Getting Data From: https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
    Invalid column given: False
    Index(['OBJECTID', 'CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17',
           'hhchpov18', 'hhchpov19', 'Shape__Area', 'Shape__Length', 'geometry'],
          dtype='object')
    Please enter a new column fom the list above.
    Column Name: CSA2010
    Right column: CSA2010

    ---Ensuring Compatability Between merge_how (val: 'False') and the Right Dataset---
    Invalid merge column given.
    Please select a value from either list

    1) Pull A single column from the right dataset: Index(['OBJECTID', 'CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17',
           'hhchpov18', 'hhchpov19', 'Shape__Area', 'Shape__Length', 'geometry'],
          dtype='object')
    OR
    2) Specify a type of join operation: (‘left’, ‘right’, ‘outer’, ‘inner’, columnName)
    Column Name: inner
    Column or ['inner','left','right','outer'] value: inner

    ---Checking Crosswalk Dataset Options---
    Provide a Crosswalk? ( URL/ PATH or 'NO'/ / 'FALSE' ) https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
    Getting Data From: https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
    Invalid column given: False
    Index(['OBJECTID', 'CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17',
           'hhchpov18', 'hhchpov19', 'Shape__Area', 'Shape__Length', 'geometry'],
          dtype='object')
    Please enter a new column fom the list above.
    Column Name: CSA2010
    Invalid column given: False
    Index(['OBJECTID', 'CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17',
           'hhchpov18', 'hhchpov19', 'Shape__Area', 'Shape__Length', 'geometry'],
          dtype='object')
    Please enter a new column fom the list above.
    Column Name: CSA2010
    crosswalk_left_col CSA2010

    ---Casting Datatypes from-to: Left->Crosswalk Datasets---
    Before Casting:
    -> Column One: CSA2010 object
    -> Column Two: CSA2010 object

    After Casting:
    -> Column One: CSA2010 object
    -> Column Two: CSA2010 object

    ---Casting Datatypes from-to: Right->Crosswalk Datasets---
    Before Casting:
    -> Column One: CSA2010 object
    -> Column Two: CSA2010 object

    After Casting:
    -> Column One: CSA2010 object
    -> Column Two: CSA2010 object

    ---All checks complete. Status: True ---

    ---PERFORMING MERGE : LEFT->CROSSWALK---
    Column One : CSA2010 object
    How: CSA2010
    Column Two : CSA2010 object

    ---PERFORMING MERGE : LEFT->RIGHT---
    Column One : CSA2010 object
    How: inner
    Column Two : CSA2010 object
| | OBJECTID_x | CSA2010 | hhchpov15_x | hhchpov16_x | hhchpov17_x | hhchpov18_x | hhchpov19_x | Shape__Area_x | Shape__Length_x | geometry_x | OBJECTID_y | hhchpov15_y | hhchpov16_y | hhchpov17_y | hhchpov18_y | hhchpov19_y | Shape__Area_y | Shape__Length_y | geometry_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Allendale/Irving... | 38.93 | 34.73 | 32.77 | 35.27 | 32.6 | 6.38e+07 | 38770.17 | POLYGON ((-76.65... | 1 | 38.93 | 34.73 | 32.77 | 35.27 | 32.6 | 6.38e+07 | 38770.17 | POLYGON ((-76.65... |
# default_exp mergeThis Coding Notebook is the second in a series.
An Interactive version can be found here
.
This colab and more can be found on our webpage.
-
Content covered in previous tutorials will be used in later tutorials.
-
New code and or information should have explanations and or descriptions attached.
-
Concepts or code covered in previous tutorials will be used without being explaining in entirety.
-
The Dataplay Handbook development techniques covered in the Datalabs Guidebook
-
If content can not be found in the current tutorial and is not covered in previous tutorials, please let me know.
-
This notebook has been optimized for Google Colabs ran on a Chrome Browser.
-
Statements found in the index page on view expressed, responsibility, errors and ommissions, use at risk, and licensing extend throughout the tutorial.


About this Tutorial:
Whats Inside?
The Tutorial
In this notebook, the basics of how to perform a merge are introduced.
- We will merge two datasets
- We will merge two datasets using a crosswalk
Objectives
By the end of this tutorial users should have an understanding of:
- How dataset merges are performed
- The types different union approaches a merge can take
- The 'mergeData' function, and how to use it in the future
Guided Walkthrough
SETUP
Install these libraries onto the virtual environment.
!pip install geopandas
!pip install VitalSigns#hide
t = """
!pip install nbdev
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/'Software Development Documents'/dataplay/
"""
# !pip install dataplay#export
from dataplay.intaker import Intake
from VitalSigns.acsDownload import retrieve_acs_data%load_ext autoreload
%autoreload 2The autoreload extension is already loaded. To reload it, use:
%reload_ext autoreload
ls[0m[01;34mbuild[0m/ [01;34mdataplay[0m/ [01;34mdocs[0m/ Makefile [01;34mnotebooks[0m/ settings.ini
CONTRIBUTING.md [01;34mdist[0m/ LICENSE MANIFEST.in README.md setup.py
!nbdev_update_lib -husage: nbdev_update_lib
[-h]
[--fname FNAME]
[--silent SILENT]
Propagates
any change
in the
modules
matching
`fname` to
the
notebooks
that
created
them
optional arguments:
-h, --help
show this
help
message and
exit
--fname FNAME
A python
filename or
glob to
convert
--silent SILENT
Don't print
results
(default:
False)
ls[0m[01;34mbuild[0m/ [01;34mdataplay[0m/ [01;34mdocs[0m/ Makefile [01;34mnotebooks[0m/ settings.ini
CONTRIBUTING.md [01;34mdist[0m/ LICENSE MANIFEST.in README.md setup.py
!nbdev_build_libConverted 01_Download_and_Load.ipynb.
Converted 02_Merge_Data.ipynb.
Converted 03_Map_Basics_Intake_and_Operations.ipynb.
Converted 04_nb_2_html.ipynb.
Converted 05_Map_Correlation_Networks.ipynb.
Converted 06_Timelapse_Data_Gifs.ipynb.
Converted index.ipynb.
# individuals not working
!nbdev_update_libConverted intaker.py.
Converted merge.py.
Converted geoms.py.
Converted gifmap.py.
# !nbdev_build_lib --fname './notebooks/01_Download_and_Load.ipynb'
# !nbdev_build_lib --fname './notebooks/02_Merge_Data.ipynb'Converted 01_Download_and_Load.ipynb.
#export
# @title Run: Import Modules
# These imports will handle everything
import numpy as np
import pandas as pd# hide
pd.set_option('display.max_colwidth', -1)
pd.set_option('max_colwidth', 20)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.precision', 2)Retrieve Datasets
Our example will merge two simple datasets; pulling CSA names using tract ID's.
The First dataset will be obtained from the Census' ACS 5-year serveys.
Functions used to obtain this data were obtained from Tutorial 0) ACS: Explore and Download.
The Second dataset is from a publicly accessible link
Get the Principal dataset.
We will use the function we created in our last tutorial to download the data!
# Our download function will use Baltimore City's tract, county and state as internal paramters
# Change these values in the cell below using different geographic reference codes will change those parameters
tract = '*'
county = '510'
state = '24'
# Specify the download parameters the function will receieve here
tableId = 'B19001'
year = '17'
saveAcs = Falsedf = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)
df.head()Number of Columns 17
|
B19001_001E_Total |
B19001_002E_Total_Less_than_$10,000 |
B19001_003E_Total_$10,000_to_$14,999 |
B19001_004E_Total_$15,000_to_$19,999 |
B19001_005E_Total_$20,000_to_$24,999 |
B19001_006E_Total_$25,000_to_$29,999 |
B19001_007E_Total_$30,000_to_$34,999 |
B19001_008E_Total_$35,000_to_$39,999 |
B19001_009E_Total_$40,000_to_$44,999 |
B19001_010E_Total_$45,000_to_$49,999 |
B19001_011E_Total_$50,000_to_$59,999 |
B19001_012E_Total_$60,000_to_$74,999 |
B19001_013E_Total_$75,000_to_$99,999 |
B19001_014E_Total_$100,000_to_$124,999 |
B19001_015E_Total_$125,000_to_$149,999 |
B19001_016E_Total_$150,000_to_$199,999 |
B19001_017E_Total_$200,000_or_more |
state |
county |
tract |
NAME |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Census Tract 2710.02 |
1510 |
209 |
73 |
94 |
97 |
110 |
119 |
97 |
65 |
36 |
149 |
168 |
106 |
66 |
44 |
50 |
27 |
24 |
510 |
271002 |
Census Tract 2604.02 |
1134 |
146 |
29 |
73 |
80 |
41 |
91 |
49 |
75 |
81 |
170 |
57 |
162 |
63 |
11 |
6 |
0 |
24 |
510 |
260402 |
Census Tract 2712 |
2276 |
69 |
43 |
41 |
22 |
46 |
67 |
0 |
30 |
30 |
80 |
146 |
321 |
216 |
139 |
261 |
765 |
24 |
510 |
271200 |
Census Tract 2804.04 |
961 |
111 |
108 |
61 |
42 |
56 |
37 |
73 |
30 |
31 |
106 |
119 |
74 |
23 |
27 |
24 |
39 |
24 |
510 |
280404 |
Census Tract 901 |
1669 |
158 |
124 |
72 |
48 |
108 |
68 |
121 |
137 |
99 |
109 |
191 |
160 |
141 |
28 |
88 |
17 |
24 |
510 |
90100 |
Get the Secondary Dataset
Spatial data can be attained by using the 2010 Census Tract Shapefile Picking Tool or search their website for
Tiger/Line Shapefiles
The core TIGER/Line Files and Shapefiles do not include demographic data, but they do contain geographic entity codes (GEOIDs) that can be linked to the Census Bureau’s demographic data, available on data.census.gov.-census.gov
For this example, we will simply pull a local dataset containing columns labeling tracts within Baltimore City and their corresponding CSA (Community Statistical Area). Typically, we use this dataset internally as a "crosswalk" where-upon a succesfull merge using the tract column, will be merged with a 3rd dataset along it's CSA column.
# !wget https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csvor, Alternately
# !curl https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv > CSA-to-Tract-2010.csvprint('Boundaries Example:CSA-to-Tract-2010.csv')Boundaries Example:CSA-to-Tract-2010.csv
# Get the Second dataset.
# Our Example dataset contains Polygon Geometry information.
# We want to merge this over to our principle dataset.
# we will grab it by matching on either CSA or Tract
# The url listed below is public.
print('Tract 2 CSA Crosswalk : CSA-to-Tract-2010.csv')
from dataplay.intaker import Intake
crosswalk = Intake.getData( 'https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv' )
crosswalk.head()Tract 2 CSA Crosswalk : CSA-to-Tract-2010.csv
|
TRACTCE10 |
GEOID10 |
CSA2010 |
0 |
10100 |
24510010100 |
Canton |
1 |
10200 |
24510010200 |
Patterson Park N... |
2 |
10300 |
24510010300 |
Canton |
3 |
10400 |
24510010400 |
Canton |
4 |
10500 |
24510010500 |
Fells Point |
crosswalk.columnsIndex(['TRACTCE10', 'GEOID10', 'CSA2010'], dtype='object')Perform Merge & Save
The following picture does nothing important but serves as a friendly reminder of the 4 basic join types.
- Left - returns all left records, only includes the right record if it has a match
- Right - Returns all right records, only includes the left record if it has a match
- Full - Returns all records regardless of keys matching
- Inner - Returns only records where a key match
Get Columns from both datasets to match on
You can get these values from the column values above.
Our Examples will work with the prompted values
print( 'Princpal Columns ' + str(df.columns) + '')
left_on = input("Left on crosswalk column: ('tract') \n" ) or "tract"
print(' \n ');
print( 'Crosswalk Columns ' + str(crosswalk.columns) + '')
right_on = input("Right on crosswalk column: ('TRACTCE10') \n" ) or "TRACTCE10" Princpal Columns Index(['B19001_001E_Total', 'B19001_002E_Total_Less_than_$10,000',
'B19001_003E_Total_$10,000_to_$14,999',
'B19001_004E_Total_$15,000_to_$19,999',
'B19001_005E_Total_$20,000_to_$24,999',
'B19001_006E_Total_$25,000_to_$29,999',
'B19001_007E_Total_$30,000_to_$34,999',
'B19001_008E_Total_$35,000_to_$39,999',
'B19001_009E_Total_$40,000_to_$44,999',
'B19001_010E_Total_$45,000_to_$49,999',
'B19001_011E_Total_$50,000_to_$59,999',
'B19001_012E_Total_$60,000_to_$74,999',
'B19001_013E_Total_$75,000_to_$99,999',
'B19001_014E_Total_$100,000_to_$124,999',
'B19001_015E_Total_$125,000_to_$149,999',
'B19001_016E_Total_$150,000_to_$199,999',
'B19001_017E_Total_$200,000_or_more', 'state', 'county', 'tract'],
dtype='object')
Left on crosswalk column: ('tract')
Crosswalk Columns Index(['TRACTCE10', 'GEOID10', 'CSA2010'], dtype='object')
Right on crosswalk column: ('TRACTCE10')
Specify how the merge will be performed
We will perform a left merge in this example.
It will return our Principal dataset with columns from the second dataset appended to records where their specified columns match.
how = input("How: (‘left’, ‘right’, ‘outer’, ‘inner’) " ) or 'outer'How: (‘left’, ‘right’, ‘outer’, ‘inner’)
Actually perfrom the merge
merged_df = pd.merge(df, crosswalk, left_on=left_on, right_on=right_on, how=how)
merged_df = merged_df.drop(left_on, axis=1)
merged_df.head()
|
B19001_001E_Total |
B19001_002E_Total_Less_than_$10,000 |
B19001_003E_Total_$10,000_to_$14,999 |
B19001_004E_Total_$15,000_to_$19,999 |
B19001_005E_Total_$20,000_to_$24,999 |
B19001_006E_Total_$25,000_to_$29,999 |
B19001_007E_Total_$30,000_to_$34,999 |
B19001_008E_Total_$35,000_to_$39,999 |
B19001_009E_Total_$40,000_to_$44,999 |
B19001_010E_Total_$45,000_to_$49,999 |
B19001_011E_Total_$50,000_to_$59,999 |
B19001_012E_Total_$60,000_to_$74,999 |
B19001_013E_Total_$75,000_to_$99,999 |
B19001_014E_Total_$100,000_to_$124,999 |
B19001_015E_Total_$125,000_to_$149,999 |
B19001_016E_Total_$150,000_to_$199,999 |
B19001_017E_Total_$200,000_or_more |
state |
county |
TRACTCE10 |
GEOID10 |
CSA2010 |
0 |
568 |
128 |
24 |
44 |
41 |
4 |
25 |
20 |
0 |
21 |
18 |
31 |
129 |
32 |
37 |
9 |
5 |
24 |
510 |
130805.0 |
2.45e+10 |
Mount Washington... |
1 |
967 |
99 |
40 |
25 |
81 |
77 |
21 |
42 |
47 |
5 |
41 |
100 |
132 |
68 |
79 |
82 |
28 |
24 |
510 |
210100.0 |
2.45e+10 |
Washington Villa... |
2 |
1139 |
191 |
93 |
81 |
67 |
63 |
41 |
51 |
76 |
68 |
50 |
120 |
114 |
22 |
0 |
33 |
69 |
24 |
510 |
270701.0 |
2.45e+10 |
Harford/Echodale |
3 |
808 |
195 |
114 |
80 |
49 |
76 |
81 |
26 |
70 |
0 |
33 |
19 |
31 |
13 |
0 |
15 |
6 |
24 |
510 |
190100.0 |
2.45e+10 |
Southwest Baltimore |
4 |
698 |
58 |
69 |
131 |
32 |
39 |
26 |
19 |
24 |
32 |
60 |
74 |
55 |
5 |
34 |
21 |
19 |
24 |
510 |
190200.0 |
2.45e+10 |
Southwest Baltimore |
As you can see, our Census data will now have a CSA appended to it.
# Save Data to User Specified File
# outFile = input("Please enter the new Filename to save the data to ('acs_csa_merge_test': " )
# merged_df.to_csv(outFile+'.csv', quoting=csv.QUOTE_ALL) Final Result
flag = input("Enter a URL? If not ACS data will be used. (Y/N): " ) or "N"
if (flag == 'y' or flag == 'Y'):
left_df = Intake.getData( input("Please enter the location of your Left file: " ) )
else:
tract = input("Please enter tract id (*): " ) or "*"
county = input("Please enter county id (510): " ) or "510"
state = input("Please enter state id (24): " ) or "24"
tableId = input("Please enter acs table id (B19001): " ) or "B19001"
year = input("Please enter acs year (18): " ) or "18"
saveAcs = input("Save ACS? (Y/N): " ) or "N"
left_df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)
print('right_df Example: CSA-to-Tract-2010.csv')
right_df = Intake.getData( input("Please enter the location of your right_df file: " ) or 'https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv' )
print( 'Left Columns ' + str(left_df.columns))
print( '\n ')
print( 'right_df Columns ' + str(right_df.columns) + '\n')
left_on = input("Left on: " ) or 'tract'
right_on = input("Right on: " ) or 'TRACTCE10'
how = input("How: (‘left’, ‘right’, ‘outer’, ‘inner’) " ) or 'outer'
merged_df = pd.merge(left_df, right_df, left_on=left_on, right_on=right_on, how=how)
merged_df = merged_df.drop(left_on, axis=1)
# Save the data
# Save the data
saveFile = input("Save File ('Y' or 'N'): ") or 'N'
if saveFile == 'Y' or saveFile == 'y':
outFile = input("Saved Filename (Do not include the file extension ): ")
merged_df.to_csv(outFile+'.csv', quoting=csv.QUOTE_ALL);Enter a URL? If not ACS data will be used. (Y/N):
Please enter tract id (*):
Please enter county id (510):
Please enter state id (24):
Please enter acs table id (B19001):
Please enter acs year (18):
Save ACS? (Y/N):
Number of Columns 17
right_df Example: CSA-to-Tract-2010.csv
Please enter the location of your right_df file:
Left Columns Index(['B19001_001E_Total', 'B19001_002E_Total_Less_than_$10,000',
'B19001_003E_Total_$10,000_to_$14,999',
'B19001_004E_Total_$15,000_to_$19,999',
'B19001_005E_Total_$20,000_to_$24,999',
'B19001_006E_Total_$25,000_to_$29,999',
'B19001_007E_Total_$30,000_to_$34,999',
'B19001_008E_Total_$35,000_to_$39,999',
'B19001_009E_Total_$40,000_to_$44,999',
'B19001_010E_Total_$45,000_to_$49,999',
'B19001_011E_Total_$50,000_to_$59,999',
'B19001_012E_Total_$60,000_to_$74,999',
'B19001_013E_Total_$75,000_to_$99,999',
'B19001_014E_Total_$100,000_to_$124,999',
'B19001_015E_Total_$125,000_to_$149,999',
'B19001_016E_Total_$150,000_to_$199,999',
'B19001_017E_Total_$200,000_or_more', 'state', 'county', 'tract'],
dtype='object')
right_df Columns Index(['TRACTCE10', 'GEOID10', 'CSA2010'], dtype='object')
Left on:
Right on:
How: (‘left’, ‘right’, ‘outer’, ‘inner’)
Save File ('Y' or 'N'):
merged_df.head(1)
|
B19001_001E_Total |
B19001_002E_Total_Less_than_$10,000 |
B19001_003E_Total_$10,000_to_$14,999 |
B19001_004E_Total_$15,000_to_$19,999 |
B19001_005E_Total_$20,000_to_$24,999 |
B19001_006E_Total_$25,000_to_$29,999 |
B19001_007E_Total_$30,000_to_$34,999 |
B19001_008E_Total_$35,000_to_$39,999 |
B19001_009E_Total_$40,000_to_$44,999 |
B19001_010E_Total_$45,000_to_$49,999 |
B19001_011E_Total_$50,000_to_$59,999 |
B19001_012E_Total_$60,000_to_$74,999 |
B19001_013E_Total_$75,000_to_$99,999 |
B19001_014E_Total_$100,000_to_$124,999 |
B19001_015E_Total_$125,000_to_$149,999 |
B19001_016E_Total_$150,000_to_$199,999 |
B19001_017E_Total_$200,000_or_more |
state |
county |
TRACTCE10 |
GEOID10 |
CSA2010 |
0 |
568 |
128 |
24 |
44 |
41 |
4 |
25 |
20 |
0 |
21 |
18 |
31 |
129 |
32 |
37 |
9 |
5 |
24 |
510 |
130805.0 |
2.45e+10 |
Mount Washington... |
Advanced
For this next example to work, we will need to import hypothetical csv files
Intro
The following Python function is a bulked out version of the previous notes.
- It contains everything from the tutorial plus more.
- It can be imported and used in future projects or stand alone.
Description: add columns of data from a foreign dataset into a primary dataset along set parameters.
Purpose: Makes Merging datasets simple
Services
- Merge two datasets without a crosswalk
- Merge two datasets with a crosswalk
#export
#@ title Run: Create mergeDatasets()
# Worried about infinit interactive-loops. not an issue atm.
# Crosswalk needs to have exact same column names as left/right datasets
def mergeDatasets(left_ds=False, right_ds=False, crosswalk_ds=False,
left_col=False, right_col=False,
crosswalk_left_col = False, crosswalk_right_col = False,
merge_how=False, # left right or columnname to retrieve
interactive=True):
# Interactive will ask if use_crosswalk unless crosswalk_ds == 'no'
# 1. Used on Right Dataset in case merge_how is a column to pull. Returns False or Col
def checkMergeHow(ds, how, interactive):
inList = how in ['left', 'right', 'outer', 'inner']
inDf = Intake.checkColumn(ds, how)
if ( inList or inDf ): return how
elif ( not interactive ): return False
else:
try:
print('\n Invalid merge column given. \n Please select a value from either list');
print("\n 1) Pull A single column from the right dataset: ", ds.columns)
print("OR \n 2) Specify a type of join operation: (‘left’, ‘right’, ‘outer’, ‘inner’, columnName) " )
return checkMergeHow(ds, input("Column Name: " ), interactive);
except: return False # User probably trying to escape interactivity
# 2i. Load data via url. Coerce Dtype needed for merge.
def coerceForMerge( msg, first_ds, second_ds, first_col, second_col, interactive ):
if (interactive):
print(f'\n---Casting Datatypes from-to: {msg} Datasets---');
print('Before Casting: ');
print('-> Column One: ', first_col, first_ds[first_col].dtype)
print('-> Column Two: ', second_col, second_ds[second_col].dtype)
second_ds, second_col = Intake.getAndCheck(second_ds, second_col, interactive)
first_ds, second_ds, status = Intake.coerce(first_ds, second_ds, first_col, second_col, interactive);
if (not status and interactive): print('\n There was a problem!');
if (interactive):
print('\n After Casting: ');
print('-> Column One: ', first_col, first_ds[first_col].dtype)
print('-> Column Two: ', second_col, second_ds[second_col].dtypes)
return first_ds, second_ds, second_col, status
# 2ii.
def mergeAndFilter(msg, first_ds, second_ds, first_col, second_col, how, interactive):
if interactive:
print(f'---PERFORMING MERGE : {msg}---');
print('Column One : ', first_col, first_ds[first_col].dtype)
print('How: ', how)
print('Column Two : ', second_col, second_ds[second_col].dtype)
first_ds = mergeOrPull(first_ds, second_ds, first_col, second_col, how)
return filterEmpties(first_ds, second_ds, first_col, second_col, how, interactive)
# Decide to perform a merge or commit a pull
def mergeOrPull(df, cw, left_on, right_on, how):
def merge(df, cw, left_on, right_on, how):
df = pd.merge(df, cw, left_on=left_on, right_on=right_on, how=how)
# df.drop(left_on, axis=1)
df[right_on] = df[right_on].fillna(value='empty')
return df
def pull(df, cw, left_on, right_on, how):
crswlk = dict(zip(cw[right_on], cw[how] ) )
dtype = df[left_on].dtype
if dtype =='object': df[how] = df.apply(lambda row: crswlk.get(str(row[left_on]), "empty"), axis=1)
elif dtype == 'int64':
df[how] = df.apply(lambda row: crswlk.get(int(row[left_on]), "empty"), axis=1)
return df
mergeType = how in ['left', 'right', 'outer', 'inner']
if mergeType: return merge(df, cw, left_on, right_on, how)
else: return pull(df, cw, left_on, right_on, how)
# 2iiii. Filter between matched records and not.
def filterEmpties(df, cw, left_on, right_on, how, interactive):
if how in ['left', 'right', 'outer', 'inner']: how = right_on
nomatch = df.loc[df[how] == 'empty']
nomatch = nomatch.sort_values(by=left_on, ascending=True)
if nomatch.shape[0] > 0:
# Do the same thing with our foreign tracts
if(interactive):
print('\n Local Column Values Not Matched ')
print(nomatch[left_on].unique() )
print(len(nomatch[left_on]))
print('\n Crosswalk Unique Column Values')
print(cw[right_on].unique() )
# Create a new column with the tracts value mapped to its corresponding value from the crossswalk
df[how].replace('empty', np.nan, inplace=True)
df.dropna(subset=[how], inplace=True)
# cw = cw.sort_values(by=how, ascending=True)
return df
# 0. Retrieve the left and right datasets.
if (interactive): print('---Handling Left Dataset Options---')
left_ds, left_col = Intake.getAndCheck(left_ds, left_col, interactive)
if (interactive): print('Left column:', left_col)
if (interactive): print('\n---Handling Right Dataset Options---')
right_ds, right_col = Intake.getAndCheck(right_ds, right_col, interactive)
if (interactive): print('Right column:', right_col)
if (interactive): print(f"\n---Ensuring Compatibility Between merge_how (val: '{merge_how}') and the Right Dataset---")
merge_how = checkMergeHow(right_ds, merge_how, interactive)
if (interactive): print("Column or ['inner','left','right','outer'] value: ", merge_how)
# 1. Retrieve the crosswalk dataset: check left-cw, right-cw. Try coercing.
if (interactive): print(f'\n---Checking Crosswalk Dataset Options---')
# Only prompt for a crosswalk if one was not already supplied as a DataFrame.
if (not Intake.isPandas(crosswalk_ds)):
    default = str(crosswalk_ds).lower() == 'false'
    # If the user left the default crosswalk value 'False', ask whether they want to provide one.
    if (default and interactive): crosswalk_ds = input("\nProvide a Crosswalk? ( URL/ PATH or 'NO'/ / 'FALSE' ) ") or False
    # Check whether the user opted not to use a crosswalk.
    use_crosswalk = not (str(crosswalk_ds).lower() in ["no", '', 'none', 'false'])
    if (use_crosswalk):
        crosswalk_ds, crosswalk_left_col = Intake.getAndCheck(crosswalk_ds, crosswalk_left_col, interactive)
        crosswalk_ds, crosswalk_right_col = Intake.getAndCheck(crosswalk_ds, crosswalk_right_col, interactive)
# 3. Coerce all datasets for the merge.
if (Intake.isPandas(crosswalk_ds)):
    if (interactive): print('crosswalk_left_col', crosswalk_left_col)
    left_ds, crosswalk_ds, crosswalk_left_col, status = coerceForMerge('Left->Crosswalk', left_ds, crosswalk_ds, left_col, crosswalk_left_col, interactive)
    right_ds, crosswalk_ds, crosswalk_right_col, status = coerceForMerge('Right->Crosswalk', right_ds, crosswalk_ds, right_col, crosswalk_right_col, interactive)
else:
    left_ds, right_ds, right_col, status = coerceForMerge('Left->Right', left_ds, right_ds, left_col, right_col, interactive)
if (interactive): print('\n---All checks complete. Status: ', status, '---\n')
if (not status):
    if (interactive): print('Merge Incomplete. Thank you!')
    return False
else:
    # With a crosswalk: merge left->crosswalk first, then use the crosswalk's right column as the new merge key.
    if (Intake.isPandas(crosswalk_ds)):
        left_ds = mergeAndFilter('LEFT->CROSSWALK', left_ds, crosswalk_ds, left_col, crosswalk_left_col, crosswalk_right_col, interactive)
        left_col = crosswalk_right_col
    left_ds = mergeAndFilter('LEFT->RIGHT', left_ds, right_ds, left_col, right_col, merge_how, interactive)
    return left_ds

Function Explanation
Input(s):
- Left dataset (URL, path, or DataFrame) and the column to merge on
- Right dataset (URL, path, or DataFrame) and the column to merge on
- An optional crosswalk dataset with its left and right columns
- How (a join type, or the name of a single column to pull)
- An interactive flag
Output: The merged DataFrame (or False if the checks fail)
How it works:
- Read in the datasets
- Perform the merge (see the sketch below):
- If the 'how' parameter is one of ['left', 'right', 'outer', 'inner'], then a true merge is performed.
- If a column name is provided in the 'how' parameter, then that single column is pulled from the right dataset as a new column in the left dataset.
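To make the two paths concrete, here is a minimal, self-contained sketch of what each branch amounts to. The toy datasets are hypothetical and this is not the library's own code; it only mirrors the logic of mergeOrPull above:

```python
import pandas as pd

# Hypothetical toy datasets standing in for left_ds and right_ds.
left = pd.DataFrame({'tract': [101, 102, 103], 'income': [38358, 44904, 35707]})
right = pd.DataFrame({'tract': [101, 102], 'CSA2010': ['Canton', 'Belair-Edison']})

# Path 1 -- 'how' is a join type: a true merge brings over every right column.
merged = pd.merge(left, right, left_on='tract', right_on='tract', how='left')

# Path 2 -- 'how' is a column name: only that column is pulled across.
crswlk = dict(zip(right['tract'], right['CSA2010']))
left['CSA2010'] = left['tract'].map(lambda t: crswlk.get(t, 'empty'))
```

In the pull case, unmatched rows receive the placeholder 'empty', which is exactly what filterEmpties later strips out.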
Function Diagrams
- Diagram of mergeDatasets() (image)
- mergeDatasets() Flow Chart (image)
- Gantt Chart of mergeDatasets() (image)
- Sequence Diagram of mergeDatasets() (image)
Function Examples
# from VitalSigns.acsDownload import retrieve_acs_data
# from dataplay.geoms import readInGeometryData

Interactive Example 1. Merge Esri Data
# Table: Household Childhood Poverty
# Hhchpov = Intake.getData("https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson", interactive=True)
# Hhchpov = Hhchpov[['CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17', 'hhchpov18']]
left_ds = "https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson"
left_col = 'CSA2010'
# Table: Household Poverty
# Hhpov = Intake.getData("https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson", interactive=True)
# Hhpov = Hhpov[['CSA2010', 'hhpov15', 'hhpov16', 'hhpov17', 'hhpov18']]
right_ds = "https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson"
right_col='CSA2010'
merge_how = 'outer'
interactive = False
merged_df = mergeDatasets(left_ds=left_ds, right_ds=right_ds, crosswalk_ds='no',
left_col=left_col, right_col=right_col,
crosswalk_left_col = False, crosswalk_right_col = False,
merge_how=merge_how, # left right or columnname to retrieve
interactive=interactive)
merged_df.head()
| | OBJECTID_x | CSA2010 | hhchpov15 | hhchpov16 | hhchpov17 | hhchpov18 | hhchpov19 | Shape__Area_x | Shape__Length_x | geometry_x | OBJECTID_y | hhpov15 | hhpov16 | hhpov17 | hhpov18 | hhpov19 | Shape__Area_y | Shape__Length_y | geometry_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Allendale/Irving... | 38.93 | 34.73 | 32.77 | 35.27 | 32.60 | 6.38e+07 | 38770.17 | POLYGON ((-76.65... | 1 | 24.15 | 21.28 | 20.70 | 23.00 | 19.18 | 6.38e+07 | 38770.17 | POLYGON ((-76.65... |
| 1 | 2 | Beechfield/Ten H... | 19.42 | 21.22 | 23.92 | 21.90 | 15.38 | 4.79e+07 | 37524.95 | POLYGON ((-76.69... | 2 | 11.17 | 11.59 | 10.47 | 10.90 | 8.82 | 4.79e+07 | 37524.95 | POLYGON ((-76.69... |
| 2 | 3 | Belair-Edison | 36.88 | 36.13 | 34.56 | 39.74 | 41.04 | 4.50e+07 | 31307.31 | POLYGON ((-76.56... | 3 | 18.61 | 19.59 | 20.27 | 22.83 | 22.53 | 4.50e+07 | 31307.31 | POLYGON ((-76.56... |
| 3 | 4 | Brooklyn/Curtis ... | 45.01 | 46.45 | 46.41 | 39.89 | 41.39 | 1.76e+08 | 150987.70 | MULTIPOLYGON (((... | 4 | 28.36 | 26.33 | 24.21 | 21.54 | 24.60 | 1.76e+08 | 150987.70 | MULTIPOLYGON (((... |
| 4 | 5 | Canton | 5.49 | 2.99 | 4.02 | 4.61 | 4.83 | 1.54e+07 | 23338.61 | POLYGON ((-76.57... | 5 | 3.00 | 2.26 | 3.66 | 2.05 | 2.22 | 1.54e+07 | 23338.61 | POLYGON ((-76.57... |
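One thing to notice in the result above: because both tables share several column names (OBJECTID, Shape__Area, Shape__Length, geometry), pandas disambiguates them with its default _x/_y suffixes. When calling pd.merge directly you can choose clearer suffixes. A small illustration, with hypothetical one-row stand-ins for the two tables merged above:

```python
import pandas as pd

# Hypothetical single-row stand-ins for the two poverty tables.
hhchpov = pd.DataFrame({'CSA2010': ['Canton'], 'OBJECTID': [5], 'hhchpov18': [4.61]})
hhpov = pd.DataFrame({'CSA2010': ['Canton'], 'OBJECTID': [5], 'hhpov18': [2.05]})

# suffixes= controls how overlapping (non-key) column names are renamed.
out = pd.merge(hhchpov, hhpov, on='CSA2010', how='outer', suffixes=('_chpov', '_pov'))
# -> columns: CSA2010, OBJECTID_chpov, hhchpov18, OBJECTID_pov, hhpov18
```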
Example 2) Get CSA and Geometry with a Crosswalk using 3 links
# Our download function will use Baltimore City's tract, county and state as internal parameters
# Changing these values in the cell below to different geographic reference codes will change those parameters
tract = '*'
county = '510' # '059' # 153 '510'
state = '24' #51
# Specify the download parameters the function will receive here
tableId = 'B19049' # 'B19001'
year = '17'
saveAcs = False

import pandas as pd
import IPython
# from IPython.core.display import HTML
IPython.core.display.HTML("")
# state, county, tract, tableId, year, saveOriginal, save
left_df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)
left_df.head(1)

Number of Columns 5
| NAME | B19049_001E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Total | B19049_002E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_under_25_years | B19049_003E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_25_to_44_years | B19049_004E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_45_to_64_years | B19049_005E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_65_years_and_over | state | county | tract |
|---|---|---|---|---|---|---|---|---|
| Census Tract 2710.02 | 38358 | -666666666 | 34219 | 40972 | 37143 | 24 | 510 | 271002 |
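A quick aside on the -666666666 in the output above: that value is the Census Bureau's sentinel for an estimate that could not be computed, so it is worth converting to a proper missing value before doing any arithmetic. A minimal cleanup step for left_df from the cell above (handling both the numeric and the string form, since the API sometimes returns strings):

```python
import numpy as np

# Swap the ACS 'estimate unavailable' sentinel for NaN.
left_df = left_df.replace([-666666666, '-666666666'], np.nan)
```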
# Table: ACS Median Household Income (left_df, retrieved above)
# Merge column: Census Tract
left_ds = left_df
left_col = 'tract'
# Table: Crosswalk Census Communities
# 'TRACT2010', 'GEOID2010', 'CSA2010'
crosswalk_ds = 'https://raw.githubusercontent.com/bniajfi/bniajfi/main/CSA-to-Tract-2010.csv'
crosswalk_left_col = 'TRACTCE10'
crosswalk_right_col = 'CSA2010'
# Table: Household Poverty (Hhpov)
right_ds = 'https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson'
right_col = 'CSA2010'
interactive = False
merge_how = 'outer'
merged_df_geom = mergeDatasets(left_ds=left_ds, right_ds=right_ds, crosswalk_ds=crosswalk_ds,
left_col=left_col, right_col=right_col,
crosswalk_left_col = crosswalk_left_col, crosswalk_right_col = crosswalk_right_col,
merge_how=merge_how, # left right or columnname to retrieve
interactive=interactive)
merged_df_geom.head()
| | B19049_001E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Total | B19049_002E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_under_25_years | B19049_003E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_25_to_44_years | B19049_004E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_45_to_64_years | B19049_005E_Median_household_income_in_the_past_12_months_(in_2017_inflation-adjusted_dollars)_--_Householder_65_years_and_over | state | county | tract | CSA2010 | OBJECTID | hhpov15 | hhpov16 | hhpov17 | hhpov18 | hhpov19 | Shape__Area | Shape__Length | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38358 | -666666666 | 34219 | 40972 | 37143 | 24 | 510 | 271002 | Greater Govans | 20.0 | 21.32 | 19.27 | 19.53 | 17.99 | 20.50 | 2.27e+07 | 22982.13 | POLYGON ((-76.59... |
| 1 | 44904 | -666666666 | 51324 | 42083 | 37269 | 24 | 510 | 90100 | Greater Govans | 20.0 | 21.32 | 19.27 | 19.53 | 17.99 | 20.50 | 2.27e+07 | 22982.13 | POLYGON ((-76.59... |
| 2 | 35707 | 2499 | 42292 | 37361 | 29191 | 24 | 510 | 271001 | Greater Govans | 20.0 | 21.32 | 19.27 | 19.53 | 17.99 | 20.50 | 2.27e+07 | 22982.13 | POLYGON ((-76.59... |
| 3 | 42231 | -666666666 | 46467 | 45484 | 18750 | 24 | 510 | 260402 | Claremont/Armistead | 9.0 | 21.27 | 23.59 | 24.00 | 24.64 | 23.88 | 6.12e+07 | 40104.42 | POLYGON ((-76.52... |
| 4 | 31657 | 20800 | 26074 | 60959 | 37396 | 24 | 510 | 260403 | Claremont/Armistead | 9.0 | 21.27 | 23.59 | 24.00 | 24.64 | 23.88 | 6.12e+07 | 40104.42 | POLYGON ((-76.52... |
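Under the hood, the crosswalk run above is just the pull path followed by an ordinary merge: tract IDs are mapped to CSA names through a dictionary built from the crosswalk, and the resulting CSA column becomes the key for the left-right merge. A condensed sketch of that two-step flow, using the real column names but hypothetical toy rows rather than the real crosswalk:

```python
import pandas as pd

# Toy crosswalk and datasets; real column names, hypothetical rows.
cw = pd.DataFrame({'TRACTCE10': [271002, 90100], 'CSA2010': ['Greater Govans', 'Greater Govans']})
left = pd.DataFrame({'tract': [271002, 90100], 'income': [38358, 44904]})
right = pd.DataFrame({'CSA2010': ['Greater Govans'], 'hhpov18': [17.99]})

# Step 1 (LEFT->CROSSWALK): pull CSA2010 onto the left dataset.
lookup = dict(zip(cw['TRACTCE10'], cw['CSA2010']))
left['CSA2010'] = left['tract'].map(lambda t: lookup.get(t, 'empty'))

# Step 2 (LEFT->RIGHT): a standard merge on the shared CSA2010 key.
final = pd.merge(left, right, on='CSA2010', how='outer')
```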
Here we can save the data so that it may be used in later tutorials.
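The commented cell below uses csv.QUOTE_ALL, which lives in Python's standard csv module, so it needs an import to run. A runnable version might look like this (saving merged_df_geom from Example 2; the filename is arbitrary):

```python
import csv

# Quote every field so commas inside CSA names cannot break the file.
out_name = 'test_save_data_with_geom_and_csa'
merged_df_geom.to_csv(out_name + '.csv', encoding='utf-8',
                      index=False, quoting=csv.QUOTE_ALL)
```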
# string = 'test_save_data_with_geom_and_csa'
# merged_df.to_csv(string+'.csv', encoding="utf-8", index=False, quoting=csv.QUOTE_ALL)

Example 3: Ran Alone
mergeDatasets().head(1)

---Handling Left Dataset Options---
Getting Data From: False
Error: Try Again? ( URL/ PATH or 'NO'/ ) https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
Getting Data From: https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
Invalid column given: False
Index(['OBJECTID', 'CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17',
'hhchpov18', 'hhchpov19', 'Shape__Area', 'Shape__Length', 'geometry'],
dtype='object')
Please enter a new column from the list above.
Column Name: CSA2010
Left column: CSA2010
---Handling Right Dataset Options---
Getting Data From: False
Error: Try Again? ( URL/ PATH or 'NO'/ ) https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
Getting Data From: https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
Invalid column given: False
Index(['OBJECTID', 'CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17',
'hhchpov18', 'hhchpov19', 'Shape__Area', 'Shape__Length', 'geometry'],
dtype='object')
Please enter a new column from the list above.
Column Name: CSA2010
Right column: CSA2010
---Ensuring Compatibility Between merge_how (val: 'False') and the Right Dataset---
Invalid merge column given.
Please select a value from either list
1) Pull A single column from the right dataset: Index(['OBJECTID', 'CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17',
'hhchpov18', 'hhchpov19', 'Shape__Area', 'Shape__Length', 'geometry'],
dtype='object')
OR
2) Specify a type of join operation: (‘left’, ‘right’, ‘outer’, ‘inner’, columnName)
Column Name: inner
Column or ['inner','left','right','outer'] value: inner
---Checking Crosswalk Dataset Options---
Provide a Crosswalk? ( URL/ PATH or 'NO'/ / 'FALSE' )
---Casting Datatypes from-to: Left->Right Datasets---
Before Casting:
-> Column One: CSA2010 object
-> Column Two: CSA2010 object
After Casting:
-> Column One: CSA2010 object
-> Column Two: CSA2010 object
---All checks complete. Status: True ---
---PERFORMING MERGE : LEFT->RIGHT---
Column One : CSA2010 object
How: inner
Column Two : CSA2010 object
| | OBJECTID_x | CSA2010 | hhchpov15_x | hhchpov16_x | hhchpov17_x | hhchpov18_x | hhchpov19_x | Shape__Area_x | Shape__Length_x | geometry_x | OBJECTID_y | hhchpov15_y | hhchpov16_y | hhchpov17_y | hhchpov18_y | hhchpov19_y | Shape__Area_y | Shape__Length_y | geometry_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Allendale/Irving... | 38.93 | 34.73 | 32.77 | 35.27 | 32.6 | 6.38e+07 | 38770.17 | POLYGON ((-76.65... | 1 | 38.93 | 34.73 | 32.77 | 35.27 | 32.6 | 6.38e+07 | 38770.17 | POLYGON ((-76.65... |
mergeDatasets().head(1)

---Handling Left Dataset Options---
Getting Data From: False
Error: Try Again? ( URL/ PATH or 'NO'/ ) https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
Getting Data From: https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
Invalid column given: False
Index(['OBJECTID', 'CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17',
'hhchpov18', 'hhchpov19', 'Shape__Area', 'Shape__Length', 'geometry'],
dtype='object')
Please enter a new column from the list above.
Column Name: CSA2010
Left column: CSA2010
---Handling Right Dataset Options---
Getting Data From: False
Error: Try Again? ( URL/ PATH or 'NO'/ ) https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
Getting Data From: https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
Invalid column given: False
Index(['OBJECTID', 'CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17',
'hhchpov18', 'hhchpov19', 'Shape__Area', 'Shape__Length', 'geometry'],
dtype='object')
Please enter a new column from the list above.
Column Name: CSA2010
Right column: CSA2010
---Ensuring Compatibility Between merge_how (val: 'False') and the Right Dataset---
Invalid merge column given.
Please select a value from either list
1) Pull A single column from the right dataset: Index(['OBJECTID', 'CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17',
'hhchpov18', 'hhchpov19', 'Shape__Area', 'Shape__Length', 'geometry'],
dtype='object')
OR
2) Specify a type of join operation: (‘left’, ‘right’, ‘outer’, ‘inner’, columnName)
Column Name: inner
Column or ['inner','left','right','outer'] value: inner
---Checking Crosswalk Dataset Options---
Provide a Crosswalk? ( URL/ PATH or 'NO'/ / 'FALSE' ) https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
Getting Data From: https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
Invalid column given: False
Index(['OBJECTID', 'CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17',
'hhchpov18', 'hhchpov19', 'Shape__Area', 'Shape__Length', 'geometry'],
dtype='object')
Please enter a new column from the list above.
Column Name: CSA2010
Invalid column given: False
Index(['OBJECTID', 'CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17',
'hhchpov18', 'hhchpov19', 'Shape__Area', 'Shape__Length', 'geometry'],
dtype='object')
Please enter a new column from the list above.
Column Name: CSA2010
crosswalk_left_col CSA2010
---Casting Datatypes from-to: Left->Crosswalk Datasets---
Before Casting:
-> Column One: CSA2010 object
-> Column Two: CSA2010 object
After Casting:
-> Column One: CSA2010 object
-> Column Two: CSA2010 object
---Casting Datatypes from-to: Right->Crosswalk Datasets---
Before Casting:
-> Column One: CSA2010 object
-> Column Two: CSA2010 object
After Casting:
-> Column One: CSA2010 object
-> Column Two: CSA2010 object
---All checks complete. Status: True ---
---PERFORMING MERGE : LEFT->CROSSWALK---
Column One : CSA2010 object
How: CSA2010
Column Two : CSA2010 object
---PERFORMING MERGE : LEFT->RIGHT---
Column One : CSA2010 object
How: inner
Column Two : CSA2010 object
| | OBJECTID_x | CSA2010 | hhchpov15_x | hhchpov16_x | hhchpov17_x | hhchpov18_x | hhchpov19_x | Shape__Area_x | Shape__Length_x | geometry_x | OBJECTID_y | hhchpov15_y | hhchpov16_y | hhchpov17_y | hhchpov18_y | hhchpov19_y | Shape__Area_y | Shape__Length_y | geometry_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Allendale/Irving... | 38.93 | 34.73 | 32.77 | 35.27 | 32.6 | 6.38e+07 | 38770.17 | POLYGON ((-76.65... | 1 | 38.93 | 34.73 | 32.77 | 35.27 | 32.6 | 6.38e+07 | 38770.17 | POLYGON ((-76.65... |