---
title: Merge Data
keywords: fastai
sidebar: home_sidebar
summary: "This notebook was made to demonstrate how to merge datasets by matching a single column's values from two datasets. We add columns of data from a foreign dataset into the ACS data we downloaded in our last tutorial."
description: "This notebook was made to demonstrate how to merge datasets by matching a single column's values from two datasets. We add columns of data from a foreign dataset into the ACS data we downloaded in our last tutorial."
nb_path: "notebooks/02_Merge_Data.ipynb"
---

This Coding Notebook is the second in a series.

An interactive version can be found here: Open In Colab.

This colab and more can be found on our webpage.

  • Content covered in previous tutorials will be used in later tutorials.

  • New code and/or information should have explanations and/or descriptions attached.

  • Concepts or code covered in previous tutorials will be used without being explained in their entirety.

  • The Dataplay Handbook uses development techniques covered in the Datalabs Guidebook.

  • If content cannot be found in the current tutorial and is not covered in previous tutorials, please let me know.

  • This notebook has been optimized for Google Colab running in a Chrome browser.

  • Statements found on the index page regarding views expressed, responsibility, errors and omissions, use at risk, and licensing extend throughout the tutorial.


About this Tutorial:

What's Inside?

The Tutorial

In this notebook, the basics of how to perform a merge are introduced.

  • We will merge two datasets
  • We will merge two datasets using a crosswalk

Objectives

By the end of this tutorial users should have an understanding of:

  • How dataset merges are performed
  • The different types of union approaches a merge can take
  • The 'mergeDatasets' function, and how to use it in the future

Guided Walkthrough

SETUP

Install these libraries onto the virtual environment.

{% raw %}
%%capture
!pip install geopandas
!pip install dataplay
{% endraw %}

Retrieve Datasets

Our example will merge two simple datasets, pulling CSA names using tract IDs.

The first dataset will be obtained from the Census' ACS 5-year surveys.

Functions used to obtain this data were obtained from Tutorial 0) ACS: Explore and Download.

The second dataset is from a publicly accessible link.

Get the Principal dataset.

We will use the function we created in our last tutorial to download the data!

{% raw %}
# Changing these geographic reference codes changes which area is downloaded
tract = '*'
county = '510'
state = '24'

# Specify the download parameters the function will receive here
tableId = 'B19001'
year = '17'
saveAcs = False
{% endraw %} {% raw %}
df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)
df.head()
Number of Columns 17
B19001_001E_Total B19001_002E_Total_Less_than_$10,000 B19001_003E_Total_$10,000_to_$14,999 B19001_004E_Total_$15,000_to_$19,999 B19001_005E_Total_$20,000_to_$24,999 B19001_006E_Total_$25,000_to_$29,999 B19001_007E_Total_$30,000_to_$34,999 B19001_008E_Total_$35,000_to_$39,999 B19001_009E_Total_$40,000_to_$44,999 B19001_010E_Total_$45,000_to_$49,999 B19001_011E_Total_$50,000_to_$59,999 B19001_012E_Total_$60,000_to_$74,999 B19001_013E_Total_$75,000_to_$99,999 B19001_014E_Total_$100,000_to_$124,999 B19001_015E_Total_$125,000_to_$149,999 B19001_016E_Total_$150,000_to_$199,999 B19001_017E_Total_$200,000_or_more state county tract
NAME
Census Tract 2710.02 1510 209 73 94 97 110 119 97 65 36 149 168 106 66 44 50 27 24 510 271002
Census Tract 2604.02 1134 146 29 73 80 41 91 49 75 81 170 57 162 63 11 6 0 24 510 260402
Census Tract 2712 2276 69 43 41 22 46 67 0 30 30 80 146 321 216 139 261 765 24 510 271200
Census Tract 2804.04 961 111 108 61 42 56 37 73 30 31 106 119 74 23 27 24 39 24 510 280404
Census Tract 901 1669 158 124 72 48 108 68 121 137 99 109 191 160 141 28 88 17 24 510 90100
{% endraw %}

Get the Secondary Dataset

Spatial data can be obtained by using the 2010 Census Tract Shapefile Picking Tool or by searching the Census website for TIGER/Line Shapefiles.

"The core TIGER/Line Files and Shapefiles do not include demographic data, but they do contain geographic entity codes (GEOIDs) that can be linked to the Census Bureau’s demographic data, available on data.census.gov." (Source: census.gov)
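To make the GEOID link concrete, here is a minimal sketch of how an 11-digit tract GEOID can be assembled from the `state`, `county`, and `tract` columns that the ACS download returns. The sample rows and the `build_geoid` helper are hypothetical and written only for illustration; the widths used (2-digit state, 3-digit county, 6-digit tract) match the `GEOID10` values shown later in the crosswalk, such as 24510010100.

```python
import pandas as pd

# Hypothetical rows shaped like the ACS download (state, county, tract codes).
acs = pd.DataFrame({
    "state": [24, 24],
    "county": [510, 510],
    "tract": [10100, 271002],
})

def build_geoid(frame):
    """Zero-pad each code to its fixed width and join them into a tract GEOID.

    Hypothetical helper: 2-digit state + 3-digit county + 6-digit tract.
    """
    return (
        frame["state"].astype(str).str.zfill(2)
        + frame["county"].astype(str).str.zfill(3)
        + frame["tract"].astype(str).str.zfill(6)
    )

acs["GEOID"] = build_geoid(acs)
print(acs)
```

Building the GEOID as a string keeps any leading zeros that an integer column would drop.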

For this example, we will simply pull a local dataset whose columns label the tracts within Baltimore City and their corresponding CSA (Community Statistical Area). Typically, we use this dataset internally as a "crosswalk": after a successful merge on the tract column, the result can be merged with a third dataset along its CSA column.
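The two-step crosswalk idea can be sketched with plain pandas as follows. All three tables here are small, made-up stand-ins (the `households` and `indicator` values are invented); only the column names `TRACTCE10` and `CSA2010` mirror the real crosswalk file.

```python
import pandas as pd

# Hypothetical tract-level data (values are invented for illustration).
tracts = pd.DataFrame({"tract": [10100, 10300, 10500],
                       "households": [100, 200, 300]})

# A few rows shaped like CSA-to-Tract-2010.csv.
crosswalk = pd.DataFrame({"TRACTCE10": [10100, 10300, 10500],
                          "CSA2010": ["Canton", "Canton", "Fells Point"]})

# Hypothetical CSA-level table playing the role of the "third dataset".
csa_stats = pd.DataFrame({"CSA2010": ["Canton", "Fells Point"],
                          "indicator": [1.5, 2.5]})

# Step 1: attach the CSA name to each tract through the crosswalk.
with_csa = tracts.merge(crosswalk, left_on="tract", right_on="TRACTCE10", how="left")

# Step 2: use the CSA name to pull in the CSA-level columns.
result = with_csa.merge(csa_stats, on="CSA2010", how="left")
print(result[["tract", "CSA2010", "indicator"]])
```

The crosswalk is needed because the first and third tables share no column directly; it supplies the bridge between tract codes and CSA names.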

{% raw %}
!curl https://bniajfi.org/vs_resources/CSA-to-Tract-2010.csv	> CSA-to-Tract-2010.csv
{% endraw %} {% raw %}
print('Boundaries Example:CSA-to-Tract-2010.csv')
Boundaries Example:CSA-to-Tract-2010.csv
{% endraw %} {% raw %}
import pandas as pd

# Our example dataset is a crosswalk that labels each tract with its CSA.
# We want to merge it into our principal dataset by matching on the tract column.

# The file was downloaded from a public URL in the cell above.

print('Tract 2 CSA Crosswalk : CSA-to-Tract-2010.csv')

inFile = input("\n Please enter the location of your file : \n" )

crosswalk = pd.read_csv( inFile ) 
crosswalk.head()
Tract 2 CSA Crosswalk : CSA-to-Tract-2010.csv

 Please enter the location of your file : 
CSA-to-Tract-2010.csv
   TRACTCE10      GEOID10              CSA2010
0      10100  24510010100               Canton
1      10200  24510010200  Patterson Park N...
2      10300  24510010300               Canton
3      10400  24510010400               Canton
4      10500  24510010500          Fells Point
{% endraw %} {% raw %}
crosswalk.columns
Index(['TRACTCE10', 'GEOID10', 'CSA2010'], dtype='object')
{% endraw %}

Perform Merge & Save

As a friendly reminder, here are the 4 basic join types:

  • Left - Returns all left records, and only includes a right record if it has a match
  • Right - Returns all right records, and only includes a left record if it has a match
  • Full (called 'outer' in pandas) - Returns all records regardless of whether the keys match
  • Inner - Returns only records where the keys match
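The four join types listed above can be compared on two tiny, made-up tables. This is only an illustrative sketch: `left`, `right`, and their values are hypothetical, while `pd.merge` and its `how` argument are the same ones used later in this tutorial.

```python
import pandas as pd

# Keys 2 and 3 appear in both tables; 1 is left-only and 4 is right-only.
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

row_counts = {}
for how in ["left", "right", "outer", "inner"]:
    merged = pd.merge(left, right, on="key", how=how)
    row_counts[how] = len(merged)
    # Show which keys survive each join type.
    print(how, sorted(merged["key"].tolist()))
```

Only the outer join keeps every key, and only the inner join drops both unmatched keys.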

Get Columns from both datasets to match on

You can get these values from the column names listed above.

Our examples work with the values suggested in the prompts.

{% raw %}
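print('Principal Columns ' + str(df.columns))
left_on = input("Left on principal column: ('tract') \n")
print(' \n ')
print('Crosswalk Columns ' + str(crosswalk.columns))
right_on = input("Right on crosswalk column: ('TRACTCE10') \n")
Principal Columns Index(['B19001_001E_Total', 'B19001_002E_Total_Less_than_$10,000',
       'B19001_003E_Total_$10,000_to_$14,999',
       'B19001_004E_Total_$15,000_to_$19,999',
       'B19001_005E_Total_$20,000_to_$24,999',
       'B19001_006E_Total_$25,000_to_$29,999',
       'B19001_007E_Total_$30,000_to_$34,999',
       'B19001_008E_Total_$35,000_to_$39,999',
       'B19001_009E_Total_$40,000_to_$44,999',
       'B19001_010E_Total_$45,000_to_$49,999',
       'B19001_011E_Total_$50,000_to_$59,999',
       'B19001_012E_Total_$60,000_to_$74,999',
       'B19001_013E_Total_$75,000_to_$99,999',
       'B19001_014E_Total_$100,000_to_$124,999',
       'B19001_015E_Total_$125,000_to_$149,999',
       'B19001_016E_Total_$150,000_to_$199,999',
       'B19001_017E_Total_$200,000_or_more', 'state', 'county', 'tract'],
      dtype='object')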
Left on principal column: ('tract') 
tract
 
 
Crosswalk Columns Index(['TRACTCE10', 'GEOID10', 'CSA2010'], dtype='object')
Right on crosswalk column: ('TRACTCE10') 
TRACTCE10
{% endraw %}

Specify how the merge will be performed

A left merge returns our principal dataset with columns from the second dataset appended to records where their specified columns match.

In the run recorded below, 'outer' is entered, which additionally keeps crosswalk records that have no match in the principal dataset.

{% raw %}
how = input("How: (‘left’, ‘right’, ‘outer’, ‘inner’) " )
How: (‘left’, ‘right’, ‘outer’, ‘inner’) outer
{% endraw %}

Perform the merge

{% raw %}
merged_df = pd.merge(df, crosswalk, left_on=left_on, right_on=right_on, how=how)
merged_df = merged_df.drop(left_on, axis=1)
merged_df.head()
B19001_001E_Total B19001_002E_Total_Less_than_$10,000 B19001_003E_Total_$10,000_to_$14,999 B19001_004E_Total_$15,000_to_$19,999 B19001_005E_Total_$20,000_to_$24,999 B19001_006E_Total_$25,000_to_$29,999 B19001_007E_Total_$30,000_to_$34,999 B19001_008E_Total_$35,000_to_$39,999 B19001_009E_Total_$40,000_to_$44,999 B19001_010E_Total_$45,000_to_$49,999 B19001_011E_Total_$50,000_to_$59,999 B19001_012E_Total_$60,000_to_$74,999 B19001_013E_Total_$75,000_to_$99,999 B19001_014E_Total_$100,000_to_$124,999 B19001_015E_Total_$125,000_to_$149,999 B19001_016E_Total_$150,000_to_$199,999 B19001_017E_Total_$200,000_or_more state county TRACTCE10 GEOID10 CSA2010
0 1510 209 73 94 97 110 119 97 65 36 149 168 106 66 44 50 27 24 510 271002.0 2.45e+10 Greater Govans
1 1134 146 29 73 80 41 91 49 75 81 170 57 162 63 11 6 0 24 510 260402.0 2.45e+10 Claremont/Armistead
2 2276 69 43 41 22 46 67 0 30 30 80 146 321 216 139 261 765 24 510 271200.0 2.45e+10 North Baltimore/...
3 961 111 108 61 42 56 37 73 30 31 106 119 74 23 27 24 39 24 510 280404.0 2.45e+10 Allendale/Irving...
4 1669 158 124 72 48 108 68 121 137 99 109 191 160 141 28 88 17 24 510 90100.0 2.45e+10 Greater Govans
{% endraw %}

As you can see, our Census data now has a CSA appended to it.
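Before saving, it can be worth checking which records actually found a match. The sketch below uses the `indicator=True` option of `pd.merge` for that, and first coerces the key column so both sides share a dtype. The two small frames are hypothetical, including the deliberately unmatched tract 999999.

```python
import pandas as pd

# Hypothetical principal data whose tract codes were read in as strings.
principal = pd.DataFrame({"tract": ["10100", "10500", "999999"],
                          "total": [10, 20, 30]})
crosswalk = pd.DataFrame({"TRACTCE10": [10100, 10500],
                          "CSA2010": ["Canton", "Fells Point"]})

# Keys of different dtypes (str vs int) cannot be matched, so coerce first.
principal["tract"] = principal["tract"].astype("int64")

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
checked = pd.merge(principal, crosswalk, left_on="tract", right_on="TRACTCE10",
                   how="left", indicator=True)

# List the principal records that found no partner in the crosswalk.
unmatched = checked.loc[checked["_merge"] == "left_only", "tract"].tolist()
print(unmatched)
```

Unmatched rows receive NaN in the crosswalk columns, which forces an integer column to float. That is why `TRACTCE10` is displayed as `271002.0` in the merged output above.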

{% raw %}
import csv

outFile = input("Please enter the new Filename to save the data to ('acs_csa_merge_test'): ")
merged_df.to_csv(outFile+'.csv', quoting=csv.QUOTE_ALL)
Please enter the new Filename to save the data to ('acs_csa_merge_test'): acs_csa_merge_test
{% endraw %}

Final Result

{% raw %}
flag = input("Enter a URL? If not ACS data will be used. (Y/N):  " )
if (flag == 'y' or flag == 'Y'):
  df = pd.read_csv( input("Please enter the location of your Principal file: " ) )
else:
  tract = input("Please enter tract id (*): " )
  county = input("Please enter county id (510): " )
  state = input("Please enter state id (24): " )
  tableId = input("Please enter acs table id (B19001): " ) 
  year = input("Please enter acs year (18): " )
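  # Convert the answer to a boolean; a non-empty string such as 'N' would otherwise be truthy.
  saveAcs = input("Save ACS? (Y/N): ").strip().lower() == 'y'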
  df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)

print( 'Principal Columns ' + str(df.columns))

print('Crosswalk Example: CSA-to-Tract-2010.csv')

crosswalk = pd.read_csv( input("Please enter the location of your crosswalk file: " ) )
print( 'Crosswalk Columns ' + str(crosswalk.columns) + '\n')

left_on = input("Left on: " )
right_on = input("Right on: " )
how = input("How: (‘left’, ‘right’, ‘outer’, ‘inner’) " )

merged_df = pd.merge(df, crosswalk, left_on=left_on, right_on=right_on, how=how)
merged_df = merged_df.drop(left_on, axis=1)

# Save the data
saveFile = input("Save File ('Y' or 'N'): ")
if saveFile == 'Y' or saveFile == 'y':
  outFile = input("Saved Filename (Do not include the file extension ): ")
  merged_df.to_csv(outFile+'.csv', quoting=csv.QUOTE_ALL)
Enter a URL? If not ACS data will be used. (Y/N):  n
Please enter tract id (*): *
Please enter county id (510): 510
Please enter state id (24): 24
Please enter acs table id (B19001): B19001
Please enter acs year (18): 18
Save ACS? (Y/N): N
Number of Columns 17
Principal Columns Index(['B19001_001E_Total', 'B19001_002E_Total_Less_than_$10,000',
       'B19001_003E_Total_$10,000_to_$14,999',
       'B19001_004E_Total_$15,000_to_$19,999',
       'B19001_005E_Total_$20,000_to_$24,999',
       'B19001_006E_Total_$25,000_to_$29,999',
       'B19001_007E_Total_$30,000_to_$34,999',
       'B19001_008E_Total_$35,000_to_$39,999',
       'B19001_009E_Total_$40,000_to_$44,999',
       'B19001_010E_Total_$45,000_to_$49,999',
       'B19001_011E_Total_$50,000_to_$59,999',
       'B19001_012E_Total_$60,000_to_$74,999',
       'B19001_013E_Total_$75,000_to_$99,999',
       'B19001_014E_Total_$100,000_to_$124,999',
       'B19001_015E_Total_$125,000_to_$149,999',
       'B19001_016E_Total_$150,000_to_$199,999',
       'B19001_017E_Total_$200,000_or_more', 'state', 'county', 'tract'],
      dtype='object')
Crosswalk Example: CSA-to-Tract-2010.csv
Please enter the location of your crosswalk file: CSA-to-Tract-2010.csv
Crosswalk Columns Index(['TRACTCE10', 'GEOID10', 'CSA2010'], dtype='object')

Left on: tract
Right on: TRACTCE10
How: (‘left’, ‘right’, ‘outer’, ‘inner’) outer
Save File ('Y' or 'N'): n
{% endraw %} {% raw %}
merged_df
B19001_001E_Total B19001_002E_Total_Less_than_$10,000 B19001_003E_Total_$10,000_to_$14,999 B19001_004E_Total_$15,000_to_$19,999 B19001_005E_Total_$20,000_to_$24,999 B19001_006E_Total_$25,000_to_$29,999 B19001_007E_Total_$30,000_to_$34,999 B19001_008E_Total_$35,000_to_$39,999 B19001_009E_Total_$40,000_to_$44,999 B19001_010E_Total_$45,000_to_$49,999 B19001_011E_Total_$50,000_to_$59,999 B19001_012E_Total_$60,000_to_$74,999 B19001_013E_Total_$75,000_to_$99,999 B19001_014E_Total_$100,000_to_$124,999 B19001_015E_Total_$125,000_to_$149,999 B19001_016E_Total_$150,000_to_$199,999 B19001_017E_Total_$200,000_or_more state county TRACTCE10 GEOID10 CSA2010
0 1842 140 71 143 112 169 101 113 116 56 147 158 151 200 8 89 68 24 510 272004.0 2.45e+10 Cross-Country/Ch...
1 1638 246 71 90 104 74 168 115 97 173 115 16 117 64 0 91 97 24 510 120202.0 2.45e+10 Greater Charles ...
2 1252 91 27 55 75 71 59 85 44 47 80 131 108 78 43 104 154 24 510 272005.0 2.45e+10 Cross-Country/Ch...
3 1622 235 177 184 79 67 88 71 64 60 72 155 140 100 77 20 33 24 510 272006.0 2.45e+10 Glen-Fallstaff
4 1775 127 70 190 136 111 114 152 106 11 172 143 193 42 86 53 69 24 510 272007.0 2.45e+10 Glen-Fallstaff
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
196 2143 138 37 76 33 16 30 28 17 64 111 286 181 206 100 234 586 24 510 20300.0 2.45e+10 Fells Point
197 1220 159 93 82 137 51 79 102 79 98 55 140 85 38 22 0 0 24 510 150400.0 2.45e+10 Greater Mondawmin
198 1393 41 12 22 0 10 11 19 85 30 95 128 200 207 240 180 113 24 510 10200.0 2.45e+10 Patterson Park N...
199 761 94 39 39 53 31 20 16 33 27 49 127 50 64 22 49 48 24 510 60400.0 2.45e+10 Oldtown/Middle East
200 238436 27518 15246 13682 11154 11984 12142 10574 10545 8506 19226 20731 25536 17068 10693 11228 12603 24 510 NaN NaN NaN

201 rows × 22 columns

{% endraw %}

Advanced

For this next example to work, we will need to import hypothetical CSV files.

Intro

The following Python function is a bulked-out version of the code above.

  • It contains everything from the tutorial plus more.
  • It can be imported and used in future projects or run on its own.

Description: Adds columns of data from a foreign dataset into a primary dataset along set parameters.

Purpose: Makes merging datasets simple.

Services

  • Merge two datasets without a crosswalk
  • Merge two datasets with a crosswalk
{% raw %}

mergeDatasets

mergeDatasets(left_ds=False, right_ds=False, crosswalk_ds=False, left_col=False, right_col=False, crosswalk_left_col=False, crosswalk_right_col=False, merge_how=False, interactive=True)

{% endraw %}

Function Explanation

Input(s):

  • Dataset URL
  • Crosswalk URL
  • Right On
  • Left On
  • How
  • New Filename

Output: File

How it works:

  • Read in datasets
  • Perform Merge
    • If the 'how' parameter is one of ['left', 'right', 'outer', 'inner'], then a merge will be performed.
    • If a column name is provided in the 'how' parameter, then that single column will be pulled from the right dataset as a new column in the left_ds.
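The two branches described above can be sketched roughly as follows. This is not the dataplay implementation: `merge_or_pull_column` and the sample frames are hypothetical, written only to illustrate how a single 'how' argument could select between a full merge and a single-column lookup.

```python
import pandas as pd

JOIN_TYPES = ["left", "right", "outer", "inner"]

def merge_or_pull_column(left_df, right_df, left_col, right_col, how):
    """Hypothetical helper illustrating the two branches; not the dataplay code."""
    if how in JOIN_TYPES:
        # Branch 1: an ordinary merge that brings over every right-hand column.
        return pd.merge(left_df, right_df, left_on=left_col, right_on=right_col, how=how)
    # Branch 2: treat `how` as the name of one column to pull from the right dataset.
    lookup = right_df.drop_duplicates(subset=right_col).set_index(right_col)[how]
    out = left_df.copy()
    out[how] = out[left_col].map(lookup)
    return out

# Made-up data to exercise both branches.
left = pd.DataFrame({"tract": [1, 2]})
right = pd.DataFrame({"TRACTCE10": [1, 2], "CSA2010": ["A", "B"], "extra": [0, 0]})

pulled = merge_or_pull_column(left, right, "tract", "TRACTCE10", "CSA2010")
full = merge_or_pull_column(left, right, "tract", "TRACTCE10", "inner")
print(pulled)
print(full)
```

Pulling one column keeps the left dataset's shape intact, which can be convenient when only a single label (such as a CSA name) is needed.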

Function Diagrams

Class Diagram mergeDatasets()

{% raw %}
%%html
<img src="https://bniajfi.org/images/mermaid/class_diagram_merge_datasets.PNG">
{% endraw %}

mergeDatasets Flow Chart

{% raw %}
%%html
<img src="https://bniajfi.org/images/mermaid/flow_chart_merge_datasets.PNG">
{% endraw %}

Gantt Chart mergeDatasets()

{% raw %}
%%html
<img src="https://bniajfi.org/images/mermaid/gannt_chart_merge_datasets.PNG">
{% endraw %}

Sequence Diagram mergeDatasets()

{% raw %}
%%html
<img src="https://bniajfi.org/images/mermaid/sequence_diagram_merge_datasets.PNG">
{% endraw %}

Function Examples

Interactive Example 1

{% raw %}
import geopandas as gpd
import numpy as np
import pandas as pd
from dataplay.geoms import readInGeometryData 
Hhchpov = gpd.read_file("https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/1/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson")
Hhchpov = Hhchpov[['CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17', 'hhchpov18']]
Hhchpov.to_csv('Hhchpov.csv')

Hhpov = gpd.read_file("https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhpov/FeatureServer/1/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson")
Hhpov = Hhpov[['CSA2010', 'hhpov15', 'hhpov16', 'hhpov17', 'hhpov18']]
Hhpov.to_csv('Hhpov.csv')
{% endraw %} {% raw %}
Hhchpov.head(1)
               CSA2010  hhchpov15  hhchpov16  hhchpov17  hhchpov18
0  Allendale/Irving...      38.93      34.73      32.77      35.27
{% endraw %} {% raw %}
Hhpov.head(1)
               CSA2010  hhpov15  hhpov16  hhpov17  hhpov18
0  Allendale/Irving...    24.15    21.28     20.7     23.0
{% endraw %} {% raw %}
ls
24510_B19001_5y18_est.csv           CSA-to-Tract-2010.csv  sample_data/
24510_B19001_5y18_est_Original.csv  Hhchpov.csv
acs_csa_merge_test.csv              Hhpov.csv
{% endraw %} {% raw %}
import pandas as pd
import geopandas as gpd
# Left table: Hhpov.csv
# Columns: CSA2010, hhpov15, hhpov16, hhpov17, hhpov18
left_ds = 'Hhpov.csv'
left_col = 'CSA2010'

# Right table: Hhchpov.csv
# Columns: CSA2010, hhchpov15, hhchpov16, hhchpov17, hhchpov18
right_ds = 'Hhchpov.csv'
right_col = 'CSA2010'

merge_how = 'outer'
interactive = True

merged_df = mergeDatasets(left_ds=left_ds, right_ds=right_ds, crosswalk_ds=False,
                  left_col=left_col, right_col=right_col,
                  crosswalk_left_col=False, crosswalk_right_col=False,
                  merge_how=merge_how,  # 'left', 'right', 'outer', 'inner', or a column name to retrieve
                  interactive=interactive)
merged_df.head()
 Loading Left Dataset
getData :  Hhpov.csv

 Loading Right Dataset
getData :  Hhchpov.csv

 Validating the merge_how Parameter

 Import a crosswalk? ('True'/'False') False

 coerceForMerge: Left-Right
cols :  CSA2010 CSA2010
BEFORE COERCE :  object object
AFTER COERCE object object CSA2010
PERFORMING MERGE : LEFT->RIGHT
first_col :  CSA2010 object
how:  outer
second_col :  CSA2010 object
   Unnamed: 0_x              CSA2010  hhpov15  hhpov16  hhpov17  hhpov18  Unnamed: 0_y  hhchpov15  hhchpov16  hhchpov17  hhchpov18
0             0  Allendale/Irving...    24.15    21.28    20.70    23.00             0      38.93      34.73      32.77      35.27
1             1  Beechfield/Ten H...    11.17    11.59    10.47    10.90             1      19.42      21.22      23.92      21.90
2             2        Belair-Edison    18.61    19.59    20.27    22.83             2      36.88      36.13      34.56      39.74
3             3  Brooklyn/Curtis ...    28.36    26.33    24.21    21.54             3      45.01      46.45      46.41      39.89
4             4               Canton     3.00     2.26     3.66     2.05             4       5.49       2.99       4.02       4.61
{% endraw %}

Interactive Example 2

{% raw %}
# Description: I created a public dataset from a Google xlsx sheet 'Bank Addresses and Census Tract' from a workbook of the same name.
# Table: FDIC Baltimore Banks
# Columns: Bank Name, Address(es), Census Tract
left_ds = 'https28768&single=true&output=csv'
left_col = 'Census Tract'

# Alternate Primary Table
# Description: Same workbook, different Sheet: 'Branches per tract' 
# Columns: Census Tract, Number branches per tract
# left_ds = 'https://docssingle=true&output=csv'
# left_col = 'Number branches per tract'

# Crosswalk Table
# Table: Crosswalk Census Communities
# 'TRACT2010', 'GEOID2010', 'CSA2010'
crosswalk_ds = 'https://docs.goot=csv'
use_crosswalk = True
crosswalk_left_col = 'TRACT2010'
crosswalk_right_col = 'GEOID2010'

# Secondary Table
# Table: Baltimore Boundaries
# 'TRACTCE10', 'GEOID10', 'CSA', 'NAME10', 'Tract', 'geometry'
right_ds = 'httpse=true&output=csv'
right_col ='GEOID10'

# Use a join type here, or a column name such as 'geometry' to pull only that column.
merge_how = 'outer'
interactive = True

merged_df_geom = mergeDatasets(left_ds=left_ds, right_ds=right_ds, crosswalk_ds=False,
                  left_col=left_col, right_col=right_col,
                  crosswalk_left_col = False, crosswalk_right_col = False,
                  merge_how=merge_how, # left right or columnname to retrieve
                  interactive=True)

merged_df_geom.head()
{% endraw %}

Here we can save the data so that it may be used in later tutorials.

{% raw %}
import csv

string = 'test_save_data_with_geom_and_csa'
merged_df_geom.to_csv(string+'.csv', encoding="utf-8", index=False, quoting=csv.QUOTE_ALL)
{% endraw %}
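The `index=False` argument used when saving matters for later merges. The sketch below, which writes to in-memory buffers instead of real files, suggests why the Example 1 output contains `Unnamed: 0_x` and `Unnamed: 0_y` columns: a CSV saved with the default settings carries the row index along as an extra unnamed column. The one-row `frame` is a made-up sample.

```python
import csv
import io

import pandas as pd

frame = pd.DataFrame({"CSA2010": ["Canton"], "hhpov18": [2.05]})

# Default settings: the row index is written as an extra, unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written; QUOTE_ALL quotes every field.
without_index = io.StringIO()
frame.to_csv(without_index, index=False, quoting=csv.QUOTE_ALL)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Saving with `index=False` keeps such leftover index columns out of any dataset that will be read back in and merged again.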

Example 3: Run Alone

{% raw %}
mergeDatasets()
{% endraw %}