# default_exp intaker
⚠️ This writing is a work in progress. The functions work. ⚠️
Please read everything on the main page before continuing, disclaimer and all.


About this Tutorial:
What's inside?
The Tutorial
In this notebook, the basics of data intake are introduced.
- Data will be imported using Colab's terminal commands and then loaded into Python's pandas.
- We will import geospatial data from Esri and then load it into GeoPandas.
- A variety of data formats will be imported.
Objectives
By the end of this tutorial users should have an understanding of:
- Importing data with pandas and geopandas
- Querying data from Esri
- Retrieving data programmatically
- This module assumes the data needs no handling prior to intake
- Loading data in a variety of formats
Background
Importing Data with Colab:
Instructions: Read all text and execute all code in order.
For this next example to work, we will need to import hypothetical csv files
Try It! Go ahead and try running the cell below.
#hide
!pip install nbdev
from nbdev.showdoc import *
Advanced
#export
import geopandas as gpd
import numpy as np
import pandas as pd
from dataplay import geoms
# hide
pd.set_option('max_colwidth', 20)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.precision', 2)
# Can read in a CSV URL but uses dataplay.geoms.readInGeometryData() for GeoJSON endpoints.
# Otherwise this tool assumes shp or pgeojson files have geom='geometry', in_crs=2248.
# Depending on interactivity the values should be
# coerce fillna(-1321321321321325)
# Returns
# export
class Intake:
    # 1. Recursively calls itself (in interactive mode) until something valid is given.
    # Calls readInGeometryData for geometry formats, or reads a CSV directly.
    # Returns df or False.
    @staticmethod
    def getData(url, interactive=False):
        escapeQuestionFlags = ["no", '', 'none']
        if Intake.isPandas(url): return url
        if str(url).lower() in escapeQuestionFlags: return False
        if interactive: print('Getting Data From: ', url)
        try:
            if [ele for ele in ['pgeojson', 'shp', 'geojson'] if ele in url]:
                df = geoms.readInGeometryData(url=url, porg=False, geom='geometry', lat=False, lng=False, revgeocode=False, save=False, in_crs=2248, out_crs=False)
            elif 'csv' in url: df = pd.read_csv(url)
            # If neither branch matched, df is unbound and the error is handled below.
            return df
        except Exception:
            if interactive: return Intake.getData(input("Error: Try Again? ( URL/ PATH or 'NO'/ ) "), interactive)
            return False

    # 1ai. A misnomer: tuples also pass. Returns Bool.
    @staticmethod
    def isPandas(df): return isinstance(df, (pd.DataFrame, gpd.GeoDataFrame, tuple))

    # a1. Used by Merge Lib. Returns (df, column) when the column is valid,
    # (df, False) when a listed column is missing, or (False, False) when no data was retrieved.
    # A single missing column in non-interactive mode is returned unchanged as (df, col).
    @staticmethod
    def getAndCheck(url, col='geometry', interactive=False):
        df = Intake.getData(url, interactive)  # Returns False or df
        if not Intake.isPandas(df):
            if interactive: print('No data was retrieved.', df)
            return False, False
        if isinstance(col, list):
            for colm in col:
                # Fix: getAndCheckColumn requires the interactive argument.
                if not Intake.getAndCheckColumn(df, colm, interactive):
                    if interactive: print('Exiting. Error on the column: ', colm)
                    return df, False
            # Fix: a list cannot be checked as a single column, so return once every entry passed.
            return df, col
        newcol = Intake.getAndCheckColumn(df, col, interactive)  # Returns False or col
        if not newcol:
            if interactive: print('Exiting. Error on the column: ', col)
            return df, col
        return df, newcol

    # a2. Returns Bool
    @staticmethod
    def checkColumn(dataset, column): return {column}.issubset(dataset.columns)

    # b1. Used by Merge Lib. Returns both datasets and the coerce status.
    @staticmethod
    def coerce(ds1, ds2, col1, col2, interactive):
        ds1, ldt, lIsNum = Intake.getdTypeAndFillNum(ds1, col1, interactive)
        ds2, rdt, rIsNum = Intake.getdTypeAndFillNum(ds2, col2, interactive)
        ds2 = Intake.coerceDtypes(lIsNum, rdt, ds2, col2, interactive)
        ds1 = Intake.coerceDtypes(rIsNum, ldt, ds1, col1, interactive)
        # Return the data and the coerce status
        return ds1, ds2, (ds1[col1].dtype == ds2[col2].dtype)

    # b2. Used by Merge Lib. Fills missing numeric values with a sentinel number.
    @staticmethod
    def getdTypeAndFillNum(ds, col, interactive):
        dt = ds[col].dtype
        isNum = dt == 'float64' or dt == 'int64'
        if isNum: ds[col] = ds[col].fillna(-1321321321321325)
        return ds, dt, isNum

    # b3. Used by Merge Lib. Converts an object key to float when the other key is numeric.
    @staticmethod
    def coerceDtypes(isNum, dt, ds, col, interactive):
        if isNum and dt == 'object':
            if interactive: print('Converting Key from Object to Int')
            ds[col] = pd.to_numeric(ds[col], errors='coerce')
            if interactive: print('Converting Key from Int to Float')
            ds[col] = ds[col].astype(float)
        return ds

    # a3. Returns False or col. In interactive mode it calls itself until a valid column is entered.
    @staticmethod
    def getAndCheckColumn(df, col, interactive):
        if Intake.checkColumn(df, col): return col
        if not interactive: return False
        else:
            print("Invalid column given: ", col)
            print(df.columns)
            print("Please enter a new column from the list above.")
            col = input("Column Name: ")
            return Intake.getAndCheckColumn(df, col, interactive)

u = Intake
rdf = Intake.getData('https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson')
rdf.head(1)
| | OBJECTID | CSA2010 | hhchpov15 | hhchpov16 | hhchpov17 | hhchpov18 | hhchpov19 | Shape__Area | Shape__Length | geometry |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Allendale/Irving... | 38.93 | 34.73 | 32.77 | 35.27 | 32.6 | 6.38e+07 | 38770.17 | POLYGON ((-76.65... |
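The key-coercion helpers in the Intake class (coerce, getdTypeAndFillNum, coerceDtypes) are easier to follow on a toy example. The sketch below approximates the same idea with plain pandas rather than calling the class; the two frames and their column names are hypothetical, and only the sentinel value is taken from the source.

```python
import pandas as pd

# Sentinel used in the source to replace missing numeric keys.
FILL_VALUE = -1321321321321325

# Hypothetical frames: a numeric key with a gap, and the same key stored as text.
left = pd.DataFrame({"key": [1.0, 2.0, None], "name": ["a", "b", "c"]})
right = pd.DataFrame({"key": ["1", "2", "oops"], "value": [10, 20, 30]})

# Numeric key: fill missing values with the sentinel.
left["key"] = left["key"].fillna(FILL_VALUE)

# Text key: parse to numbers (unparseable entries become NaN), then cast to float.
right["key"] = pd.to_numeric(right["key"], errors="coerce").astype(float)

# Both keys now share a dtype, so a merge compares like with like.
same_dtype = left["key"].dtype == right["key"].dtype
merged = left.merge(right, on="key", how="inner")
print(same_dtype, len(merged))
```

Only the rows whose keys parse to the same number survive the inner merge; the sentinel row and the unparseable row have no partner.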
Here we can save the data so that it may be used in later tutorials.
# string = 'test_save_data_with_geom_and_csa'
# .to_csv(string+'.csv', encoding="utf-8", index=False, quoting=csv.QUOTE_ALL)
Download data by:
- Clicking the 'Files' tab in the left hand menu of this screen. Locate your file within the file explorer that appears directly under the 'Files' tab button once clicked. Right click the file in the file explorer and select the 'download' option from the dropdown.
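A minimal sketch of the save step mentioned above, assuming a small stand-in DataFrame (hypothetical values) in place of the frame loaded earlier. It uses the same to_csv arguments as the commented-out line and writes to a temporary directory so that it leaves no files behind; in Colab you would write to the working directory instead so the file shows up in the 'Files' tab.

```python
import csv
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the frame loaded earlier in the tutorial.
df = pd.DataFrame({"CSA2010": ["Canton"], "hhpov19": [2.22]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test_save_data_with_geom_and_csa.csv")
    # index=False drops the row index; QUOTE_ALL wraps every field in quotes.
    df.to_csv(path, encoding="utf-8", index=False, quoting=csv.QUOTE_ALL)
    with open(path, encoding="utf-8") as fh:
        lines = fh.read().splitlines()

print(lines)
```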
You can upload this data into the next tutorial in one of two ways:
- Upload the saved file to Google Drive and connect to your Drive path,
OR
- Download the dataset as directed above, navigate to the next tutorial, and upload the file using the 'upload' button in the 'Files' tab in the left-hand menu of that page.
The next tutorial will teach you how to load this data so that it may be mapped.
Here are some examples:
Using Esri and the Geoms handler directly:
import dataplay
geoloom_gdf_url = "https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Geoloom_Crowd/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson"
geoloom_gdf = dataplay.geoms.readInGeometryData(url=geoloom_gdf_url, porg=False, geom='geometry', lat=False, lng=False, revgeocode=False, save=False, in_crs=4326, out_crs=False)
geoloom_gdf = geoloom_gdf.dropna(subset=['geometry'])
geoloom_gdf.head(1)
| | OBJECTID | Data_type | Attach | ProjNm | Descript | Location | URL | Name | PhEmail | Comments | POINT_X | POINT_Y | GlobalID | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Artists & Resources | None | Joe | Test | 123 Market Pl, B... | | | | | -8.53e+06 | 4.76e+06 | e59b4931-e0c8-4d... | POINT (-76.60661... |
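The Esri URLs in this tutorial all end with the same query string (where=1%3D1, outFields=*, returnGeometry=true, f=pgeojson). As a rough sketch, that string can be assembled with the standard library instead of typed by hand. The helper name build_query_url and its parameters are hypothetical; the service root and the layer name are copied from the URL above.

```python
from urllib.parse import urlencode

# Service root copied from the URLs in this tutorial.
BASE = "https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services"

def build_query_url(layer, where="1=1", out_fields="*", fmt="pgeojson"):
    """Hypothetical helper: assemble a FeatureServer query URL for one layer."""
    params = {
        "where": where,
        "outFields": out_fields,
        "returnGeometry": "true",
        "f": fmt,
    }
    # safe="*" keeps the wildcard readable; "=" in the where clause becomes %3D.
    return f"{BASE}/{layer}/FeatureServer/0/query?{urlencode(params, safe='*')}"

geoloom_url = build_query_url("Geoloom_Crowd")
print(geoloom_url)
```

Swapping the layer name (for example "Hhpov") yields the other endpoints used in this notebook.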
Again but with the Intake class:
u = Intake
Geoloom_Crowd, rcol = u.getAndCheck('https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Geoloom_Crowd/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson')
Geoloom_Crowd.head(1)
| | OBJECTID | Data_type | Attach | ProjNm | Descript | Location | URL | Name | PhEmail | Comments | POINT_X | POINT_Y | GlobalID | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Artists & Resources | None | Joe | Test | 123 Market Pl, B... | | | | | -8.53e+06 | 4.76e+06 | e59b4931-e0c8-4d... | POINT (-76.60661... |
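Intake.getData decides how to read a source by looking for format keywords inside the URL or path. The standalone sketch below mirrors that branching without loading anything; pick_reader is a hypothetical name, and it returns None where getData would return False.

```python
def pick_reader(source):
    """Hypothetical helper mirroring the branching in Intake.getData."""
    text = str(source)
    if text.lower() in ("no", "", "none"):
        return None            # escape flags: the caller gave up
    if any(tag in text for tag in ("pgeojson", "shp", "geojson")):
        return "geometry"      # would be handed to geoms.readInGeometryData
    if "csv" in text:
        return "csv"           # would be handed to pd.read_csv
    return None                # unrecognized format

print(pick_reader("Hhpov.csv"))
```

Because the check is a plain substring match, a path that merely contains "csv" or "shp" somewhere in its name is routed the same way as a real file of that type, which is worth keeping in mind when naming files.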
The getAndCheck function is useful for checking that a required field is present.
Hhpov, rcol = u.getAndCheck('https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson', 'hhpov19', True)
Hhpov = Hhpov[['CSA2010', 'hhpov15', 'hhpov16', 'hhpov17', 'hhpov18', 'hhpov19']]
# Hhpov.to_csv('Hhpov.csv')
Hhpov.head()
Getting Data From: https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson
| | CSA2010 | hhpov15 | hhpov16 | hhpov17 | hhpov18 | hhpov19 |
|---|---|---|---|---|---|---|
| 0 | Allendale/Irving... | 24.15 | 21.28 | 20.70 | 23.00 | 19.18 |
| 1 | Beechfield/Ten H... | 11.17 | 11.59 | 10.47 | 10.90 | 8.82 |
| 2 | Belair-Edison | 18.61 | 19.59 | 20.27 | 22.83 | 22.53 |
| 3 | Brooklyn/Curtis ... | 28.36 | 26.33 | 24.21 | 21.54 | 24.60 |
| 4 | Canton | 3.00 | 2.26 | 3.66 | 2.05 | 2.22 |
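The required-field check used above comes down to a set-membership test against the frame's columns, the same idea as Intake.checkColumn. Here is a small sketch on a hypothetical frame; has_columns is an illustrative name, not part of the module.

```python
import pandas as pd

# Hypothetical frame with two of the columns used in this tutorial.
df = pd.DataFrame({"CSA2010": ["Canton"], "hhpov19": [2.22]})

def has_columns(frame, required):
    """Return True when every name in `required` is a column of `frame`."""
    return set(required).issubset(frame.columns)

found = has_columns(df, ["hhpov19"])
missing = has_columns(df, ["CSA2010", "hhpov20"])
print(found, missing)
```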
We could also retrieve from a file.
u = Intake
# rdf = u.getData('Hhpov.csv')
rdf.head()
| | Unnamed: 0 | CSA2010 | hhpov15 | hhpov16 | hhpov17 | hhpov18 | hhpov19 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Allendale/Irving... | 24.15 | 21.28 | 20.70 | 23.00 | 19.18 |
| 1 | 1 | Beechfield/Ten H... | 11.17 | 11.59 | 10.47 | 10.90 | 8.82 |
| 2 | 2 | Belair-Edison | 18.61 | 19.59 | 20.27 | 22.83 | 22.53 |
| 3 | 3 | Brooklyn/Curtis ... | 28.36 | 26.33 | 24.21 | 21.54 | 24.60 |
| 4 | 4 | Canton | 3.00 | 2.26 | 3.66 | 2.05 | 2.22 |
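The "Unnamed: 0" column in the output above most likely comes from saving the frame with its row index (the default for to_csv) and reading it back without naming an index column. The sketch below reproduces that round trip in memory with a hypothetical one-row frame and shows that index=False avoids the extra column.

```python
import io

import pandas as pd

# Hypothetical one-row frame standing in for the saved data.
df = pd.DataFrame({"CSA2010": ["Canton"], "hhpov19": [2.22]})

# Default to_csv writes the row index as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# index=False leaves the row index out of the file.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```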