# default_exp tidyaddr
Notes on Security:
- Colab runs in a free, secure, and private VM hosted by Google (a tiered service).
- All data in the VM is lost when the VM shuts down, which happens a minute or so after the user leaves the Colab.
1. Upload a CSV file
- Execute the next cell block to run tidyAddr.
- You will be prompted to upload a CSV file. It must contain a single column whose header is exactly 'address'.
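The single-column requirement can be checked before uploading. The sketch below is only an illustration: `has_only_address_column` is a hypothetical helper (not part of tidyAddr), and the sample rows are made up.

```python
import csv
import io

def has_only_address_column(csv_text):
    """Return True if the CSV header is exactly one column named 'address'."""
    # Hypothetical helper for illustration; tidyAddr itself does not provide it.
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)  # None when the text is empty
    return header == ["address"]

# Made-up sample data: one valid file and one with an extra column.
good = "address\n100 Example St\n"
bad = "id,address\n1,100 Example St\n"
print(has_only_address_column(good), has_only_address_column(bad))
```

The check is case-sensitive on purpose, since the header has to match 'address' exactly.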
print(' ~~~~~~~~~~~ UPLOAD A FILE PLEASE ~~~~~~~~~~~')
import os
from google.colab import files
uploaded = files.upload()
uploadedFilename = "AFile_2020_addronly.csv"
# d = list( uploaded.keys() )[0]
#export
# Input: A CSV with a single column titled 'address' (may require pre-processing to get it into this form).
# Output: A CSV with a single column of text that can be split into 5 columns in Excel.
# - The output needs to be re-merged with any other columns that were removed before running tidyAddr.
import os
def runTidyAddr( filename, newfilename ):
    # Clone tidyaddr-js, install its dependencies, and clean `filename` into `newfilename`.
    print( filename, newfilename )
    print(' ~~~~~~~~~~~ Getting TIDYADDR ~~~~~~~~~~~ ')
    # Remove any previous checkout so the clone starts fresh.
    ! rm -rf tidyaddr-js
    ! git clone https://github.com/BNIA/tidyaddr-js.git
    print(' ~~~~~~~~~~~ Installing TIDYADDR ~~~~~~~~~~~')
    ! echo ORIGINAL_DIRECTORY && echo $(ls)
    # Copy the uploaded CSV(s) into the checkout, then install the Node dependencies.
    ! cp *.csv tidyaddr-js/ && cd tidyaddr-js && npm install
    print(' ~~~~~~~~~~~ Running TIDYADDR ~~~~~~~~~~~')
    ! echo TIDYADDR_DIRECTORY && cd tidyaddr-js && echo $(ls)
    txt = "cd tidyaddr-js && node tidyaddr.js clean-csv " + filename + " " + newfilename
    os.system(txt)
runTidyAddr( uploadedFilename, './'+uploadedFilename.replace(".", "_tidyaddred.") )
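The call above builds the output name by replacing every "." in the filename, which produces an odd result if the name contains more than one dot. A minimal sketch of an alternative that only touches the extension follows; `tidy_output_name` is a hypothetical helper introduced here for illustration.

```python
import os

def tidy_output_name(filename, suffix="_tidyaddred"):
    """Insert `suffix` just before the file extension."""
    # Hypothetical helper; os.path.splitext splits only at the last dot.
    root, ext = os.path.splitext(filename)
    return root + suffix + ext

print(tidy_output_name("AFile_2020_addronly.csv"))
# prints: AFile_2020_addronly_tidyaddred.csv
```

For a single-dot name this gives the same result as the `replace` call; it differs only when the name has extra dots.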
- The output will be a single text column. It can be split on ";" in Excel to break it out into 5 or so columns of data.
- This output (once split into columns) will need to be re-appended to the original dataset to produce a final, complete version.
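The split-and-re-append step can also be done in pandas instead of Excel. The sketch below rests on several assumptions: the tidy output keeps the same row order as the input, the column names `tidy_0`, `tidy_1`, ... are placeholders, the sample values are made up and do not reflect tidyAddr's actual output format, and `reappend_tidy_output` is a hypothetical helper.

```python
import pandas as pd

def reappend_tidy_output(original, tidy_text, sep=";"):
    """Split the single tidy column on `sep` and join it onto `original` by row position."""
    # Assumption: row i of the tidy output corresponds to row i of the original data.
    if len(original) != len(tidy_text):
        raise ValueError("row counts differ; cannot align by position")
    parts = tidy_text.astype(str).str.split(sep, expand=True)
    # Placeholder column names; the real field names are not given in these notes.
    parts.columns = [f"tidy_{i}" for i in range(parts.shape[1])]
    parts.index = original.index
    return pd.concat([original, parts], axis=1)

# Made-up data standing in for the original dataset and the tidy output column.
original = pd.DataFrame({"id": [1, 2], "address": ["addr one", "addr two"]})
tidy_text = pd.Series(["a1;b1;c1;d1;e1", "a2;b2;c2;d2;e2"])
merged = reappend_tidy_output(original, tidy_text)
print(merged.columns.tolist())
```

Joining by position is the simplest option, but it silently mismatches rows if either file is reordered or filtered in between, hence the row-count check.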
ls
Debugging
The following section is Python code. It's not part of tidyAddr, but I use it to quickly inspect the dataframe in case there is a problem.
import pandas as pd
pd.read_csv('bnia_snap_for_2019.csv')
df = pd.read_csv('bnia_tanf_2019_20208_by_zipcodes.csv')
df.head(1)
ls