# default_exp closecrawl
Intro
The pipeline runs in this order: CLI (cliargs) > Main > Spider > Miner > Cleaner.
To calculate foreclosures we use mortgage foreclosures (case type O) as opposed to tax foreclosures (case type C).
We have only ever looked at tax foreclosures for specific projects like BOLD.
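For orientation, the whole pipeline boils down to a single call to close_crawl(), which is defined in the Main section further down; the sketch below uses illustrative parameter values only.
# Illustrative only -- close_crawl() is defined in the Main section of this notebook.
# CLI args -> close_crawl() -> Spider (scrape) -> Miner (mine) -> Cleaner (clean)
close_crawl(
    case_type='O',               # 'O' = mortgage foreclosure, 'C' = tax sale foreclosure
    case_year='2017',
    output='mortgage_2017.csv',  # illustrative output path
    lower_bound=1, upper_bound=1500,
    scrape=True, mine=True, clean=True,
)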
Instructions for obtaining foreclosure case data from MD Case Search with the web scraper (these instructions are for the scraper designed in 2017 and forward):
Go to GitHub and download the files to install the web scraper on your desktop (or you can run it directly from GitHub): https://github.com/BNIA/Close-Crawl
The usage interface is described here: https://github.com/BNIA/Close-Crawl/blob/master/docs/cli/README.md
1. Enter case type: C for tax sale foreclosures, O for mortgage foreclosures.
2. Enter year (must be 4 digits).
3. Enter the name of the output file (must be a CSV file path; this will be saved onto your desktop).
4. Enter 0 for manual parameters or 1 for automatic (0 is recommended).
5. Enter the lower bound for the cases to scrape (1-4 digit integer).
   A. If you are scraping cases from a year for the first time, start with "1".
   B. In all other scenarios, look at the last data scraped in the raw folder of that dataset. Find the case number of the last case that was scraped, take the last four digits, and go up by one. For example, if the last case scraped was case number 24O17002002, make the lower bound 2003 (see the short helper sketch after these instructions).
6. Enter the upper bound (1-4 digit integer). It is recommended not to scrape too many at a time, as this sometimes results in errors. Scrape about 1500 at a time, so the upper bound should be about 1500 higher than your lower bound.
   A. Sometimes it is difficult to know how far to scrape because you don't know how many total cases there are for a given year. One suggestion is to look at the dates of the cases you have scraped: if you have scraped cases up to December 30 of that year, then you have scraped all of the cases. To be safe, put in a higher upper bound than the last cases for December; if nothing comes back, you will know you got them all. This can be a bit of trial and error because there may be big gaps in the case numbers for the foreclosure cases we want to scrape.
7. Enter 1 for debug.
8. Enter 1 for scrape.
9. Enter 1 for mine.
10. Enter 1 for clean.
Save the completed file in the appropriate folder for raw data. Make sure mortgage foreclosure data is saved with mortgage data and tax with tax. Additional files are saved to your desktop during the scraping process; these can be moved to the trash after the raw file is saved.
*NOTE: If you are scraping a lot of data, be sure to give each file path a unique name. Saving multiple scraping sessions under the same CSV name will cause errors.
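As a sanity check for step 5B, the next lower bound is just the last four digits of the most recent case number plus one. The hypothetical helper below (not part of Close Crawl) illustrates the arithmetic:
# Hypothetical helper, not part of Close Crawl: next lower bound from the last scraped case number.
def next_lower_bound(last_case):
    """E.g. '24O17002002' -> 2003 (last four digits, plus one)."""
    return int(last_case[-4:]) + 1

print(next_lower_bound("24O17002002"))  # 2003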
Background & Resources
MD Access to Judicial Records
mdcourts
Other Helpful search tips:
http://www.registers.state.md.us. EstateSearch provides public Internet access to information from estate records maintained by the Maryland Registers of Wills. This information includes decedent’s name, estate number and status, date of death, date of filing, personal representative, attorney, decedent alias, and docket history. All records are available from 1998 to present for all estates filed with the Register of Wills’ office. The data is updated daily at the end of the business day.
Use a % as a wildcard when searching in a field (Smith%) would give you all names that start with Smith, Smithson, Smithsburg, Smithman, etc.
When searching for a date range you need to enter a last name or first name (partials allowed)
You can sort the columns by clicking on the column header.
Click the Search again option to take you back to your previous search criteria.
Use the clear button to clear all fields and begin your search again.
List of Estate Types
- Regular Estate (RE) - Assets subject to administration in excess of $30,000 ($50,000 if the spouse is the sole legatee or heir).
- Regular Estate Judicial (RJ) - A proceeding conducted by the Orphans' Court when matters cannot be handled administratively. For example, when the validity of the will is at issue, or the will is lost, stolen or damaged.
- Small Estate (SE) - Assets subject to administration valued at $30,000 or less ($50,000 if the spouse is the sole legatee or heir).
- Small Estate Judicial (SJ) - A proceeding conducted by the Orphans' Court when matters cannot be handled administratively. For example, when the validity of the will is at issue, or the will is lost, stolen or damaged.
- Foreign Proceeding (FP) - Decedent domiciled out of state with real property in Maryland.
- Motor Vehicle (MV) - Transfer of motor vehicle only.
- NonProbate (NP) - Property of the decedent which passes by operation of law such as a joint tenancy, tenants by the entireties, or property passing under a deed or trust, revocable or irrevocable. Non probate property must be reported to the Register of Wills on the Information Report or Application to Fix Inheritance Tax on Non-Probate assets.
- Unprobated Will Only (UN) - Will and Information Report filed with will and/or Application to Fix Inheritance Tax.
- Modified Administration (MA) - A procedure available when the residuary legatees consist of the personal representative, spouse, and children, the estate is solvent, and final distribution can occur within 12 months from the date of appointment. A verified final report is filed within 10 months from the date of appointment.
- Guardianship Estate (GE) - Guardianship of property for a minor.
- Limited Order (LO) – A limited order to locate assets or a will.
Setup
! pip install mechanicalsoup
! pip install urlopen

#export
import mechanicalsoup
mechanicalsoup.__version__

from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive
cd MyDrive/vitalSigns/close_crawl
ls

# For relative imports to work in Python 3.6
import os, sys
os.path.realpath('./')
sys.path.append(os.path.dirname(os.path.realpath('/MyDrive/vitalSigns/close_crawl/modules/cleaner.py')))
sys.path
Scrape
#export
"""settings
Configuration settings and global variables for the entire project. This file
is intended to only be used as a non-executable script.
"""
from os import path
# browser settings
HEADER = ("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1)"
" Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1")
URL = "http://casesearch.courts.state.md.us/casesearch/"
CASE_PAT = "24{type}{year}00{num}"
# scraping parameters
CASE_TYPES = ('O', 'C')
# temporary directory settings
HTML_DIR = "responses"
HTML_FILE = path.join(HTML_DIR, "{case}")
# log file settings
CHECKPOINT = "checkpoint.json"
NO_CASE = "no_case.json"
# data mining settings
FEATURES = [
"Filing Date",
"Case Number",
"Case Type",
"Title",
"Plaintiff",
"Defendant",
"Address",
"Business or Organization Name",
"Party Type",
]
FIELDS = FEATURES + [ "Zip Code", "Partial Cost" ]
INTERNAL_FIELDS = [ "Business or Organization Name", "Party Type"]
Local Browser
#export
"""local_browser
NOTICE: Close Crawl runs its browser form submissions through Mechanize.
The module, however, is deprecated and does not support Python 3. The more
stable and maintained Mechanize and BeautifulSoup wrapper, MechanicalSoup,
will be replacing the Mechanize methods to support Python 3.
This module contains the configurations and settings for the browser used for
crawling and scraping through the pages in Close Crawl. The script contains the
implementation of the Session class which inherits attributes from the classobj
mechanize.Browser()
The script works as an internal module for Close Crawl, but can be imported
as a module for testing purposes.
TODO:
Replace deprecated Mechanize with MechanicalSoup
Fork Mechanize to support Python3
"""#export
from __future__ import absolute_import, print_function, unicode_literals
# import cookielib
import http.cookiejar as cookielib # for Python3
import warnings
from urllib.request import urlopen
# from urllib import urlopen urllib.request
## from mechanize import Browser, _http
import mechanicalsoup
# from settings import HEADER, URL
warnings.filterwarnings("ignore", category=UserWarning)

#export
class Session(object):
def __init__(self):
"""Constructor
Args:
None
Attributes:
            browser (`mechanicalsoup.StatefulBrowser`): browser object in session
"""
self.browser = mechanicalsoup.StatefulBrowser()
# set error and debug handlers for the browser
# cookie jar
self.browser.set_cookiejar(cookielib.LWPCookieJar())
# browser options
# self.browser.set_handle_equiv(True)
# self.browser.set_handle_gzip(True)
# self.browser.set_handle_redirect(True)
# self.browser.set_handle_referer(True)
# self.browser.set_handle_robots(False)
# follows refresh 0 but doesn't hang on refresh > 0
#self.browser.set_handle_refresh( _http.HTTPRefreshProcessor(), max_time=1 )
# user-Agent
# self.browser.addheaders = [("User-agent", HEADER)]
def close(self):
"""Destructor for Session. Closes current browser session
Args:
None
Returns:
None
"""
self.browser.close()
def case_id_form(self, case):
"""Grabs the form in the case searching page, and inputs the
case number to return the response.
Args:
case (`str`): case ID to be scraped
Returns:
response (`str`): HTML response
"""
# iterate through the forms to find the correct one
#for form in self.browser.forms():
# if form.attrs["name"] == "inquiryFormByCaseNum":
# self.browser.form = form
# break
self.browser.select_form('form[action="/casesearch/inquiryByCaseNum.jis"]')
# submit case ID and return the response
self.browser["caseId"] = case
response = self.browser.submit_selected()
response = response.text
# if any( case_type in response.upper() for case_type in ("FORECLOSURE", "FORECLOSURE RIGHTS OF REDEMPTION", "MOTOR TORT") ): print (response.upper)
self.browser.open("http://casesearch.courts.state.md.us/casesearch/inquiryByCaseNum.jis")
# , "MOTOR TORT"
return response if any(
case_type in response.upper() for case_type in
("FORECLOSURE", "FORECLOSURE RIGHTS OF REDEMPTION")
) else False
def disclaimer_form(self):
"""Navigates to the URL to proceed to the case searching page
Args:
None
Returns:
None
"""
# visit the site
print(URL)
self.browser.open("http://casesearch.courts.state.md.us/casesearch/")
# select the only form on the page
self.browser.select_form('form')
# select the checkbox
self.browser["disclaimer"] = ['Y']
# submit the form
self.browser.submit_selected()
@staticmethod
def server_running():
"""Checks the status of the Casesearch servers
Args:
None
Returns:
`True` if server is up, `False` otherwise
"""
return urlopen(URL).getcode() == 200
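A minimal smoke test of the Session wrapper above (assuming the settings cell has run and the notebook has network access; the case ID is just the example number from the instructions):
# Minimal Session smoke test (network access required).
if Session.server_running():
    sesh = Session()
    sesh.disclaimer_form()                    # accept the disclaimer once per session
    html = sesh.case_id_form("24O17002002")   # HTML text, or False for non-foreclosure cases
    print(bool(html))
    sesh.close()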
Spider
#export
from __future__ import absolute_import, print_function, unicode_literals
from json import dumps, load
from os import path, makedirs
from random import uniform
import sys
from time import sleep
from tqdm import trange
# from local_browser import Session
# from settings import CASE_PAT, CHECKPOINT, HTML_DIR, HTML_FILE

#export
class Spider(object):
def __init__(self, case_type, year, bounds=range(1, 6), gui=False):
# initial disclaimer page for terms and agreements
self.browser = Session()
if not self.browser.server_running():
sys.exit("Server is unavailable at the moment")
print (self.browser)
self.browser.disclaimer_form()
self.WAITING_TIME = 0
self.case_type = case_type
self.year = year
self.bounds = bounds
if not path.exists(HTML_DIR):
makedirs(HTML_DIR)
def save_response(self):
case_range = trange(
len(self.bounds), desc="Crawling", leave=True
)
for case_num in case_range:
if case_num and not case_num % 500:
print("500 CASES SCRAPED. SCRIPT WILL WAIT 5 MINUTES TO RESUME")
for i in range(150, 0, -1):
sleep(1)
sys.stdout.write('\r' + "%02d:%02d" % divmod(i, 60))
sys.stdout.flush()
case = CASE_PAT.format(
type=self.case_type,
year=self.year,
num="{:04d}".format(int(str(self.bounds[case_num])[-4:]))
)
try:
wait = uniform(0.0, 0.5)
sleep(wait)
self.WAITING_TIME += wait
case_range.set_description("Crawling {}".format(case))
stripped_html = self.browser.case_id_form(case)
# print('returend this ' , stripped_html)
if stripped_html:
with open(
HTML_FILE.format(case=case) + ".html", 'w'
) as case_file:
case_file.write(str(stripped_html))
# pause process
except KeyboardInterrupt:
self.dump_json({
"error_case":
"{:04d}".format(int(str(self.bounds[case_num])[-4:])),
"year": self.year,
"type": self.type
})
print("Crawling paused at", case)
break
# case does not exist
except IndexError:
self.dump_json({"error_case": case})
print(case, "does not exist")
break
# close browser and end session
self.close_sesh()
@staticmethod
def dump_json(data):
with open(CHECKPOINT, "r+") as checkpoint:
checkpoint_data = load(checkpoint)
            for key, val in data.items():
                checkpoint_data[key] = val
checkpoint.seek(0)
checkpoint.write(dumps(checkpoint_data))
checkpoint.truncate()
def close_sesh(self):
self.browser.close()
Run the spider to scrape the case responses into the responses folder.
#export
scrape = True
if scrape:
spider = Spider(
case_type='O', year=20,
bounds=range(0000, 9000), gui=False
)
spider.save_response()
wait = spider.WAITING_TIME
Miner
Patterns
#export
"""Patterns
Regular expression patterns and string filtering functions implemented in the
project. This file is intended to only be used as a non-executable script.
TODO:
Finish docs
(\d{1,4}\s[\w\s]{1,20}((?:st(reet)?|ln|lane|ave(nue)?|r(?:oa)?d|highway|hwy|
dr(?:ive)?|sq(uare)?|tr(?:ai)l|c(?:our)?t|parkway|pkwy|cir(cle)?|ter(?:race)?|
boulevard|blvd|pl(?:ace)?)\W?(?=\s|$))(\s(apt|block|unit)\W?([A-Z]|\d+))?)
"""#export
from re import compile as re_compile
from re import I as IGNORECASE
from string import punctuation

#export
PUNCTUATION = punctuation.replace('#', '') # all punctuations except '#'
street_address = re_compile(
"(" # begin regex group
"\d{1,4}\s" # house number
"[\w\s]{1,20}" # street name
"(" # start street type group
"(?:st(reet)?|ln|lane|ave(nue)?" # (st)reet, lane, ln, (ave)nue
"|r(?:oa)?d|highway|hwy|dr(?:ive)?" # rd, road, hwy, highway, (dr)ive
"|sq(uare)?|tr(?:ai)l|c(?:our)?t" # (sq)uare, (tr)ail, ct, court
"|parkway|pkwy|cir(cle)?|ter(?:race)?" # parkway, pkwy, (cir)cle, (ter)race
"|boulevard|blvd|pl(?:ace)?" # boulevard, bvld, (pl)ace
"\W?(?=\s|$))" # look ahead for whitespace or end of string
")" # end street type group
"(\s(apt|block|unit)(\W|#)?([\d|\D|#-|\W])+)?" # apt, block, unit number
")", # end regex group
IGNORECASE # case insensitive flag
)
# case insensitive delimiter for Titles
TITLE_SPLIT_PAT = re_compile(" vs ", IGNORECASE)
# pattern for Baltimore zip codes
ZIP_STR = "2\d{4}"
ZIP_PAT = re_compile(ZIP_STR)
# regex pattern to capture monetary values between $0.00 and $999,999,999.99
# punctuation insensitive
MONEY_STR = "\$\d{,3},?\d{,3},?\d{,3}\.?\d{2}"
MONEY_PAT = re_compile(MONEY_STR)
NULL_ADDR = re_compile(
"^("
"(" + MONEY_STR + ")"
"|(" + ZIP_STR + ")"
"|(\d+)"
"|(" + ZIP_STR + ".*" + MONEY_STR + ")"
")$",
IGNORECASE
)
STRIP_ADDR = re_compile(
"(balto|" + ZIP_STR + "|md|" + MONEY_STR + ").*",
IGNORECASE
)
def filter_addr(address):
try:
return ''.join(
street_address.search(
address.translate( str.maketrans('','', PUNCTUATION) )
).group(0)
)
except AttributeError:
return ''
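A few quick checks of the patterns above; the street address and dollar amount are made up for illustration:
# Quick pattern checks (made-up values).
print(filter_addr("1234 N Example Ave Baltimore MD 21218"))  # '1234 N Example Ave'
print(ZIP_PAT.findall("Baltimore MD 21218"))                 # ['21218']
print(MONEY_PAT.findall("Partial cost: $1,234.56"))          # ['$1,234.56']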
Miner
#export
"""Miner"""
from __future__ import absolute_import, print_function, unicode_literals
from csv import DictWriter
from json import dump, dumps, load
from os import path
from bs4 import BeautifulSoup
from tqdm import trange
# from patterns import MONEY_PAT, TITLE_SPLIT_PAT, ZIP_PAT, filter_addr
# from settings import HTML_FILE, NO_CASE

#export
# data mining settings
FEATURES = [
"Filing Date",
"Case Number",
"Case Type",
"Title",
"Plaintiff",
"Defendant",
"Address",
"Business or Organization Name",
"Party Type",
]
FIELDS = FEATURES + [ "Zip Code", "Partial Cost" ]
INTERNAL_FIELDS = [ "Business or Organization Name", "Party Type"]
class Miner(object):
def __init__(self, responses, output, debug=False):
self.responses = responses
self.output = output
self.debug = debug
self.dataset = []
self.maybe_tax = False
self.features = [i + ':' for i in FEATURES]
def scan_files(self):
case_range = trange(len(self.responses), desc="Mining", leave=True) \
if not self.debug else range(len(self.responses))
for file_name in case_range:
with open(
HTML_FILE.format(case=self.responses[file_name]), 'r'
) as html_src:
if not self.debug:
case_range.set_description(
"Mining {}".format(self.responses[file_name])
)
feature_list = self.scrape(html_src.read())
row = self.distribute(feature_list)
if not row:
if not path.isfile(NO_CASE):
with open(NO_CASE, 'w') as no_case_file:
dump([], no_case_file)
with open(NO_CASE, "r+") as no_case_file:
no_case_data = load(no_case_file)
no_case_data.append(str(self.responses[file_name][:-5]))
no_case_file.seek(0)
no_case_file.write(dumps(sorted(set(no_case_data))))
no_case_file.truncate()
self.dataset.extend(row)
def export(self):
file_exists = path.isfile(self.output)
with open(self.output, 'a') as csv_file:
writer = DictWriter(
csv_file,
fieldnames=[
col for col in FIELDS if col not in INTERNAL_FIELDS
]
)
if not file_exists:
writer.writeheader()
for row in self.dataset:
writer.writerow(row)
def scrape(self, html_data):
"""Scrapes the desired features
Args:
html_data:
, source HTML
Returns:
scraped_features: , features scraped and mapped from content
"""
soup = BeautifulSoup(html_data, "html.parser")
        # Search for the word 'tax' in the document
if "tax" in soup.text.lower():
self.maybe_tax = True
# Create an array from all TR's with the inner HTML for each TR
# Data we want is stored inside an arbitrary # of 'span' tags inside the TR's.\
tr_list = soup.find_all("tr")
# This will create an array for each TR with an array of SPAN values inside.
feature_list = []
for tag in tr_list:
try:
# Create an innerhtml array for all spans within a single TR
tag = [j.string for j in tag.findAll("span")]
if set(tuple(tag)) & set(self.features):
try:
# Save the spans inner HTML if its not a header label
tag = [i for i in tag if "(each" not in i.lower()]
except AttributeError:
continue
feature_list.append(tag)
except IndexError:
continue
# feature_list is an array [tr] of arrays [spans]. we want this flattened.
# [tr1span1KEY, tr1span1VALUE, tr1span2KEY, tr1span2VALUE, tr2span1KEY, tr2span1VALUE, ]
try:
# flatten multidimensional list
feature_list = [
item.replace(':', '')
for sublist in feature_list for item in sublist
]
except AttributeError:
pass
return feature_list
def distribute(self, feature_list):
        # feature_list ~= [html][tr][spans].innerHTML
# [tr1span1KEY, tr1span1VALUE, tr1span2KEY, tr1span2VALUE, tr2span1KEY, tr2span1VALUE, ]
def __pair(list_type):
# break up elements with n-tuples greater than 2
def __raw_business(i): return any( x in feature_list[i:i + 2][0] for x in INTERNAL_FIELDS )
def __feature_list(i): return feature_list[i:i + 2][0] in FEATURES
condition = __raw_business if list_type else __feature_list
# then convert list of tuples to dict for faster lookup
return [ tuple(feature_list[i:i + 2]) for i in range(0, len(feature_list), 2) if condition(i) ]
raw_business = __pair(1) # [(x1,y1),(x2,y2),(x3,y3)] => INTERNAL_FIELDS
feature_list = dict(__pair(0)) # FEATURES
filtered_business = []
# Party_Type = 'property address' not 'plaintiff' or 'defendant'
# Input exists 'Business or Org Name' and == an Address
for label, value in enumerate(raw_business):
try:
party_type = value[1].upper()
section = raw_business[label + 1][0].upper()
flag1 = party_type == "PROPERTY ADDRESS"
flag2 = section == "BUSINESS OR ORGANIZATION NAME"
if flag1 and flag2: filtered_business.append(raw_business[label + 1])
except IndexError:
print("Party Type issue at Case", feature_list["Case Number"])
scraped_features = []
for address in filtered_business:
str_address = filter_addr(str(address[-1]))
temp_features = {
key: value for key, value in feature_list.items()
if key in ["Title", "Case Type", "Case Number", "Filing Date"]
}
if temp_features["Case Type"].upper() == "FORECLOSURE":
temp_features["Case Type"] = "Mortgage"
elif temp_features["Case Type"].upper() == \
"FORECLOSURE RIGHTS OF REDEMPTION" and self.maybe_tax:
temp_features["Case Type"] = "Tax"
else:
# break out of the rest of the loop if case type is neither
continue
if 'Title' not in temp_features:
# print('feature_list');
# print(feature_list);
# print('\n \n raw_business');
# print(raw_business);
continue
# break up Title feature into Plaintiff and Defendant
try:
temp_features["Plaintiff"], temp_features["Defendant"] = \
TITLE_SPLIT_PAT.split(temp_features["Title"])
except ValueError:
temp_features["Plaintiff"], temp_features["Defendant"] = \
(", ")
temp_features["Address"] = \
str_address if str_address else address[-1]
temp_features["Zip Code"] = ''.join(ZIP_PAT.findall(address[-1]))
temp_features["Partial Cost"] = ''.join(
MONEY_PAT.findall(address[-1])
)
scraped_features.append(temp_features)
temp_features = {}
return scraped_features
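To make the pairing inside distribute() concrete: the flattened span list alternates labels and values, and the inner __pair() helper zips consecutive elements into (label, value) tuples. A small sketch with hypothetical data:
# Sketch of the label/value pairing performed inside Miner.distribute() (hypothetical data).
flat = ["Case Number", "24O17002002", "Case Type", "FORECLOSURE", "Filing Date", "01/01/2017"]
pairs = [tuple(flat[i:i + 2]) for i in range(0, len(flat), 2)]
print(dict(pairs))  # {'Case Number': '24O17002002', 'Case Type': 'FORECLOSURE', 'Filing Date': '01/01/2017'}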
Now take each HTML file from the responses folder and mine it for data.
#export
# from settings import CHECKPOINT, HTML_DIR
from os import path, remove, walk
temp_output = "temp_data.csv"
file_array = [filenames for (dirpath, dirnames, filenames) in walk(HTML_DIR)][0]
top5 = file_array[:1]
miner = Miner(file_array, temp_output)
miner.scan_files()
miner.export()
ls

"miner.output = 'tem_dat.csv'"
'miner.output'

"""scrapethisfile = '24O19000982.html'
with open( HTML_FILE.format(case=scrapethisfile), 'r' ) as html_src:
    print( scrapethisfile);
    feature_list = scrape(html_src.read())
    print(feature_list);
    row = distribute(feature_list)
"""

Manual Test
soup = ''
scrapethisfile = '24O19000982.html'
with open( HTML_FILE.format(case=scrapethisfile), 'r' ) as html_src:
souptxt = html_src.read()
    soup = BeautifulSoup(souptxt, "html.parser")

# from settings import CHECKPOINT, HTML_DIR
from os import path, remove, walk
temp_output = "temp_dat.csv"
file_array = [filenames for (dirpath, dirnames, filenames) in walk(HTML_DIR)][0]
top5 = [ file_array[41] ]
miner = Miner(top5, temp_output)
miner.scan_files()

Cleaner
#export
"""Cleaner
This module implements post-scraping cleaning processes on the raw initial
dataset. Processes include stripping excess strings off Address values,
removing Zip Code and Partial Cost values mislabeled as Address, and merging
rows containing blank values in alternating features.
The script works as an internal module for Close Crawl, but can be executed
as a standalone to manually process datasets:
$ python cleaner.py
"""#export
from __future__ import absolute_import, print_function, unicode_literals
from pandas import DataFrame, concat, read_csv, to_datetime
# from patterns import NULL_ADDR, STRIP_ADDR, filter_addr, punctuation

#export
class Cleaner(object):
"""Class object for cleaning the raw dataset extracted after the initial
scraping
"""
def __init__(self, path):
"""Constructor for Cleaner
Args:
path (`str`): path to input CSV dataset
Attributes:
df (`pandas.core.frame.DataFrame`): initial DataFrame
columns (`list` of `str`): columns of the DataFrame
clean_df (`pandas.core.frame.DataFrame`): final DataFrame to be
outputted
"""
self.df = self.prettify(read_csv(path))
self.columns = list(self.df)
self.clean_df = []
@staticmethod
def prettify(df, internal=True):
"""Drops duplicates, sorts and fills missing values in the DataFrame
to make it manageable.
Args:
df (`pandas.core.frame.DataFrame`): DataFrame to be managed
internal (`bool`, optional): flag for determining state of
DataFrame
Returns:
df (`pandas.core.frame.DataFrame`): organized DataFrame
"""
df.drop_duplicates(inplace=True, keep='last', subset=["Case Number"] )
df["Filing Date"] = to_datetime(df["Filing Date"])
df.sort_values(
["Filing Date", "Case Number", "Address"],
ascending=[True] * 3,
inplace=True
)
if internal:
df["Zip Code"] = df["Zip Code"].fillna(0.0).astype(int)
df["Zip Code"] = df["Zip Code"].replace(0, '')
return df
def clean_addr(self):
"""Cleans excess strings off Address values and removes Zip Code and
Partial Cost values mislabeled as Address.
Args:
None
Returns:
None
"""
def clean_string(addr):
"""Applies regular expressions and other filters on Address
values
Args:
addr (`str`): Address value to be filtered
Returns:
addr (`str`): filtered Address value
"""
# if value does not match the street_address pattern
if not filter_addr(addr): # patterns.filter_addr
if NULL_ADDR.sub('', addr): # value may contain valid Address
return str(
STRIP_ADDR.sub(
'', addr) # strip off Zip Code and Partial Cost
).translate(
{ord(c): None for c in punctuation}
).strip() # strip off punctuations
return addr
print("Cleaning addresses...", end=" ")
self.df["Address"] = self.df["Address"].apply(
lambda x: clean_string(x)
)
self.df["Address"] = self.df["Address"].apply(
lambda x: NULL_ADDR.sub('', x)
)
# replace empty string values with NULL
self.df["Zip Code"] = self.df["Zip Code"].replace('', float("nan"))
self.df["Address"] = self.df["Address"].replace('', float("nan"))
print("Done")
@staticmethod
def combine_rows(row):
"""Merges rows after filtering out common values
Args:
row (`list` of `list` of `str`): groupby("Case Number") rows
Returns:
(`list` of `str`): merged row
"""
def __filter_tuple(col):
"""Filters common values from rows
Args:
col (`tuple` of `str`): values per column
Returns:
value (`str`): common value found per mergeable rows
"""
for value in set(col):
if value == value: # equivalent to value != NaN
return value
return [__filter_tuple(x) for x in zip(*row)]
@staticmethod
def mergeable(bool_vec):
"""Determines if groupby("Case Number") rows are mergeable
Example:
bool_vec = [
[True, True, True, True, True, True, False, True, True],
[True, True, True, True, True, True, True, False, False],
[True, True, True, True, True, True, False, False, False]
]
__sum_col(bool_vec) -> [3, 3, 3, 3, 3, 3, 1, 1, 1]
__bool_pat(__sum_col(bool_vec)) -> True
Args:
bool_vec (`list` of `bool`): represents non-NULL values
Returns:
(`bool`): True if rows are mergeable
"""
def __sum_col():
"""Sums columns
Args:
None
Returns:
(`list` of `int`): sum of columns
"""
return [sum(x) for x in zip(*bool_vec)]
def __bool_pat(row):
"""Determines mergeability
Args:
None
Returns:
(`bool`): True if rows are mergeable
"""
return set(row[-3:]) == set([1]) and set(row[:-3]) != set([1])
return True if __bool_pat(__sum_col()) else False
def merge_nulls(self):
"""Splits DataFrames into those with NULL values to be merged, and then
later merged with the original DataFrame
Args:
None
Returns:
None
"""
print("Merging rows...", end=" ")
print(self.df)
# filter out rows with any NULL values
origin_df = self.df.dropna()
# filter out rows only with NULL values
null_df = self.df[self.df.isnull().any(axis=1)]
# boolean representation of the DataFrame with NULL values
bool_df = null_df.notnull()
# (`list` of `dict` of `str` : `str`) to be converted to a DataFrame
new_df = []
for i in null_df["Case Number"].unique():
bool_row = bool_df[null_df["Case Number"] == i]
new_row = null_df[null_df["Case Number"] == i]
# if the rows are mergeable, combine them
if self.mergeable(bool_row.values):
new_row = self.combine_rows(new_row.values.tolist())
new_df.append(
{
feature: value
for feature, value in zip(self.columns, new_row)
}
)
# else, treat them individually
else:
new_row = new_row.values.tolist()
for row in new_row:
new_df.append(
{
feature: value
for feature, value in zip(self.columns, row)
}
)
# merge the DataFrames back
self.clean_df = concat(
[origin_df, DataFrame(new_df)]
).reset_index(drop=True)
# prettify the new DataFrame
self.clean_df = self.prettify(
self.clean_df[self.columns], internal=False
)
print("Done")
def init_clean(self):
"""Initializes cleaning process
Args:
None
Returns:
None
"""
print(self.df)
self.clean_addr()
print(self.df)
self.merge_nulls()
def download(self, output_name):
"""Downloads the cleaned and manipulated DataFrame into a CSV file
Args:
output_name (`str`): path of the new output file
Returns:
None
"""
self.clean_df.to_csv(output_name, index=False)
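As a quick check of the merge logic, the bool_vec example from the mergeable() docstring can be run directly (mergeable is a staticmethod, so no Cleaner instance is needed):
# Reproducing the docstring example for Cleaner.mergeable().
bool_vec = [
    [True, True, True, True, True, True, False, True, True],
    [True, True, True, True, True, True, True, False, False],
    [True, True, True, True, True, True, False, False, False],
]
print(Cleaner.mergeable(bool_vec))  # True: column sums are [3, 3, 3, 3, 3, 3, 1, 1, 1]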
#export
temp_output = "temp_data.csv"
output = 'outputfile.csv'
df_obj = Cleaner(temp_output)
df_obj.init_clean()
df_obj.download(output)

Main
#export
"""main
The main executable script for Close Crawl. This file manages types, flags
and constraints for the case type, year and output data file.
Usage:
$ python main.py
Example usage:
$ python main.py O 2015 test_set.csv -l=300 -u=600 -d=1
"""#export
from __future__ import absolute_import, print_function, unicode_literals
from json import dump, dumps, load
from os import path, remove, walk
from shutil import rmtree
from time import time
from cleaner import Cleaner
from miner import Miner
from settings import CHECKPOINT, HTML_DIR
from spider import Spider

#export
def close_crawl(case_type, case_year, output, cases='', lower_bound=0,
upper_bound=0, debug=False, scrape=True, mine=True,
clean=True):
"""Main function for Close Crawl.
Args:
case_type (`str`): type of foreclosure case, options are 'O' and 'C'
case_year (`str`): year of foreclosure cases
output (`str`): path of the output CSV file, along with the valid
extension (.csv)
lower_bound (`int`, optional): lower bound of range of cases
upper_bound (`int`, optional): upper bound of range of cases
        debug (`bool`, optional): option for switching debug mode on or off.
            Default -> False
Returns:
None
"""
temp_output = "temp_data.csv"
wait = 0
case_list = []
print('checkpoint')
if not path.isfile(CHECKPOINT):
print("Initializing project...")
with open(CHECKPOINT, "w") as checkpoint:
dump(
{
"last_case": "{:04d}".format(int(str(lower_bound)[-4:])),
"type": case_type,
"year": case_year[-2:],
"error_case": '',
},
checkpoint
)
print('cases')
if not cases:
with open(CHECKPOINT) as checkpoint:
prev_bound = int(load(checkpoint)["last_case"])
if not lower_bound:
lower_bound = prev_bound
        upper_bound = upper_bound if int(upper_bound) > int(lower_bound) \
            else str(int(lower_bound) + 5)
case_list = range(int(lower_bound), int(upper_bound) + 1)
else:
with open(cases) as manual_cases:
case_list = sorted(list(set(load(manual_cases))))
print('scrape')
if scrape:
spider = Spider(
case_type=case_type, year=case_year[-2:],
bounds=case_list, gui=False
)
spider.save_response()
        wait = spider.WAITING_TIME
    end_crawl = time()
    file_array = [filenames for (dirpath, dirnames, filenames)
                  in walk(HTML_DIR)][0]
    start_mine = time()
print('mine')
if mine:
miner = Miner(file_array, temp_output)
miner.scan_files()
        miner.export()
    end_mine = time()
    print('clean')
if clean:
df_obj = Cleaner(temp_output)
df_obj.init_clean()
df_obj.download(output)
print('save')
with open(CHECKPOINT, "r+") as checkpoint:
checkpoint_data = load(checkpoint)
checkpoint_data["last_case"] = sorted(file_array)[-1].split('.')[0][-4:]
checkpoint.seek(0)
checkpoint.write(dumps(checkpoint_data))
        checkpoint.truncate()
    end = time()
    print("Crawling runtime: {0:.2f} s".format((end_crawl - start_crawl)))
print(
"Downloading runtime: {0:.2f} s".format(
((end_crawl - start_crawl) - wait))
)
print("Mining runtime: {0:.2f} s".format((end_mine - start_mine)))
print("Program runtime: {0:.2f} s".format((end - start)))
print("------------ SCRAPING COMPLETED ------------")
#export
args = {'type': 'C', 'year': '2014', 'output': 'outputfile.csv', 'file': '', 'lower': '1000', 'upper': '2000', 'debug': '0', 'scrape': True, 'mine': True, 'clean': True}
close_crawl(
case_type=args["type"], case_year=args["year"], output=args["output"],
cases=args["file"], lower_bound=args["lower"],
upper_bound=args["upper"], debug=args["debug"],
scrape=args["scrape"], mine=args["mine"],
clean=args["clean"]
)

CLI
ls
cd ../
! python cliargs.py -l=50 -u=3500 -d -s -m C 2016 output.csv
ls

#export
"""cliargs
The main command line script for Close Crawl. This file manages types, flags
and constraints for the case type, year and output data file as well as the
processing options.
Parameters:
{O,C} | Type of foreclosure cases
year | Year of foreclosure cases
output | Path of output file
Optional parameters:
-h, --help | Show this help message and exit
-v, --version| Show program's version number and exit
-l, --lower | Lower bound of range of cases
-u, --upper | Upper bound of range of cases
-f, --file | Path of JSON array of cases
-d, --debug | Debug mode
-s, --scrape | Scrape only
-m, --mine | Mine only
-c, --clean | Clean only
Usage:
$ python cliargs.py [-h] [-v] [-l] [-u] [-f] [-d] [-s] [-m] [-c]
{O,C} year output
Example usages:
$ python cliargs.py -l=50 -u=3500 -d -s -m C 2016 output.csv
$ python cliargs.py -f="cases_to_scrape.json" -d O 2014 output01.csv
"""#export
from __future__ import absolute_import, print_function, unicode_literals
import sys
from textwrap import dedent
from modules import main

#export
#!/usr/bin/env python
# -*- coding: utf-8 -*-
if __name__ == "__main__":
args = {}
args["type"] = input("Enter type of case (1 char: {C, O}): ")
args["year"] = input("Enter year of case (4 digit int): ")
args["output"] = input("Enter name of output file (CSV file path): ")
opt = int(input("Enter 0 for manual parameters or 1 for automatic: "))
if bool(opt):
args["file"] = input("Enter name of cases file (JSON file path): ")
args["lower"] = args["upper"] = 0
else:
args["file"] = ""
args["lower"] = input("Enter lower bound of cases (1-4 digit int): ")
args["upper"] = input("Enter upper bound of cases (1-4 digit int): ")
args["debug"] = input(
"Enter 0 for default mode, 1 for debug (1 digit int): "
)
print(
dedent(
"""Processing options:\n\n"""
"""For the following options, enter 0 to disable or 1 to enable."""
"""\nNOTE: The script will exit if all but the mining step is """
"""enabled - data cannot be cleaned without being mined first."""
)
)
args["scrape"] = bool(input("Scrape: {0, 1}: "))
args["mine"] = bool(input("Mine: {0, 1}: "))
args["clean"] = bool(input("Clean: {0, 1}: "))
# exit script if all but the mining step is enabled
if (args["scrape"] and not(args["mine"]) and args["clean"]):
sys.exit("\nData cannot be cleaned without being mined first.")
print (args)
main.close_crawl(
case_type=args["type"], case_year=args["year"], output=args["output"],
cases=args["file"], lower_bound=args["lower"],
upper_bound=args["upper"], debug=args["debug"],
scrape=args["scrape"], mine=args["mine"],
clean=args["clean"]
)
Test
ls
! python cliargs.py -h
! python cliargs.py -d -m -c C 2019 output.csv
! python cliargs.py -l=9500 -u=9999 -c -s -d -m O 2019 output.csv
ls

If both packages are in your import path (sys.path), and the module/class you want is in example/example.py, then to access the class without a relative import try: from example.example import fkt
if __name__ == '__main__':
    from mymodule import as_int
else:
    from .mymodule import as_int
https://stackoverflow.com/questions/448271/what-is-init-py-for
Files named __init__.py are used to mark directories on disk as Python package directories. If you have the files
mydir/spam/__init__.py
mydir/spam/module.py
and mydir is on your path, you can import the code in module.py as
import spam.module
or
from spam import module