# default_exp closecrawl

Intro

The pipeline runs CLI (CLIARGS) > MAIN > SPIDER > MINER > CLEANER.

To calculate foreclosures we use mortgage foreclosures (case type O) as opposed to tax foreclosures (case type C).

We have only ever looked at tax foreclosures for specific projects like BOLD.
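
The Main section below wires these stages together through a single close_crawl() function. As a quick orientation, here is a minimal, illustrative sketch of a full run; it assumes the Main section has already been executed so that close_crawl() is defined, and the year, bounds, and output file name are placeholders only:

close_crawl(
    case_type='O',               # 'O' = mortgage foreclosure, 'C' = tax sale foreclosure
    case_year='2019',            # placeholder year
    output='mortgage_2019.csv',  # placeholder output CSV path
    lower_bound=1,
    upper_bound=1500,
    scrape=True,                 # Spider: download case HTML from MD Case Search
    mine=True,                   # Miner: extract features from the saved HTML
    clean=True                   # Cleaner: post-process the raw CSV
)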

Instructions for obtaining foreclosure case data from Maryland Case Search with the web scraper (these instructions apply to the scraper designed in 2017 and onward):


Go to GitHub and download the files to install the web scraper on your desktop (or run it directly from GitHub): https://github.com/BNIA/Close-Crawl

The usage interface is described here: https://github.com/BNIA/Close-Crawl/blob/master/docs/cli/README.md

  1. Enter Case type: C for tax sale foreclosures, O for mortgage foreclosures

  2. Enter year: (Year must be 4 digits)

  3. Enter the name of the output file (must be a CSV file path; it will be saved to your desktop)

  4. Enter 0 for manual parameters or 1 for automatic (0 is recommended)

  5. Enter the lower bound for the cases to scrape (1-4 digit integer). A. If you are scraping cases from a year for the first time, start with "1". B. In all other scenarios, look at the last data scraped in the raw folder of that dataset: find the case number of the last case that was scraped, take its last four digits, and add one. For example, if the last case scraped was case number 24O17002002, make the lower bound 2003 (see the sketch at the end of these instructions).

  6. Enter the upper bound (1-4 digit integer). It is recommended not to scrape too many cases at a time, since this sometimes results in errors; scrape about 1,500 at a time, so the upper bound should be about 1,500 higher than your lower bound. A. Sometimes it is difficult to know how far to scrape because you don't know how many total cases there are for a given year. One suggestion is to look at the filing dates of the cases you have scraped: if you have scraped cases up to December 30 of that year, then you have scraped all of the cases. To be safe, put in a higher upper bound than the last case for December; if nothing comes back, you will know you got them all. This can be a bit of trial and error because there may be big gaps in the case numbers for the foreclosure cases we want to scrape.

  7. Enter 1 for debug.

  8. Enter 1 for scrape.

  9. Enter 1 for mine.

 10. Enter 1 for clean.

Save the completed file in the appropriate folder for raw data - make sure mortgage foreclosure data is saved with mortgage data and tax with tax. There will be additional files saved to your desktop during the scraping process; these can be moved to the trash after the raw file is saved.

*NOTE: If you are scraping a lot of data, be sure to give each file path a unique name; saving multiple scraping sessions under the same CSV name will cause errors.
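
As referenced in step 5, the next lower bound can be read off the last case number in the previous raw file. Below is a small illustrative helper (not part of Close Crawl itself) that does this arithmetic, based on the case-number pattern 24{type}{year}00{num} used in the settings module:

def next_lower_bound(last_case_number):
    # Case numbers follow 24{type}{year}00{num}; the final four digits are the
    # sequential case number, so the next lower bound is that value plus one.
    return int(last_case_number[-4:]) + 1

print(next_lower_bound("24O17002002"))  # prints 2003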

Background & Resources

MD Access to Judicial Records (mdcourts)

Other Helpful search tips:

http://www.registers.state.md.us. EstateSearch provides public Internet access to information from estate records maintained by the Maryland Registers of Wills. This information includes decedent’s name, estate number and status, date of death, date of filing, personal representative, attorney, decedent alias, and docket history. All records are available from 1998 to present for all estates filed with the Register of Wills’ office. The data is updated daily at the end of the business day.

Use % as a wildcard when searching in a field: Smith% would return all names that start with Smith (Smithson, Smithsburg, Smithman, etc.). When searching for a date range you need to enter a last name or first name (partials allowed). You can sort the columns by clicking on the column header. Click the "Search again" option to return to your previous search criteria. Use the clear button to clear all fields and begin your search again.

List of Estate Types

  • Regular Estate (RE) - Assets subject to administration in excess of $30,000 ($50,000 if the spouse is the sole legatee or heir).
  • Regular Estate Judicial (RJ) - A proceeding conducted by the Orphans' Court when matters cannot be handled administratively. For example, when the validity of the will is at issue, or the will is lost, stolen or damaged.
  • Small Estate (SE) - Assets subject to administration valued at $30,000 or less ($50,000 if the spouse is the sole legatee or heir).
  • Small Estate Judicial (SJ) - A proceeding conducted by the Orphans' Court when matters cannot be handled administratively. For example, when the validity of the will is at issue, or the will is lost, stolen or damaged.
  • Foreign Proceeding (FP) - Decedent domiciled out of state with real property in Maryland.
  • Motor Vehicle (MV) - Transfer of motor vehicle only.
  • NonProbate (NP) - Property of the decedent which passes by operation of law such as a joint tenancy, tenants by the entireties, or property passing under a deed or trust, revocable or irrevocable. Non-probate property must be reported to the Register of Wills on the Information Report or Application to Fix Inheritance Tax on Non-Probate Assets.
  • Unprobated Will Only (UN) - Will and Information Report filed with will and/or Application to Fix Inheritance Tax.
  • Modified Administration (MA) - A procedure available when the residual legatees consist of the personal representative, spouse, and children; the estate is solvent; and final distribution can occur within 12 months from the date of appointment. A verified final report is filed within 10 months from the date of appointment.
  • Guardianship Estate (GE) - Guardianship of property for a minor.
  • Limited Order (LO) – A limited order to locate assets or a will.

Setup

! pip install mechanicalsoup
! pip install urlopen

#export
import mechanicalsoup
mechanicalsoup.__version__

from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive

cd MyDrive/vitalSigns/close_crawl

ls

# For relative imports to work in Python 3.6
import os, sys
os.path.realpath('./')
sys.path.append(os.path.dirname(os.path.realpath('/MyDrive/vitalSigns/close_crawl/modules/cleaner.py')))
sys.path

Scrape

#export """settings Configuration settings and global variables for the entire project. This file is intended to only be used as a non-executable script. """ from os import path # browser settings HEADER = ("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1)" " Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1") URL = "http://casesearch.courts.state.md.us/casesearch/" CASE_PAT = "24{type}{year}00{num}" # scraping parameters CASE_TYPES = ('O', 'C') # temporary directory settings HTML_DIR = "responses" HTML_FILE = path.join(HTML_DIR, "{case}") # log file settings CHECKPOINT = "checkpoint.json" NO_CASE = "no_case.json" # data mining settings FEATURES = [ "Filing Date", "Case Number", "Case Type", "Title", "Plaintiff", "Defendant", "Address", "Business or Organization Name", "Party Type", ] FIELDS = FEATURES + [ "Zip Code", "Partial Cost" ] INTERNAL_FIELDS = [ "Business or Organization Name", "Party Type"]

Local Browser

#export """local_browser NOTICE: Close Crawl runs its browser form submissions through Mechanize. The module, however, is deprecated and does not support Python 3. The more stable and maintained Mechanize and BeautifulSoup wrapper, MechanicalSoup, will be replacing the Mechanize methods to support Python 3. This module contains the configurations and settings for the browser used for crawling and scraping through the pages in Close Crawl. The script contains the implementation of the Session class which inherits attributes from the classobj mechanize.Browser() The script works as an internal module for Close Crawl, but can be imported as a module for testing purposes. TODO: Replace deprecated Mechanize with MechanicalSoup Fork Mechanize to support Python3 """#export from __future__ import absolute_import, print_function, unicode_literals # import cookielib import http.cookiejar as cookielib # for Python3 import warnings from urllib.request import urlopen # from urllib import urlopen urllib.request ## from mechanize import Browser, _http import mechanicalsoup # from settings import HEADER, URL warnings.filterwarnings("ignore", category=UserWarning)#export class Session(object): def __init__(self): """Constructor Args: None Attributes: browser (`mechanize._mechanize.Browser`): browser object in session """ self.browser = mechanicalsoup.StatefulBrowser() # set error and debug handlers for the browser # cookie jar self.browser.set_cookiejar(cookielib.LWPCookieJar()) # browser options # self.browser.set_handle_equiv(True) # self.browser.set_handle_gzip(True) # self.browser.set_handle_redirect(True) # self.browser.set_handle_referer(True) # self.browser.set_handle_robots(False) # follows refresh 0 but doesn't hang on refresh > 0 #self.browser.set_handle_refresh( _http.HTTPRefreshProcessor(), max_time=1 ) # user-Agent # self.browser.addheaders = [("User-agent", HEADER)] def close(self): """Destructor for Session. Closes current browser session Args: None Returns: None """ self.browser.close() def case_id_form(self, case): """Grabs the form in the case searching page, and inputs the case number to return the response. 
Args: case (`str`): case ID to be scraped Returns: response (`str`): HTML response """ # iterate through the forms to find the correct one #for form in self.browser.forms(): # if form.attrs["name"] == "inquiryFormByCaseNum": # self.browser.form = form # break self.browser.select_form('form[action="/casesearch/inquiryByCaseNum.jis"]') # submit case ID and return the response self.browser["caseId"] = case response = self.browser.submit_selected() response = response.text # if any( case_type in response.upper() for case_type in ("FORECLOSURE", "FORECLOSURE RIGHTS OF REDEMPTION", "MOTOR TORT") ): print (response.upper) self.browser.open("http://casesearch.courts.state.md.us/casesearch/inquiryByCaseNum.jis") # , "MOTOR TORT" return response if any( case_type in response.upper() for case_type in ("FORECLOSURE", "FORECLOSURE RIGHTS OF REDEMPTION") ) else False def disclaimer_form(self): """Navigates to the URL to proceed to the case searching page Args: None Returns: None """ # visit the site print(URL) self.browser.open("http://casesearch.courts.state.md.us/casesearch/") # select the only form on the page self.browser.select_form('form') # select the checkbox self.browser["disclaimer"] = ['Y'] # submit the form self.browser.submit_selected() @staticmethod def server_running(): """Checks the status of the Casesearch servers Args: None Returns: `True` if server is up, `False` otherwise """ return urlopen(URL).getcode() == 200

Spider

#export
from __future__ import absolute_import, print_function, unicode_literals

from json import dumps, load
from os import path, makedirs
from random import uniform
import sys
from time import sleep

from tqdm import trange

# from local_browser import Session
# from settings import CASE_PAT, CHECKPOINT, HTML_DIR, HTML_FILE

#export
class Spider(object):

    def __init__(self, case_type, year, bounds=range(1, 6), gui=False):
        # initial disclaimer page for terms and agreements
        self.browser = Session()

        if not self.browser.server_running():
            sys.exit("Server is unavailable at the moment")

        print(self.browser)
        self.browser.disclaimer_form()

        self.WAITING_TIME = 0
        self.case_type = case_type
        self.year = year
        self.bounds = bounds

        if not path.exists(HTML_DIR):
            makedirs(HTML_DIR)

    def save_response(self):
        case_range = trange(len(self.bounds), desc="Crawling", leave=True)

        for case_num in case_range:
            if case_num and not case_num % 500:
                print("500 CASES SCRAPED. SCRIPT WILL WAIT 5 MINUTES TO RESUME")
                for i in range(150, 0, -1):
                    sleep(1)
                    sys.stdout.write('\r' + "%02d:%02d" % divmod(i, 60))
                    sys.stdout.flush()

            case = CASE_PAT.format(
                type=self.case_type,
                year=self.year,
                num="{:04d}".format(int(str(self.bounds[case_num])[-4:]))
            )

            try:
                wait = uniform(0.0, 0.5)
                sleep(wait)
                self.WAITING_TIME += wait
                case_range.set_description("Crawling {}".format(case))

                stripped_html = self.browser.case_id_form(case)
                # print('returned this', stripped_html)

                if stripped_html:
                    with open(
                        HTML_FILE.format(case=case) + ".html", 'w'
                    ) as case_file:
                        case_file.write(str(stripped_html))

            # pause process
            except KeyboardInterrupt:
                self.dump_json({
                    "error_case": "{:04d}".format(
                        int(str(self.bounds[case_num])[-4:])
                    ),
                    "year": self.year,
                    "type": self.case_type
                })
                print("Crawling paused at", case)
                break

            # case does not exist
            except IndexError:
                self.dump_json({"error_case": case})
                print(case, "does not exist")
                break

        # close browser and end session
        self.close_sesh()

    @staticmethod
    def dump_json(data):
        with open(CHECKPOINT, "r+") as checkpoint:
            checkpoint_data = load(checkpoint)

            for key, val in data.items():
                checkpoint_data[key] = val

            checkpoint.seek(0)
            checkpoint.write(dumps(checkpoint_data))
            checkpoint.truncate()

    def close_sesh(self):
        self.browser.close()

Run the spider. This crawls the configured range of case numbers and saves each matching case's HTML response into the responses folder.

#export
scrape = True

if scrape:
    spider = Spider(
        case_type='O',
        year=20,
        bounds=range(0000, 9000),
        gui=False
    )
    spider.save_response()
    wait = spider.WAITING_TIME

Miner

Patterns

#export """Patterns Regular expression patterns and string filtering functions implemented in the project. This file is intended to only be used as a non-executable script. TODO: Finish docs (\d{1,4}\s[\w\s]{1,20}((?:st(reet)?|ln|lane|ave(nue)?|r(?:oa)?d|highway|hwy| dr(?:ive)?|sq(uare)?|tr(?:ai)l|c(?:our)?t|parkway|pkwy|cir(cle)?|ter(?:race)?| boulevard|blvd|pl(?:ace)?)\W?(?=\s|$))(\s(apt|block|unit)\W?([A-Z]|\d+))?) """#export from re import compile as re_compile from re import I as IGNORECASE from string import punctuation#export PUNCTUATION = punctuation.replace('#', '') # all punctuations except '#' street_address = re_compile( "(" # begin regex group "\d{1,4}\s" # house number "[\w\s]{1,20}" # street name "(" # start street type group "(?:st(reet)?|ln|lane|ave(nue)?" # (st)reet, lane, ln, (ave)nue "|r(?:oa)?d|highway|hwy|dr(?:ive)?" # rd, road, hwy, highway, (dr)ive "|sq(uare)?|tr(?:ai)l|c(?:our)?t" # (sq)uare, (tr)ail, ct, court "|parkway|pkwy|cir(cle)?|ter(?:race)?" # parkway, pkwy, (cir)cle, (ter)race "|boulevard|blvd|pl(?:ace)?" # boulevard, bvld, (pl)ace "\W?(?=\s|$))" # look ahead for whitespace or end of string ")" # end street type group "(\s(apt|block|unit)(\W|#)?([\d|\D|#-|\W])+)?" # apt, block, unit number ")", # end regex group IGNORECASE # case insensitive flag ) # case insensitive delimiter for Titles TITLE_SPLIT_PAT = re_compile(" vs ", IGNORECASE) # pattern for Baltimore zip codes ZIP_STR = "2\d{4}" ZIP_PAT = re_compile(ZIP_STR) # regex pattern to capture monetary values between $0.00 and $999,999,999.99 # punctuation insensitive MONEY_STR = "\$\d{,3},?\d{,3},?\d{,3}\.?\d{2}" MONEY_PAT = re_compile(MONEY_STR) NULL_ADDR = re_compile( "^(" "(" + MONEY_STR + ")" "|(" + ZIP_STR + ")" "|(\d+)" "|(" + ZIP_STR + ".*" + MONEY_STR + ")" ")$", IGNORECASE ) STRIP_ADDR = re_compile( "(balto|" + ZIP_STR + "|md|" + MONEY_STR + ").*", IGNORECASE ) def filter_addr(address): try: return ''.join( street_address.search( address.translate( str.maketrans('','', PUNCTUATION) ) ).group(0) ) except AttributeError: return ''

Miner

#export """Miner""" from __future__ import absolute_import, print_function, unicode_literals from csv import DictWriter from json import dump, dumps, load from os import path from bs4 import BeautifulSoup from tqdm import trange # from patterns import MONEY_PAT, TITLE_SPLIT_PAT, ZIP_PAT, filter_addr # from settings import HTML_FILE, NO_CASE#export # data mining settings FEATURES = [ "Filing Date", "Case Number", "Case Type", "Title", "Plaintiff", "Defendant", "Address", "Business or Organization Name", "Party Type", ] FIELDS = FEATURES + [ "Zip Code", "Partial Cost" ] INTERNAL_FIELDS = [ "Business or Organization Name", "Party Type"] class Miner(object): def __init__(self, responses, output, debug=False): self.responses = responses self.output = output self.debug = debug self.dataset = [] self.maybe_tax = False self.features = [i + ':' for i in FEATURES] def scan_files(self): case_range = trange(len(self.responses), desc="Mining", leave=True) \ if not self.debug else range(len(self.responses)) for file_name in case_range: with open( HTML_FILE.format(case=self.responses[file_name]), 'r' ) as html_src: if not self.debug: case_range.set_description( "Mining {}".format(self.responses[file_name]) ) feature_list = self.scrape(html_src.read()) row = self.distribute(feature_list) if not row: if not path.isfile(NO_CASE): with open(NO_CASE, 'w') as no_case_file: dump([], no_case_file) with open(NO_CASE, "r+") as no_case_file: no_case_data = load(no_case_file) no_case_data.append(str(self.responses[file_name][:-5])) no_case_file.seek(0) no_case_file.write(dumps(sorted(set(no_case_data)))) no_case_file.truncate() self.dataset.extend(row) def export(self): file_exists = path.isfile(self.output) with open(self.output, 'a') as csv_file: writer = DictWriter( csv_file, fieldnames=[ col for col in FIELDS if col not in INTERNAL_FIELDS ] ) if not file_exists: writer.writeheader() for row in self.dataset: writer.writerow(row) def scrape(self, html_data): """Scrapes the desired features Args: html_data: , source HTML Returns: scraped_features: , features scraped and mapped from content """ soup = BeautifulSoup(html_data, "html.parser") # Search for the word 'tax in the document' if "tax" in soup.text.lower(): self.maybe_tax = True # Create an array from all TR's with the inner HTML for each TR # Data we want is stored inside an arbitrary # of 'span' tags inside the TR's.\ tr_list = soup.find_all("tr") # This will create an array for each TR with an array of SPAN values inside. feature_list = [] for tag in tr_list: try: # Create an innerhtml array for all spans within a single TR tag = [j.string for j in tag.findAll("span")] if set(tuple(tag)) & set(self.features): try: # Save the spans inner HTML if its not a header label tag = [i for i in tag if "(each" not in i.lower()] except AttributeError: continue feature_list.append(tag) except IndexError: continue # feature_list is an array [tr] of arrays [spans]. we want this flattened. 
# [tr1span1KEY, tr1span1VALUE, tr1span2KEY, tr1span2VALUE, tr2span1KEY, tr2span1VALUE, ] try: # flatten multidimensional list feature_list = [ item.replace(':', '') for sublist in feature_list for item in sublist ] except AttributeError: pass return feature_list def distribute(self, feature_list): # feature_list ~= [html][tr][spans].innterHTML # [tr1span1KEY, tr1span1VALUE, tr1span2KEY, tr1span2VALUE, tr2span1KEY, tr2span1VALUE, ] def __pair(list_type): # break up elements with n-tuples greater than 2 def __raw_business(i): return any( x in feature_list[i:i + 2][0] for x in INTERNAL_FIELDS ) def __feature_list(i): return feature_list[i:i + 2][0] in FEATURES condition = __raw_business if list_type else __feature_list # then convert list of tuples to dict for faster lookup return [ tuple(feature_list[i:i + 2]) for i in range(0, len(feature_list), 2) if condition(i) ] raw_business = __pair(1) # [(x1,y1),(x2,y2),(x3,y3)] => INTERNAL_FIELDS feature_list = dict(__pair(0)) # FEATURES filtered_business = [] # Party_Type = 'property address' not 'plaintiff' or 'defendant' # Input exists 'Business or Org Name' and == an Address for label, value in enumerate(raw_business): try: party_type = value[1].upper() section = raw_business[label + 1][0].upper() flag1 = party_type == "PROPERTY ADDRESS" flag2 = section == "BUSINESS OR ORGANIZATION NAME" if flag1 and flag2: filtered_business.append(raw_business[label + 1]) except IndexError: print("Party Type issue at Case", feature_list["Case Number"]) scraped_features = [] for address in filtered_business: str_address = filter_addr(str(address[-1])) temp_features = { key: value for key, value in feature_list.items() if key in ["Title", "Case Type", "Case Number", "Filing Date"] } if temp_features["Case Type"].upper() == "FORECLOSURE": temp_features["Case Type"] = "Mortgage" elif temp_features["Case Type"].upper() == \ "FORECLOSURE RIGHTS OF REDEMPTION" and self.maybe_tax: temp_features["Case Type"] = "Tax" else: # break out of the rest of the loop if case type is neither continue if 'Title' not in temp_features: # print('feature_list'); # print(feature_list); # print('\n \n raw_business'); # print(raw_business); continue # break up Title feature into Plaintiff and Defendant try: temp_features["Plaintiff"], temp_features["Defendant"] = \ TITLE_SPLIT_PAT.split(temp_features["Title"]) except ValueError: temp_features["Plaintiff"], temp_features["Defendant"] = \ (", ") temp_features["Address"] = \ str_address if str_address else address[-1] temp_features["Zip Code"] = ''.join(ZIP_PAT.findall(address[-1])) temp_features["Partial Cost"] = ''.join( MONEY_PAT.findall(address[-1]) ) scraped_features.append(temp_features) temp_features = {} return scraped_features

Now take each HTML file from the responses folder and mine it for data.

#export
# from settings import CHECKPOINT, HTML_DIR
from os import path, remove, walk

temp_output = "temp_data.csv"

file_array = [filenames for (dirpath, dirnames, filenames) in walk(HTML_DIR)][0]
top5 = file_array[:1]

miner = Miner(file_array, temp_output)
miner.scan_files()
miner.export()

ls

# leftover scratch from manual testing:
# miner.output = 'tem_dat.csv'
# miner.output
"""
scrapethisfile = '24O19000982.html'
with open(HTML_FILE.format(case=scrapethisfile), 'r') as html_src:
    print(scrapethisfile)
    feature_list = scrape(html_src.read())
    print(feature_list)
    row = distribute(feature_list)
"""

Manual Test

soup = ''
scrapethisfile = '24O19000982.html'

with open(HTML_FILE.format(case=scrapethisfile), 'r') as html_src:
    souptxt = html_src.read()
    soup = BeautifulSoup(souptxt, "html.parser")

# from settings import CHECKPOINT, HTML_DIR
from os import path, remove, walk

temp_output = "temp_dat.csv"

file_array = [filenames for (dirpath, dirnames, filenames) in walk(HTML_DIR)][0]
top5 = [file_array[41]]

miner = Miner(top5, temp_output)
miner.scan_files()

Cleaner

#export """Cleaner This module implements post-scraping cleaning processes on the raw initial dataset. Processes include stripping excess strings off Address values, removing Zip Code and Partial Cost values mislabeled as Address, and merging rows containing blank values in alternating features. The script works as an internal module for Close Crawl, but can be executed as a standalone to manually process datasets: $ python cleaner.py """#export from __future__ import absolute_import, print_function, unicode_literals from pandas import DataFrame, concat, read_csv, to_datetime # from patterns import NULL_ADDR, STRIP_ADDR, filter_addr, punctuation#export class Cleaner(object): """Class object for cleaning the raw dataset extracted after the initial scraping """ def __init__(self, path): """Constructor for Cleaner Args: path (`str`): path to input CSV dataset Attributes: df (`pandas.core.frame.DataFrame`): initial DataFrame columns (`list` of `str`): columns of the DataFrame clean_df (`pandas.core.frame.DataFrame`): final DataFrame to be outputted """ self.df = self.prettify(read_csv(path)) self.columns = list(self.df) self.clean_df = [] @staticmethod def prettify(df, internal=True): """Drops duplicates, sorts and fills missing values in the DataFrame to make it manageable. Args: df (`pandas.core.frame.DataFrame`): DataFrame to be managed internal (`bool`, optional): flag for determining state of DataFrame Returns: df (`pandas.core.frame.DataFrame`): organized DataFrame """ df.drop_duplicates(inplace=True, keep='last', subset=["Case Number"] ) df["Filing Date"] = to_datetime(df["Filing Date"]) df.sort_values( ["Filing Date", "Case Number", "Address"], ascending=[True] * 3, inplace=True ) if internal: df["Zip Code"] = df["Zip Code"].fillna(0.0).astype(int) df["Zip Code"] = df["Zip Code"].replace(0, '') return df def clean_addr(self): """Cleans excess strings off Address values and removes Zip Code and Partial Cost values mislabeled as Address. 
Args: None Returns: None """ def clean_string(addr): """Applies regular expressions and other filters on Address values Args: addr (`str`): Address value to be filtered Returns: addr (`str`): filtered Address value """ # if value does not match the street_address pattern if not filter_addr(addr): # patterns.filter_addr if NULL_ADDR.sub('', addr): # value may contain valid Address return str( STRIP_ADDR.sub( '', addr) # strip off Zip Code and Partial Cost ).translate( {ord(c): None for c in punctuation} ).strip() # strip off punctuations return addr print("Cleaning addresses...", end=" ") self.df["Address"] = self.df["Address"].apply( lambda x: clean_string(x) ) self.df["Address"] = self.df["Address"].apply( lambda x: NULL_ADDR.sub('', x) ) # replace empty string values with NULL self.df["Zip Code"] = self.df["Zip Code"].replace('', float("nan")) self.df["Address"] = self.df["Address"].replace('', float("nan")) print("Done") @staticmethod def combine_rows(row): """Merges rows after filtering out common values Args: row (`list` of `list` of `str`): groupby("Case Number") rows Returns: (`list` of `str`): merged row """ def __filter_tuple(col): """Filters common values from rows Args: col (`tuple` of `str`): values per column Returns: value (`str`): common value found per mergeable rows """ for value in set(col): if value == value: # equivalent to value != NaN return value return [__filter_tuple(x) for x in zip(*row)] @staticmethod def mergeable(bool_vec): """Determines if groupby("Case Number") rows are mergeable Example: bool_vec = [ [True, True, True, True, True, True, False, True, True], [True, True, True, True, True, True, True, False, False], [True, True, True, True, True, True, False, False, False] ] __sum_col(bool_vec) -> [3, 3, 3, 3, 3, 3, 1, 1, 1] __bool_pat(__sum_col(bool_vec)) -> True Args: bool_vec (`list` of `bool`): represents non-NULL values Returns: (`bool`): True if rows are mergeable """ def __sum_col(): """Sums columns Args: None Returns: (`list` of `int`): sum of columns """ return [sum(x) for x in zip(*bool_vec)] def __bool_pat(row): """Determines mergeability Args: None Returns: (`bool`): True if rows are mergeable """ return set(row[-3:]) == set([1]) and set(row[:-3]) != set([1]) return True if __bool_pat(__sum_col()) else False def merge_nulls(self): """Splits DataFrames into those with NULL values to be merged, and then later merged with the original DataFrame Args: None Returns: None """ print("Merging rows...", end=" ") print(self.df) # filter out rows with any NULL values origin_df = self.df.dropna() # filter out rows only with NULL values null_df = self.df[self.df.isnull().any(axis=1)] # boolean representation of the DataFrame with NULL values bool_df = null_df.notnull() # (`list` of `dict` of `str` : `str`) to be converted to a DataFrame new_df = [] for i in null_df["Case Number"].unique(): bool_row = bool_df[null_df["Case Number"] == i] new_row = null_df[null_df["Case Number"] == i] # if the rows are mergeable, combine them if self.mergeable(bool_row.values): new_row = self.combine_rows(new_row.values.tolist()) new_df.append( { feature: value for feature, value in zip(self.columns, new_row) } ) # else, treat them individually else: new_row = new_row.values.tolist() for row in new_row: new_df.append( { feature: value for feature, value in zip(self.columns, row) } ) # merge the DataFrames back self.clean_df = concat( [origin_df, DataFrame(new_df)] ).reset_index(drop=True) # prettify the new DataFrame self.clean_df = self.prettify( self.clean_df[self.columns], 
internal=False ) print("Done") def init_clean(self): """Initializes cleaning process Args: None Returns: None """ print(self.df) self.clean_addr() print(self.df) self.merge_nulls() def download(self, output_name): """Downloads the cleaned and manipulated DataFrame into a CSV file Args: output_name (`str`): path of the new output file Returns: None """ self.clean_df.to_csv(output_name, index=False) #export temp_output = "temp_data.csv" output = 'outputfile.csv' df_obj = Cleaner(temp_output) df_obj.init_clean() df_obj.download(output)

Main

#export """main The main executable script for Close Crawl. This file manages types, flags and constraints for the case type, year and output data file. Usage: $ python main.py Example usage: $ python main.py O 2015 test_set.csv -l=300 -u=600 -d=1 """#export from __future__ import absolute_import, print_function, unicode_literals from json import dump, dumps, load from os import path, remove, walk from shutil import rmtree from time import time from cleaner import Cleaner from miner import Miner from settings import CHECKPOINT, HTML_DIR from spider import Spider#export def close_crawl(case_type, case_year, output, cases='', lower_bound=0, upper_bound=0, debug=False, scrape=True, mine=True, clean=True): """Main function for Close Crawl. Args: case_type (`str`): type of foreclosure case, options are 'O' and 'C' case_year (`str`): year of foreclosure cases output (`str`): path of the output CSV file, along with the valid extension (.csv) lower_bound (`int`, optional): lower bound of range of cases upper_bound (`int`, optional): upper bound of range of cases debug (`bool`, optional): option for switching between debug mode. Default -> True Returns: None """ temp_output = "temp_data.csv" wait = 0 case_list = [] print('checkpoint') if not path.isfile(CHECKPOINT): print("Initializing project...") with open(CHECKPOINT, "w") as checkpoint: dump( { "last_case": "{:04d}".format(int(str(lower_bound)[-4:])), "type": case_type, "year": case_year[-2:], "error_case": '', }, checkpoint ) print('cases') if not cases: with open(CHECKPOINT) as checkpoint: prev_bound = int(load(checkpoint)["last_case"]) if not lower_bound: lower_bound = prev_bound upper_bound = upper_bound if int(upper_bound) > int(lower_bound) \ else str(lower_bound + 5) case_list = range(int(lower_bound), int(upper_bound) + 1) else: with open(cases) as manual_cases: case_list = sorted(list(set(load(manual_cases)))) print('scrape') if scrape: spider = Spider( case_type=case_type, year=case_year[-2:], bounds=case_list, gui=False ) spider.save_response() wait = spider.WAITING_TIME file_array = [filenames for (dirpath, dirnames, filenames) in walk(HTML_DIR)][0] start_mine = time() print('mine') if mine: miner = Miner(file_array, temp_output) miner.scan_files() miner.export() print('clean') if clean: df_obj = Cleaner(temp_output) df_obj.init_clean() df_obj.download(output) print('save') with open(CHECKPOINT, "r+") as checkpoint: checkpoint_data = load(checkpoint) checkpoint_data["last_case"] = sorted(file_array)[-1].split('.')[0][-4:] checkpoint.seek(0) checkpoint.write(dumps(checkpoint_data)) checkpoint.truncate() print("Crawling runtime: {0:.2f} s".format((end_crawl - start_crawl))) print( "Downloading runtime: {0:.2f} s".format( ((end_crawl - start_crawl) - wait)) ) print("Mining runtime: {0:.2f} s".format((end_mine - start_mine))) print("Program runtime: {0:.2f} s".format((end - start))) print("------------ SCRAPING COMPLETED ------------") #export args = {'type': 'C', 'year': '2014', 'output': 'outputfile.csv', 'file': '', 'lower': '1000', 'upper': '2000', 'debug': '0', 'scrape': True, 'mine': True, 'clean': True} close_crawl( case_type=args["type"], case_year=args["year"], output=args["output"], cases=args["file"], lower_bound=args["lower"], upper_bound=args["upper"], debug=args["debug"], scrape=args["scrape"], mine=args["mine"], clean=args["clean"] )

CLI

ls
cd ../
! python cliargs.py -l=50 -u=3500 -d -s -m C 2016 output.csv
ls

#export
"""cliargs

The main command line script for Close Crawl. This file manages types, flags
and constraints for the case type, year and output data file as well as the
processing options.

Parameters:
    {O,C}         | Type of foreclosure cases
    year          | Year of foreclosure cases
    output        | Path of output file

Optional parameters:
    -h, --help    | Show this help message and exit
    -v, --version | Show program's version number and exit
    -l, --lower   | Lower bound of range of cases
    -u, --upper   | Upper bound of range of cases
    -f, --file    | Path of JSON array of cases
    -d, --debug   | Debug mode
    -s, --scrape  | Scrape only
    -m, --mine    | Mine only
    -c, --clean   | Clean only

Usage:
    $ python cliargs.py [-h] [-v] [-l] [-u] [-f] [-d] [-s] [-m] [-c]
                        {O,C} year output

Example usages:
    $ python cliargs.py -l=50 -u=3500 -d -s -m C 2016 output.csv
    $ python cliargs.py -c="cases_to_scrape.json" -d O 2014 output01.csv
"""

#export
from __future__ import absolute_import, print_function, unicode_literals

import sys
from textwrap import dedent

from modules import main

#export
#!/usr/bin/env python
# -*- coding: utf-8 -*-

if __name__ == "__main__":

    args = {}

    args["type"] = input("Enter type of case (1 char: {C, O}): ")
    args["year"] = input("Enter year of case (4 digit int): ")
    args["output"] = input("Enter name of output file (CSV file path): ")

    opt = int(input("Enter 0 for manual parameters or 1 for automatic: "))

    if bool(opt):
        args["file"] = input("Enter name of cases file (JSON file path): ")
        args["lower"] = args["upper"] = 0
    else:
        args["file"] = ""
        args["lower"] = input("Enter lower bound of cases (1-4 digit int): ")
        args["upper"] = input("Enter upper bound of cases (1-4 digit int): ")

    args["debug"] = input(
        "Enter 0 for default mode, 1 for debug (1 digit int): "
    )

    print(
        dedent(
            """Processing options:\n\n"""
            """For the following options, enter 0 to disable or 1 to enable."""
            """\nNOTE: The script will exit if all but the mining step is """
            """enabled - data cannot be cleaned without being mined first."""
        )
    )

    # cast to int before bool so that an input of "0" is treated as False
    args["scrape"] = bool(int(input("Scrape: {0, 1}: ")))
    args["mine"] = bool(int(input("Mine: {0, 1}: ")))
    args["clean"] = bool(int(input("Clean: {0, 1}: ")))

    # exit script if all but the mining step is enabled
    if args["scrape"] and not args["mine"] and args["clean"]:
        sys.exit("\nData cannot be cleaned without being mined first.")

    print(args)

    main.close_crawl(
        case_type=args["type"],
        case_year=args["year"],
        output=args["output"],
        cases=args["file"],
        lower_bound=args["lower"],
        upper_bound=args["upper"],
        debug=args["debug"],
        scrape=args["scrape"],
        mine=args["mine"],
        clean=args["clean"]
    )

Test

ls
! python cliargs.py -h
! python cliargs.py -d -m -c C 2019 output.csv
! python cliargs.py -l=9500 -u=9999 -c -s -d -m O 2019 output.csv
ls

If both packages are on your import path (sys.path), and the module/class you want is in example/example.py, then to access the class without a relative import, try: from example.example import fkt

if __name__ == '__main__':
    from mymodule import as_int
else:
    from .mymodule import as_int

https://stackoverflow.com/questions/448271/what-is-init-py-for

Files named __init__.py are used to mark directories on disk as Python package directories. If you have the files

mydir/spam/__init__.py
mydir/spam/module.py

and mydir is on your path, you can import the code in module.py as

import spam.module

or

from spam import module
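
Applied to this project, a package layout along those lines might look like the sketch below. The file names under modules/ are inferred from the imports used in this notebook (settings, local_browser, patterns, spider, miner, cleaner, main) and are assumptions, not a guaranteed listing of the repository:

close_crawl/
    cliargs.py            # interactive command line entry point
    modules/
        __init__.py       # marks modules/ as a package
        settings.py       # configuration constants
        local_browser.py  # Session wrapper around MechanicalSoup
        patterns.py       # regex patterns and filter_addr()
        spider.py         # Spider: downloads case HTML
        miner.py          # Miner: extracts features from the HTML
        cleaner.py        # Cleaner: post-processes the raw CSV
        main.py           # close_crawl() pipeline driver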
