6.3. Ancillary tools
These tools fetch useful data from external sources, for use by CRATE.
6.3.1. crate_postcodes
Options as of 2017-02-28:
usage: crate_postcodes [-h] [--dir DIR] [--url URL] [--echo]
[--reportevery REPORTEVERY] [--commitevery COMMITEVERY]
[--startswith STARTSWITH [STARTSWITH ...]] [--replace]
[--skiplookup]
[--specific_lookup_tables [SPECIFIC_LOOKUP_TABLES [SPECIFIC_LOOKUP_TABLES ...]]]
[--list_lookup_tables] [--skippostcodes] [--docsonly]
[-v]
- This program reads data from the UK Office for National Statistics Postcode
Directory (ONSPD) and inserts it into a database.
- You will need to download the ONSPD from
https://geoportal.statistics.gov.uk/geoportal/catalog/content/filelist.page
e.g. ONSPD_MAY_2016_csv.zip (79 MB), and unzip it (>1.4 GB) to a directory.
Tell this program which directory you used.
- Specify your database as an SQLAlchemy connection URL: see
http://docs.sqlalchemy.org/en/latest/core/engines.html
The general format is:
dialect[+driver]://username:password@host[:port]/database[?key=value...]
- If you get an error like:
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in
position 33: ordinal not in range(256)
then try appending "?charset=utf8" to the connection URL.
- ONS POSTCODE DATABASE LICENSE.
Output using this program must add the following attribution statements:
Contains OS data © Crown copyright and database right [year]
Contains Royal Mail data © Royal Mail copyright and database right [year]
Contains National Statistics data © Crown copyright and database right [year]
See http://www.ons.gov.uk/methodology/geography/licences
optional arguments:
-h, --help show this help message and exit
--dir DIR Root directory of unzipped ONSPD download (default:
/home/rudolf/dev/onspd)
--url URL SQLAlchemy database URL
--echo Echo SQL
--reportevery REPORTEVERY
Report every n rows (default: 1000)
--commitevery COMMITEVERY
Commit every n rows (default: 10000). If you make this
too large (relative, e.g., to your MySQL
max_allowed_packet setting), you may get crashes with
errors like 'MySQL server has gone away'.
--startswith STARTSWITH [STARTSWITH ...]
Restrict to postcodes that start with one of these
strings
--replace Replace tables even if they exist (default: skip
existing tables)
--skiplookup Skip generation of code lookup tables
--specific_lookup_tables [SPECIFIC_LOOKUP_TABLES [SPECIFIC_LOOKUP_TABLES ...]]
Within the lookup tables, process only specific named
tables
--list_lookup_tables List all possible lookup tables, then stop
--skippostcodes Skip generation of main (large) postcode table
--docsonly Show help for postcode table then stop
-v, --verbose Verbose
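A minimal sketch of a typical invocation, assuming the ONSPD has been unzipped to ~/onspd and is being loaded into a local MySQL database called onspd (the directory, database name, and credentials here are illustrative, not defaults):

crate_postcodes \
    --dir ~/onspd \
    --url "mysql+mysqldb://username:password@127.0.0.1:3306/onspd?charset=utf8"

The "?charset=utf8" suffix is the workaround described above for MySQL Unicode errors; omit it if your database backend does not need it.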
6.3.2. crate_fetch_wordlists
This tool assists in fetching common word lists, such as name lists for global blacklisting, and words to exclude from such lists (e.g. English words or medical eponyms). It also provides an exclusion filter, to find lines in some files that are absent from others.
Options as of 2018-03-27:
usage: crate_fetch_wordlists [-h] [--specimen] [--verbose]
[--min_word_length MIN_WORD_LENGTH]
[--show_rejects] [--english_words]
[--english_words_output ENGLISH_WORDS_OUTPUT]
[--english_words_url ENGLISH_WORDS_URL]
[--valid_word_regex VALID_WORD_REGEX]
[--us_forenames]
[--us_forenames_url US_FORENAMES_URL]
[--us_forenames_min_cumfreq_pct US_FORENAMES_MIN_CUMFREQ_PCT]
[--us_forenames_max_cumfreq_pct US_FORENAMES_MAX_CUMFREQ_PCT]
[--us_forenames_output US_FORENAMES_OUTPUT]
[--us_surnames]
[--us_surnames_output US_SURNAMES_OUTPUT]
[--us_surnames_1990_census_url US_SURNAMES_1990_CENSUS_URL]
[--us_surnames_2010_census_url US_SURNAMES_2010_CENSUS_URL]
[--us_surnames_min_cumfreq_pct US_SURNAMES_MIN_CUMFREQ_PCT]
[--us_surnames_max_cumfreq_pct US_SURNAMES_MAX_CUMFREQ_PCT]
[--eponyms] [--eponyms_output EPONYMS_OUTPUT]
[--eponyms_add_unaccented_versions [EPONYMS_ADD_UNACCENTED_VERSIONS]]
[--filter_input [FILTER_INPUT [FILTER_INPUT ...]]]
[--filter_exclude [FILTER_EXCLUDE [FILTER_EXCLUDE ...]]]
[--filter_output [FILTER_OUTPUT]]
optional arguments:
-h, --help show this help message and exit
--specimen Show some specimen usages and exit (default: False)
--verbose, -v Be verbose (default: False)
--min_word_length MIN_WORD_LENGTH
Minimum word length to allow (default: 2)
--show_rejects Print to stdout (and, in verbose mode, log) the words
being rejected (default: False)
English words:
--english_words Fetch English words (for reducing nonspecific
blacklist, not as whitelist; consider words like
smith) (default: False)
--english_words_output ENGLISH_WORDS_OUTPUT
Output file for English words (default:
english_words.txt)
--english_words_url ENGLISH_WORDS_URL
URL for a textfile containing all English words (will
then be filtered) (default: https://www.gutenberg.org/
files/3201/files/CROSSWD.TXT)
--valid_word_regex VALID_WORD_REGEX
Regular expression to determine valid English words
(default: ^[a-z](?:[A-Za-z'-]*[a-z])*$)
US forenames:
--us_forenames Fetch US forenames (for blacklist) (default: False)
--us_forenames_url US_FORENAMES_URL
URL to Zip file of US Census-derived forenames lists
(excludes names with national frequency <5; see
https://www.ssa.gov/OACT/babynames/limits.html)
(default:
https://www.ssa.gov/OACT/babynames/names.zip)
--us_forenames_min_cumfreq_pct US_FORENAMES_MIN_CUMFREQ_PCT
Fetch only names where the cumulative frequency
percentage up to and including this name was at least
this value. Range is 0-100. Use 0 for no limit.
Setting this above 0 excludes COMMON names. (This is a
trade-off between being comprehensive and operating at
a reasonable speed. Higher numbers are more
comprehensive but slower.) (default: 0)
--us_forenames_max_cumfreq_pct US_FORENAMES_MAX_CUMFREQ_PCT
Fetch only names where the cumulative frequency
percentage up to and including this name was less than
or equal to this value. Range is 0-100. Use 100 for no
limit. Setting this below 100 excludes RARE names.
(This is a trade-off between being comprehensive and
operating at a reasonable speed. Higher numbers are
more comprehensive but slower.) (default: 100)
--us_forenames_output US_FORENAMES_OUTPUT
Output file for US forenames (default:
us_forenames.txt)
US surnames:
--us_surnames Fetch US surnames (for blacklist) (default: False)
--us_surnames_output US_SURNAMES_OUTPUT
Output file for US surnames (default: us_surnames.txt)
--us_surnames_1990_census_url US_SURNAMES_1990_CENSUS_URL
URL for textfile of US 1990 Census surnames (default:
http://www2.census.gov/topics/genealogy/1990surnames/d
ist.all.last)
--us_surnames_2010_census_url US_SURNAMES_2010_CENSUS_URL
URL for zip of US 2010 Census surnames (default: https
://www2.census.gov/topics/genealogy/2010surnames/names
.zip)
--us_surnames_min_cumfreq_pct US_SURNAMES_MIN_CUMFREQ_PCT
Fetch only names where the cumulative frequency
percentage up to and including this name was at least
this value. Range is 0-100. Use 0 for no limit.
Setting this above 0 excludes COMMON names. (This is a
trade-off between being comprehensive and operating at
a reasonable speed. Higher numbers are more
comprehensive but slower.) (default: 0)
--us_surnames_max_cumfreq_pct US_SURNAMES_MAX_CUMFREQ_PCT
Fetch only names where the cumulative frequency
percentage up to and including this name was less than
or equal to this value. Range is 0-100. Use 100 for no
limit. Setting this below 100 excludes RARE names.
(This is a trade-off between being comprehensive and
operating at a reasonable speed. Higher numbers are
more comprehensive but slower.) (default: 100)
Medical eponyms:
--eponyms Write medical eponyms (to remove from blacklist)
(default: False)
--eponyms_output EPONYMS_OUTPUT
Output file for medical eponyms (default:
medical_eponyms.txt)
--eponyms_add_unaccented_versions [EPONYMS_ADD_UNACCENTED_VERSIONS]
Add unaccented versions (e.g. Sjogren as well as
Sjögren) (default: True)
Filter functions:
Extra functions to filter wordlists. Specify an input file (or files),
whose lines will be included; optional exclusion file(s), whose lines will
be excluded (in case-insensitive fashion); and an output file. You can use
'-' for the output file to mean 'stdout', and for one input file to mean
'stdin'. No filenames (other than '-' for input and output) may overlap.
The --min_word_length option also applies. Duplicates are not removed.
--filter_input [FILTER_INPUT [FILTER_INPUT ...]]
Input file(s). See above. (default: None)
--filter_exclude [FILTER_EXCLUDE [FILTER_EXCLUDE ...]]
Exclusion file(s). See above. (default: None)
--filter_output [FILTER_OUTPUT]
Output file. See above. (default: None)
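As a worked example of the cumulative frequency limits: --us_forenames_max_cumfreq_pct 95 keeps only forenames within the commonest 95% by cumulative frequency (per the specimen figures below, 5874 of 96174 forenames), while leaving --us_forenames_min_cumfreq_pct at 0 imposes no lower limit. A minimal sketch (the output filename is illustrative):

crate_fetch_wordlists \
    --us_forenames \
    --us_forenames_max_cumfreq_pct 95 \
    --us_forenames_output us_forenames_common.txt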
Specimen usage:
#!/bin/bash
# -----------------------------------------------------------------------------
# Specimen usage under Linux
# -----------------------------------------------------------------------------
cd ~/Documents/code/crate/working
# Downloading these and then using a file:// URL is unnecessary, but it makes
# the processing steps faster if we need to retry with new settings.
wget https://www.gutenberg.org/files/3201/files/CROSSWD.TXT -O dictionary.txt
wget https://www.ssa.gov/OACT/babynames/names.zip -O forenames.zip
wget http://www2.census.gov/topics/genealogy/1990surnames/dist.all.last -O surnames_1990.txt
wget https://www2.census.gov/topics/genealogy/2010surnames/names.zip -O surnames_2010.zip
crate_fetch_wordlists --help
crate_fetch_wordlists \
--english_words \
--english_words_url file://$PWD/dictionary.txt \
--us_forenames \
--us_forenames_url file://$PWD/forenames.zip \
--us_forenames_max_cumfreq_pct 100 \
--us_surnames \
--us_surnames_1990_census_url file://$PWD/surnames_1990.txt \
--us_surnames_2010_census_url file://$PWD/surnames_2010.zip \
--us_surnames_max_cumfreq_pct 100 \
--eponyms
# --show_rejects \
# --verbose
# Forenames encompassing the top 95% gives 5874 forenames (of 96174).
# Surnames encompassing the top 85% gives 74525 surnames (of 175880).
crate_fetch_wordlists \
--filter_input \
us_forenames.txt \
us_surnames.txt \
--filter_exclude \
english_words.txt \
medical_eponyms.txt \
--filter_output \
filtered_names.txt
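The filter functions can also read from stdin and write to stdout by passing '-', as described above. A minimal sketch, in which extra_names.txt and extra_names_filtered.txt are hypothetical filenames and the exclusion files are those generated earlier:

cat extra_names.txt | crate_fetch_wordlists \
    --filter_input - \
    --filter_exclude english_words.txt medical_eponyms.txt \
    --filter_output - > extra_names_filtered.txt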