Comparison types, transform add-ons, aggregate features, and household aggregate features¶
This page has information on the different comparison types available for the [[comparison_features]]
section, along with some attributes available to all of the comparison types and some aggregate features
that are not configurable.
Comparison types¶
Each header below represents a comparison type. Transforms are used in the context of comparison_features
.
[[comparison_features]]
alias = "relatematch"
column_name = "relate_div_100"
comparison_type = "equals"
categorical = true
maximum_jaro_winkler¶
Finds the greatest Jaro-Winkler value among the cartesian product of multiple columns. For example, given an input of column_names = ['namefrst', 'namelast']
, it would return the maximum Jaro-Winkler name comparison value among the following four comparisons:
[('namefrst_a', 'namefrst_b'),
('namefrst_a', 'namelast_b'),
('namelast_a', 'namefrst_b'),
('namelast_a', 'namelast_b')]
Attributes:
column_names
– Type: list of strings. Required. The list of columns used as input for the set of comparisons generated by taking the cartesian product.
[[comparison_features]]
alias = "maximum_jw"
column_names = ["namelast", "namefrst"]
comparison_type = "maximum_jaro_winkler"
jaro_winkler¶
Returns the Jaro-Winkler comparison score for a given column.
Attributes:
column_name
– Type:string
. Required. The column to compare using the Jaro-Winkler score.
[[comparison_features]]
alias = "namefrst_jw"
column_name = "namefrst"
comparison_type = "jaro_winkler
jaro_winkler_street¶
Uses an additional geographic column value to filter for major location changes before comparing street names. If boundary column A is not equal to boundary column B, a Jaro-Winkler score of zero is returned. If boundary column A and B are equal, the Jaro-Winkler comparison score of the street columns is returned.
Attributes:
column_name
– Type:string
. Required. The input street column.boundary
– Type:string
. Required. An input column to match on before comparing street name values.
[[comparison_features]]
alias = "jw_street"
column_name = "street"
boundary = "enum_dist"
comparison_type = "jaro_winkler_street"
max_jaro_winkler¶
Returns the greatest Jaro-Winkler value from the comparisons of a list of names.
Attributes:
column_name
– Type:string
. Required. Input column containing a list of names to compare (such as related household members, or neighborhood surnames).
[[comparison_features]]
alias = "related_individual_max_jw"
column_name= "namefrst_related"
comparison_type = "max_jaro_winkler"
equals¶
Asserts that values are the same for both compared columns using SQL: a.{column_name} IS NOT DISTINCT FROM b.{column_name}
[[comparison_features]]
alias = "relatematch"
column_name = "relate_div_100"
comparison_type = "equals"
categorical = true
f1_match¶
Evaluates if the first name initial A matches either the first name first initial B or either the first or second middle initial of B. If so, returns 1. Otherwise, returns 2.
1 = First initial of first first name A matches first initial of any of potential match first names B
2 = mismatch
Uses the following SQL query:
"CASE WHEN (
(a.{fi} IS NOT DISTINCT FROM b.{fi}) OR
(a.{fi} IS NOT DISTINCT FROM b.{mi0}) OR
(a.{fi} IS NOT DISTINCT FROM b.{mi1})
) THEN 1 ELSE 2 END"
[[comparison_features]]
alias = "f1_match"
first_init_col = "namefrst_init"
mid_init_cols = ["namefrst_mid_init", "namefrst_mid_init_2"]
comparison_type = "f1_match"
categorical = true
f2_match¶
Evaluates if first middle initial A is empty/null. If so, return 0. Otherwise, if either first or second middle initial A is not null and matches first name initial B, or first or second middle initial B, return 1. Otherwise, return 2.
1 = First initial of A second first name matches first initial of any of potential match first names B
2 = mismatch
0 = no second first name A
Uses the following SQL:
CASE WHEN ((a.{mi0} == '') OR (a.{mi0} IS NULL)) THEN 0 WHEN (
(a.{mi0} IS NOT DISTINCT FROM b.{fi}) OR
((a.{mi1} IS NOT NULL) AND (a.{mi1} IS NOT DISTINCT FROM b.{fi})) OR
(a.{mi0} IS NOT DISTINCT FROM b.{mi0}) OR
(a.{mi0} IS NOT DISTINCT FROM b.{mi1}) OR
((a.{mi1} IS NOT NULL) AND (a.{mi1} IS NOT DISTINCT FROM b.{mi0})) OR
((a.{mi1} IS NOT NULL) AND (a.{mi1} IS NOT DISTINCT FROM b.{mi1}))
) THEN 1 ELSE 2 END
Attributes:
first_init_col
– Type:string
. Required. First name initial input column.mid_init_cols
– Type: list of strings. Required. List of first and second middle initial input columns.
[[comparison_features]]
alias = "f2_match"
first_init_col = "namefrst_init"
mid_init_cols = ["namefrst_mid_init", "namefrst_mid_init_2"]
comparison_type = "f2_match"
categorical = true
not_equals¶
Asserts that values are distinct between compared individuals using SQL: a.{column_name} IS DISTINCT FROM b.{column_name}
. Used mainly in caution flag features (f_caution, m_caution, sp_caution).
Attributes:
column_name
– Type:string
. Required. Input column to compare.
[[comparison_features]]
alias = "m_caution"
column_names = ["mbpl", "mother_birthyr", "stepmom", "momloc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "mbpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "mother_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "stepmom"
comparison_type = "parent_step_change"
[comparison_features.comp_d]
column_name = "momloc"
comparison_type = "present_both_years"
equals_as_int¶
Checks for equality using equals sign and returns boolean result in integer form. Uses SQL: CAST(a.{col} = b.{col} as INT)
Attributes:
column_name
– Type:string
. Required. Input column to compare.
[[comparison_features]]
alias = "namelast_equal_as_int"
column_name = "namelast_clean"
comparison_type = "equals_as_int"
all_equals¶
Asserts whether the values in all given columns match. Uses a SQL expression generated by joining a.{col} = b.{col}
and AND
for each given column.
Attributes:
column_names
– Type: list of strings. Required. List of the columns to evaluate if all are equal across records being compared.
[[comparison_features]]
alias = "exact"
column_names = ["namefrst_unstd", "namelast_clean"]
comparison_type = "all_equals"
not_zero_and_not_equals¶
Checks that both values are present (not null) and nonzero and that they are not equal to one another. Evaluates to a boolean. This is primarily useful when a value of 0 indicates some kind of incomparibility akin to the value being missing.
See also present_and_equal_categorical_in_universe, which is a similar but more general comparison type.
Attributes:
column_name
– Type:string
. Required. Input column to compare.
[[comparison_features]]
alias = "fbpl_nomatch"
column_name = "fbpl"
comparison_type = "not_zero_and_not_equals"
or¶
Allows for the concatenation of up to four comparison features into one feature using a SQL OR
between the generated clause for each sub-comparison.
Attributes:
column_names
– Type: list of strings. Required. A list of all input columns used by sub-comparisons.comp_a
,comp_b
– Type: Object. Required. Sub-comparison using any of the comparison feature types documented in this section.comp_c
,comp_d
– Type: Object. Optional. Sub-comparison using any of the comparison feature types documented in this section.
[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr"]
comparison_type = "or"
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
lower_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
and¶
Allows for the concatenation of up to four comparison features into one feature using a SQL AND
between the generated clause for each sub-comparison.
Attributes:
column_names
– Type: list of strings. Required. A list of all input columns used by sub-comparisons.comp_a
,comp_b
– Type: Object. Required. Sub-comparison using any of the comparison feature types documented in this section.comp_c
,comp_d
– Type: Object. Optional. Sub-comparison using any of the comparison feature types documented in this section.
In this example, the and
comparison appears in [comparison_features.comp_b]
.
[[comparison_features]]
alias = "street_jw"
comparison_type = "times"
column_names = ["street","county", "statefip"]
[comparison_features.comp_a]
column_name = "street"
comparison_type = "jaro_winkler"
lower_threshold = 0.9
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"
times¶
Takes the output of two sub-comparisons and multiplies them together after casting as floats.
Attributes:
column_names
– Type: list of strings. Required. A list of all input columns used by sub-comparisons.comp_a
,comp_b
– Type: Object. Required. Sub-comparison using any of the comparison feature types documented in this section.comp_a
andcomp_b
can also have sub-comparisons, as in the given example.
[[comparison_features]]
alias = "street_jw"
comparison_type = "times"
column_names = ["street","county", "statefip"]
[comparison_features.comp_a]
column_name = "street"
comparison_type = "jaro_winkler"
lower_threshold = 0.9
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"
caution_comp_3¶
Generates an SQL expression in the form (comparison A OR comparison B) AND comparison C
.
Attributes:
column_names
– Type: list of strings. Required. A list of all input columns used by sub-comparisons.comp_a
,comp_b
,comp_c
– Type: Object. Required. Sub-comparisons using any of the comparison feature types documented in this section.
[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_3"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
caution_comp_3_012¶
Similar to caution_comp_3
, but first checks the value of comparison C. If comparison C evaluates to false,
then caution_comp_3_012
evaluates to 2. Otherwise, it evaluates to the result of caution_comp_3
, so 0 or 1.
Attributes:
column_names
– Type: list of strings. Required. A list of all input columns used by sub-comparisons.comp_a
,comp_b
,comp_c
– Type: Object. Required. Sub-comparison using any of the comparison feature types documented in this section.
caution_comp_4¶
Generates an SQL expression in the form (comparison A OR comparison B OR comparison C) AND comparison D
.
Attributes:
column_names
– Type: list of strings. Required. A list of all input columns used by sub-comparisons.comp_a
,comp_b
,comp_c
,comp_d
– Type: Object. Required. Sub-comparisons using any of the comparison feature types documented in this section.
[[comparison_features]]
alias = "m_caution"
column_names = ["mbpl", "mother_birthyr", "stepmom", "momloc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "mbpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "mother_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "stepmom"
comparison_type = "parent_step_change"
[comparison_features.comp_d]
column_name = "momloc"
comparison_type = "present_both_years"
caution_comp_4_012¶
Similar to caution_comp_4
, but first checks the value of comparison D. If comparison D evaluates to false,
then caution_comp_4_012
evaluates to 2. Otherwise, it evaluates to the result of caution_comp_4
, so 0 or 1.
Attributes:
column_names
– Type: list of strings. Required. A list of all input columns used by sub-comparisons.comp_a
,comp_b
,comp_c
,comp_d
– Type: Object. Required. Sub-comparisons using any of the comparison feature types documented in this section.
any_equals¶
Used to compare middle initials and first names under specific circumstances.
If middle initial A is not empty/null and is the same as either middle initial B or first name B,
OR if first name A is not empty/null and is the same as middle initial B.
Attributes:
column_names
– Type: list of strings. Required. The first input column should be the middle initial column, and the second input column should be the first name column.
[[comparison_features]]
alias = "mid_init_match"
column_names = ["namefrst_mid_init", "namefrst_unstd"]
comparison_type = "any_equals"
either_are_1¶
Checks if the column value for either A or B is equal to 1.
Attributes:
column_name
– Type:string
. Required. Input column to compare to 1.
[[comparison_features]]
alias = "either_1"
column_name = "nativity"
comparison_type = "either_are_1"
categorical = true
either_are_0¶
Checks if the column value for either A or B is equal to 0.
Attributes:
column_name
– Type:string
. Required. Input column to compare to 0.
[[comparison_features]]
alias = "either_0"
column_name = "nativity"
comparison_type = "either_are_0"
categorical = true
second_gen_imm¶
Checks if individual A is a second-generation immigrant by looking for nativity
value of 2, 3, or 4 (one or both parents foreign-born).
Attributes:
column_name
– Type:string
. Required. Input should be the name of the nativity column.
[[comparison_features]]
alias = "sgen"
column_name = "nativity"
comparison_type = "second_gen_imm"
categorical = true
rel_jaro_winkler¶
Uses a Scala function to determine the number of people in the input column with a name similarity score (Jaro-Winkler) greater than or equal to the given jw_threshold
, an age difference less than or equal to the given age_threshold
, and matching sex for the sample A individual and the sample B potential match. Takes a column generated with the feature selection transform related_individual_rows
as input (list of person data objects to compare). Can be used for related or unrelated individuals, depending on the input column specified.
Attributes:
column_name
– Type:string
. The input column with data in the form of a list of person data objects.name_col
– Type:string
.birthyr_col
– Type:string
.jw_threshold
– Type:float
.age_threshold
– Type:int
.
[[comparison_features]]
alias = "rel"
column_name = "namefrst_related_rows"
name_col = "namefrst_unstd"
birthyr_col = "replaced_birthyr"
comparison_type = "rel_jaro_winkler"
jw_threshold = 0.9
age_threshold = 5
extra_children¶
Using a Scala function, checks to see if there are children present in sample B who are not present in sample A, but based on relate codes, age, sex, and name, we would have expected to be present in A. Returns a count of suspected “extra” children. Takes a column generated with the feature selection transform related_individual_rows
as input (list of person data objects to compare).
Attributes:
column_name
– Type:string
. The input column with data in the form of a list of person data objects.relate_col
– Type:string
. The name of the column with therelate
code.histid_col
– Type:string
. The name of the id column.name_col
– Type:string
. The name of the column containing the first name for comparison.birthyr_col
– Type:string
. The name of the column containing the birth year.year_b
– Type:int
. The year that sample B was taken.jw_threshold
– Type:float
. The minimum acceptable Jaro-Winkler score to consider a match.age_threshold
– Type:int
. The maximum acceptable age difference to consider a match.
[[comparison_features]]
alias = "extra_children"
column_name = "namefrst_related_rows"
relate_col = "relate"
histid_col = "histid"
name_col = "namefrst_unstd"
birthyr_col = "replaced_birthyr"
year_b = 1910
comparison_type = "extra_children"
jw_threshold = 0.8
age_threshold = 2
jaro_winkler_rate¶
Uses a Scala function to calculate the percentage of individuals who have a Jaro-Winkler score greater than or equal to the given threshold. Rate returned as a percentage as a float data type.
Attributes:
column_name
– Type:string
. The input column with data in the form of a list of person data objects. The input column seen below (”namelast_neighbors”)was generated using a “neighbor_aggregate” feature selection.jw_threshold
– Type:float
. The minimum Jaro-Winkler threshold to consider an acceptable match.
In the following example, a lower_threshold
feature add-on is used to convert the returned rate to a boolean asserting whether it meets the given minimum threshold. (>= 5% of neighbors have a Jaro-Winkler score >= 0.95)
[[comparison_features]]
alias = "nbors"
comparison_type = "times"
column_names = ["namelast_neighbors", "county", "statefip"]
[comparison_features.comp_a]
column_name = "namelast_neighbors"
comparison_type = "jaro_winkler_rate"
jw_threshold = 0.95
lower_threshold = 0.05
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"
sum¶
Adds the column values for A and B together (takes the sum).
Attributes:
column_name
– Type:string
. The input column to be added.
[[comparison_features]]
alias = "namelast_popularity_sum"
column_name = "namelast_popularity"
comparison_type = "sum"
length_b¶
Returns the length of the column value in record B using the SQL size()
function.
Attributes:
column_name
– Type:string
. The name of the input column to take the length of in dataset B.
abs_diff¶
Takes the absolute value of the difference between the values of the given column in datasets A and B.
Attributes:
column_name
– Type:string
. The input column to evaluate.not_equals
– Type:int
. OPTIONAL. You can specify a value for the column to be considered invalid input, in which case the expression would return the value -1 instead of an absolute difference. For example, if you are evaluating the difference in marriage duration values, and “99” is a placeholder value for “unknown” in the data, you can exclude those values from consideration using this attribute.
[[comparison_features]]
alias = "byrdiff"
column_name = "replaced_birthyr"
comparison_type = "abs_diff"
[[comparison_features]]
alias = "mardurmatch"
column_name = "durmarr"
not_equals = 99
comparison_type = "abs_diff"
btwn_threshold = [9, 14]
categorical = True
b_minus_a¶
Returns the value of subtracting the value of column A from the value of column B.
Attributes:
column_name
– Type:string
. The input column to evaluate.not_equals
– Type:int
. OPTIONAL. You can specify a value for the column to be considered invalid input, in which case the expression would return the value -1 instead of an absolute difference. For example, if you are evaluating the difference in marriage duration values, and “99” is a placeholder value for “unknown” in the data, you can exclude those values from consideration using this attribute.
[[comparison_features]]
alias = "mardurmatch"
column_name = "durmarr"
not_equals = 99
comparison_type = "b_minus_a"
btwn_threshold = [5,14]
categorical = true
geo_distance¶
Uses a lookup table to find the geographic distance between locations. The SQL expression is generated by hlink/linking/core/dist_table.py
. There are several ways to configure this feature. You can look up distances in the given file using one or two keys (specified with the key_count
attribute). You can also optionally have a secondary look-up table that serves as a back-up value in the case that the primary look-up does not contain a value for the locations given. This is particularly useful for county distance, as you can set the primary join to be across counties, but set up a secondary join on state, which has much fewer combinations and thus less risk of nulls, to fill in when the counties specified aren’t in the look-up.
Attributes:
key_count
– Type:int
. The number of keys used to join on the primary (or only) look-up table. Acceptable values are 1 or 2. Ex: for state and county, key_count = 2. For just state, key_count = 1 even though there is county_a and county_b.distances_file
– Type:string
of path. Path to the distances look-up file.table_name
– Type:string
. What to name the table that will be generated from the distances file. If you want to do multiple look-ups, if the table_name is the same across all feature specifications, it will only be read in once.Attributes for
key_count = 1
:column_name
– Type:string
. The column in the input data that you want to use as a key to look up the geographic distance.loc_a
– Type:string
. First column to join on in the look-up table (where to find the value coming from thecolumn_name
column A).loc_b
– Type:string
. Second column to join on in the look-up table (where to find the value coming from thecolumn_name
column B).distance_col
– Type:string
. Name of the column containing the geographic distance in the look-up table.
Attributes for
key_count = 2
:column_names
– Type: list of strings. The two columns you want to use as keys to look up the geographic distance.source_column_a
– Type:string
. First column to join on in the source data.source_column_b
– Type:string
. Second column to join on in the source data.loc_a_0
– Type:string
. First column to join on in the look-up table.loc_a_1
– Type:string
. First column to join on in the look-up table.loc_b_0
– Type:string
. Second column to join on in the look-up table.loc_b_1
– Type:string
. Second column to join on in the look-up table.distance_col
– Type:string
. Name of the column containing the geographic distance in the look-up table.
Attributes if using a secondary join:
secondary_key_count
– Type:int
. The number of keys used to join on the secondary (backup) look-up table. Acceptable values are 1 or 2.secondary_table_name
– Type:string
. What to name the table that will be generated from thesecondary_distances_file
. If you want to do multiple look-ups, if the table_name is the same across all feature specifications, it will only be read in once.secondary_distances_file
– Type:string
of path. Path to the secondary distances look-up file.secondary_source_column
– Type:string
. The column in the input data that you want to use as a key in the secondary geographic distance look-up.secondary_loc_a
– Type:string
. First column to join on in the secondary look-up table.secondary_loc_b
– Type:string
. Second column to join on in the secondary look-up table.secondary_distance_col
– Type:string
. Name of the column containing the geographic distance in the secondary look-up table.
[[comparison_features]]
alias = "state_distance"
comparison_type = "geo_distance"
key_count = 1
table_name = "state_distance_lookup"
distances_file = "/path/to/county_state_distance.csv"
column_name = "bpl"
loc_a = "statecode1"
loc_b = "statecode2"
distance_col = "dist"
[[comparison_features]]
alias = "county_distance"
comparison_type = "geo_distance"
column_names = ["county", "statefip"]
key_count = 2
table_name = "county_distance_lookup"
distances_file = "/path/to/county_1900_1910_distances_km.csv"
# columns to join on in the data
source_column_a = "county"
source_column_b = "statefip"
# column names from the csv lookup file
loc_a_0 = "from_icpsrctyi"
loc_a_1 = "to_icpsrctyi"
loc_b_0 = "from_statefip"
loc_b_1 = "to_statefip"
distance_col = "distance_km"
# SECONDARY JOIN
secondary_key_count = 1
secondary_table_name = "state_distance_lookup"
secondary_distances_file = "/path/to/state_1900_1910_distances_km.csv"
secondary_source_column = "statefip"
secondary_loc_a = "from_statefip"
secondary_loc_b = "to_statefip"
secondary_distance_col = "distance_km"
fetch_a¶
Gets the value of column A.
Attributes:
column_name
– Type:string
. Required. The column to get the value from.
[[comparison_features]]
alias = "race"
column_name = "race"
comparison_type = "fetch_a"
categorical = true
fetch_b¶
Gets the value of column B.
Attributes:
column_name
– Type:string
. The column to get the value from.
[[comparison_features]]
alias = "race"
column_name = "race"
comparison_type = "fetch_b"
categorical = true
present_both_years¶
Checks whether column A and column B are both present and both positive (> 0). Evaluates to 1 if both are present and positive and 0 otherwise.
Attributes:
column_name
– Type:string
. The column to check. Must be a column with a numerical type.
[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
[comparison_features.comp_d]
column_name = "sploc"
comparison_type = "present_both_years"
neither_are_null¶
Checks that neither column A nor column B is null or the empty string ''
.
Evaluates to 1 if neither column is null or ''
and evaluates to 0 otherwise.
Attributes:
column_name
– Type:string
. The column of type string to check.
[[comparison_features]]
alias = "mpres"
column_name = "m_namefrst"
comparison_type = "neither_are_null"
categorical = true
present_and_matching_categorical¶
Checks that both column A and column B are present and that they match according to SQL’s IS DISTINCT FROM
. Evaluates to 0, 1, or 2:
0 → columns are both present and match
1 → columns are both present but are distinct
2 → one or both columns are missing
Attributes:
column_name
– Type:string
. Required. The column to check.
present_and_not_equal¶
Checks that column A and column B are both present but are not equal.
Attributes:
column_name
– Type:string
. The column to check.
present_and_equal_categorical_in_universe¶
Checks that column A and column B are both present, are not equal to the not-in-universe value NIU,
and are equal to each other according to SQL’s IS DISTINCT FROM
. Evaluates to 0 if either column is missing or
if either column is the NIU value. Otherwise, evaluates to 0 if the columns are distinct or 1 if
the columns are equal.
Attributes:
column_name
– Type:string
. Required. The column to check.NIU
– Type: same as the type of the input column. Required. The not-in-universe value to use in the check.
[[comparison_features]]
alias = "mfbplmatch"
column_name = "nativity"
comparison_type = "present_and_equal_categorical_in_universe"
NIU = "0"
categorical = true
sql_condition¶
This is a flexible comparison type that allows users to write their own SQL expressions to be evaluated. Favor using a different comparison type if that’s a reasonable option. If there are no other comparison types that work for a particular use case, this one is a good fallback.
Attributes:
column_names
– Type: list of strings. Required. A list of all columns used in the SQL expression.condition
– Type:string
. The SQL expression to evaluate.
In this example, we make use of the hlink-defined jw
function, which computes
the Jaro-Winkler similarity of two strings. nvl
is a Spark builtin function
which returns its second argument if the first is null, and the first argument
otherwise.
[[comparison_features]]
alias = "namelast_jw_max"
comparison_type = "sql_condition"
column_names = ["namelast1", "namelast2", "namelast3"]
condition = "GREATEST(jw(nvl(a.namelast1, ''), nvl(b.namelast1, '')), jw(nvl(a.namelast2, ''), nvl(b.namelast2, '')), jw(nvl(a.namelast3, ''), nvl(b.namelast3, '')))"
Feature add-ons¶
These attributes can be added to most comparison feature types above to extend the type of output returned beyond the standard comparison feature.
alias¶
Attributes:
alias
: Type:string
. Should be used at the top level comparison of every comparison feature. The name for the output column.
[[comparison_features]]
alias = "jw_f"
column_name = "father_namefrst"
comparison_type = "jaro_winkler"
power¶
Raises a comparison feature to a given exponential power.
Attributes:
power
– Type:int
. The power to raise the comparison output to. For example,power = 2
will square the output.
[[comparison_features]]
alias = "county_distance_squared"
comparison_type = "geo_distance"
column_names = ["county", "statefip"]
# PRIMARY JOIN
# key count: the number of keys used for the join per source file. Ex: for state and county, key_count = 2. For just state, key_count = 1 even though there is county_a and county_b
key_count = 2
table_name = "county_distance_lookup"
#distances_file = "/path/to/county_state_distance.csv"
distances_file = "/path/to/county_1900_1910_distances_km.csv"
# columns to join on in the data
source_column_a = "county"
source_column_b = "statefip"
# column names from the csv lookup file
loc_a_0 = "from_icpsrctyi"
loc_a_1 = "to_icpsrctyi"
loc_b_0 = "from_statefip"
loc_b_1 = "to_statefip"
distance_col = "distance_km"
# SECONDARY JOIN
secondary_key_count = 1
secondary_table_name = "state_distance_lookup"
secondary_distances_file = "/path/to/state_1900_1910_distances_km.csv"
secondary_source_column = "statefip"
secondary_loc_a = "from_statefip"
secondary_loc_b = "to_statefip"
secondary_distance_col = "distance_km"
power = 2
threshold¶
Attributes:
threshold
– Type: numeric types. Asserts if the comparison feature output is not null and is greater than or equal to (>=
) the given threshold value.
[[comparison_features]]
alias = "imm"
column_name = "nativity"
comparison_type = "fetch_a"
threshold = 5
categorical = true
lower_threshold¶
Attributes:
lower_threshold
– Type: numeric types. Asserts if the comparison feature output is not null and is greater than or equal to (>=
) the given threshold value.
[[comparison_features]]
alias = "street_jw"
comparison_type = "times"
column_names = ["street","county", "statefip"]
[comparison_features.comp_a]
column_name = "street"
comparison_type = "jaro_winkler"
lower_threshold = 0.9
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"
upper_threshold¶
Attributes:
upper_threshold
– Type: numeric types. Asserts if the comparison feature output is not null and is less than or equal to (<=
) the given threshold value.
[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
[comparison_features.comp_d]
column_name = "sploc"
comparison_type = "present_both_years"
gt_threshold¶
Attributes:
gt_threshold
– Type: numeric types. Asserts if the comparison feature output is not null and is greater than (>
) the given threshold value.
[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
[comparison_features.comp_d]
column_name = "sploc"
comparison_type = "present_both_years"
btwn_threshold¶
Attributes:
btwn_threshold
– Type: List of numeric type. Asserts if the comparison feature output is greater than or equal to (>=
) the first threshold value, and less than or equal to (<=
) the second threshold value.
[[comparison_features]]
alias = "mardurmatch"
column_name = "durmarr"
not_equals = 99
comparison_type = "b_minus_a"
btwn_threshold = [5,14]
categorical = true
look_at_addl_var¶
Attributes:
look_at_addl_var
– Type: boolean. Flags the program to consider an additional column value before reporting the comparison feature value.addl_var
– Type:string
. The additional column to consider.check_val_expr
– Type: expression. The expression to use to evaluate the additional column. For example,check_val_expr = "= 5"
.else_val
– Type: same type as comparison feature output. If the additional volumn value does not meet thecheck_val_expr
specification, the value to return instead of the comparison feature value.
In the following example, the generated SQL expression for the column would be: CASE WHEN {datasource}.nativity = 5 then {yrimmig abs_diff value} else -1 END
.
[[comparison_features]]
alias = "immyear_diff"
column_name = "yrimmig"
comparison_type = "abs_diff"
look_at_addl_var = true
addl_var = "nativity"
datasource = "a"
check_val_expr = "= 5"
else_val = -1
Aggregate Features¶
These features are not configurable. To include them in the generated comparison features, they just need to be included in the [training][independent_vars]
section of the config. They are generated using the “aggregate_features” SQL template.
hits¶
The number of potential matches generated for the given individual (counted by aggregating on {id_column}_a
).
hits2¶
hits
squared.
exact_mult¶
Indicator for the existence of multiple potential matches with the exact same first and last name as the A sample individual within the B data. Returns numeric boolean (0 or 1).
Household Aggregate Features¶
These features are not configurable. To include them in the generated comparison features, they just need to be included in the [hh_training][independent_vars]
section of the config. They are generated using the “hh_aggregate_features” SQL template.
jw_max_a¶
The highest Jaro-Winkler score for any of the first names in linked household A against the first name in linked household B where birth year difference is less than or equal to ten, excluding the individual A in the current potential match. Returns 0
if no other individuals are in the household for comparison.
jw_max_b¶
The highest Jaro-Winkler score for any of the first names in linked household A against the first name in linked household B where sex matches and birth year difference is less than or equal to ten, excluding the individual A in the current potential match. Returns 0
if no other individuals are in the household for comparison.