Defining Shelves from parsed configuration¶
Parsed config allows defining fields and conditions using natural, SQL-like language.
Defining Shelves¶
Shelves are defined in configuration as dictionaries with keys and values that are Ingredient configuration definitions. A simple example (configured in yaml) looks like this.
_version: "2"
total_population:
kind: Metric
field: pop2000
state:
kind: Dimension
field: state
See examples for more Shelf examples.
Defining Ingredients¶
Ingredients are defined using parsed_fields (which may contain parsed_conditions_).
Metric¶
Metrics will always apply a default aggregation of ‘sum’ to any fields used.
kind: Metric
field: {field}
Dimension¶
Metrics will always apply a default aggregation of ‘sum’ to their field.
kind: Dimension
field: {field}
{role}_field: {field} (optional)
buckets: A list of labeled conditions (optional)
buckets_default_label: string (optional)
quickselects: A list of labeled conditions (optional)
Adding an id and other roles to Dimension¶
Dimensions can be defined with extra fields. The prefix before _field
is the field’s role. The role will be suffixed to each value in the
recipe rows. Let’s look at an example.
hospital:
field: hospital_name
id_field: hospital_id
latitude_field: hospital_lat
longitude_field: hospital_lng
Each result row will include
hospital
hospital_id
The field defined asid_field
hospital_latitude
The field defined aslatitude_field
hospital_longitude
The field defined aslongitude_field
Defining buckets¶
Buckets let you group continuous values (like salaries or ages). Here’s an example:
groups:
kind: Dimension
field: age
buckets:
- label: 'northeasterners'
condition: 'state in ["Vermont", "New Hampshire"]'
- label: 'babies'
condition: '<2'
- label: 'children'
condition: '<13'
- label: 'teens'
condition: '<20'
buckets_default_label: 'oldsters'
The conditions are evaluated in order. buckets_default_label is used for any values that didn’t match any condition.
For convenience, conditions defined in buckets will use the field from the Dimension
unless a different field is defined in the condition. In the example above, the first
bucket uses field: state
explicitly while all the other conditions use field: age
from the Dimension.
The bucket Dimension will include an order_by that orders results in the order that the buckets were defined.
Adding quickselects to a Dimension¶
quickselects are a way of associating conditions with a dimension.
region:
kind: Dimension
field: sales_region
total_sales:
kind: Metric
field: sales_dollars
date:
kind: Dimension
field: sales_date
quickselects:
- label: 'Last 90 days'
condition: 'between "90 days ago" and "tomorrow"'
- label: 'Last 180 days'
condition: 'between "180 days ago" and "tomorrow"'
These conditions can then be accessed through Ingredient.build_filter
.
The AutomaticFilters
extension is an easy way to use this.
recipe = Recipe(session=oven.Session(), extension_classes=[AutomaticFilters]). \
.dimensions('region') \
.metrics('total_sales') \
.automatic_filters({
'date__quickselect': 'Last 90 days'
})
Defining Fields¶
Fields are defined using strings. When used in a Metric
, the field may contain
aggregations. If not aggregation is provided, the entire field string will be wrapped
in a sum()
.
When used in a Dimension
, fields must not contain aggregations. An BadIngredient
exception will be raised if you define a field this way.
Here are some examples of non-aggregated fields that you could use in a Dimension
.
Field |
Description |
---|---|
student:
kind: Dimension
field: student_name
|
Use the column student_name in your selectable. |
student:
kind: Dimension
field: student_name
id_field: student_id
|
Use the column student_name in your selectable as the value for the field and uses the student_id column as the id. |
student:
kind: Dimension
field: 'student_first_name + " " + student_last_name'
id_field: student_id
|
Concatenate the student first and last names as the value for the field and uses the student_id column as the id. |
Here’s an example of some aggregated fields that you could use in metrics
Warning
The following two fields are for internal use.
Key |
Required |
Description |
---|---|---|
ref |
optional |
string Replace this field with the field defined in the specified key in the shelf. |
_use_raw_value |
optional |
boolean Don’t evaluate value as a column, treat it as a constant in the SQL expression. |
Defining Fields with Strings¶
Fields can be defined using strings. When using strings, words are treated as column references. If the words are prefixed with an ‘@’ (like @sales), the field of the ingredient named sales in the shelf will be injected.
Aggregations can be called like functions to apply that aggregation to a column.
Field string |
Description |
---|---|
revenue - expenses |
The sum of column revenue minus the sum of column expenses. field: revenue - expenses
# is the same as
field:
value: revenue
aggregation: sum # this may be omitted because 'sum'
# is the default aggregation for Metrics
operators:
- operator: '-'
field:
value: expenses
aggregation: sum
|
@sales / @student_count |
Find the field definition of the field named ‘sales’ in the shelf. Divide it by the field definition of the field named ‘student_count’. |
count_distinct(student_id) |
Count the distinct values of column student_id. field: count_distinct(student_id)
# is the same as
field:
value: student_id
aggregation: count_distinct
|
Defining Field Operators¶
Operators lets you perform math with fields.
Key |
Required |
Description |
---|---|---|
operator |
required |
string One of ‘+’, ‘-‘, ‘*’, ‘/’ |
field |
required |
A field definition (either a string or a dictionary) |
For instance, operators can be used like this:
# profit - taxes - interest
field:
value: profit
operators:
- operator: '-'
field: taxes
- operator: '-'
field: interest
Defining Conditions¶
Conditions can include a field and operator or a list of conditions and-ed or or-ed together.
field: {field definition}
label: string (an optional string label)
{operator}: {value} or {list of values}
or
or: # a list of conditions
- {condition1}
- {condition2}
...
- {conditionN}
or
and: # a list of conditions
- {condition1}
- {condition2}
...
- {conditionN}
or
a condition reference @{ingredient name from the shelf}.
Conditions consist of a field and exactly one operator.
Condition |
Value is… |
Description |
---|---|---|
gt |
A string, int, or float. |
Find values that are greater than the value For example: # Sales dollars are greater than 100.
condition:
field: sales_dollars
gt: 100
|
gte (or ge) |
A string, int, or float. |
Find values that are greater than or equal to the value |
lt |
A string, int, or float. |
Find values that are less than the value |
lte (or le) |
A string, int, or float. |
Find values that are less than or equal to the value |
eq |
A string, int, or float. |
Find values that are equal to the value |
ne |
A string, int, or float. |
Find values that are not equal to the value |
like |
A string |
Find values that match the SQL LIKE expression For example: # States that start with the capital letter C
condition:
field: state
like: 'C%'
|
ilike |
A string |
Find values that match the SQL ILIKE (case insensitive like) expression. |
between |
A list of two values |
Find values that are between the two values. |
in |
A list of values |
Find values that are in the list of values |
notin |
A list of values |
Find values that are not in the list of values |
ands and ors in conditions¶
Conditions can and
and or
a list of conditions together.
Here’s an example:
# Find states that start with 'C' and end with 'a'
# Note the conditions in the list don't have to
# use the same field.
condition:
and:
- field: state
like: 'C%'
- field: state
like: '%a'
Date conditions¶
If the field
is a date or datetime, absolute and relative dates
can be defined in values using string syntax. Recipe uses the
Dateparser library.
Here’s an example.
# Find sales that occured within the last 90 days.
condition:
field: sales_date
between:
- '90 days ago'
- 'tomorrow'
Labeled conditions¶
Conditions may optionally be labeled by adding a label property.
quickselects are a feature of Dimension that are defined with a list of labeled conditions.
Examples¶
A simple shelf with conditions¶
This shelf is basic.
teens:
kind: Metric
field:
value: pop2000
condition:
field: age
between: [13,19]
state:
kind: Dimension
field: state
Using this shelf in a recipe.
recipe = Recipe(shelf=shelf, session=oven.Session())\
.dimensions('state')\
.metrics('teens')
print(recipe.to_sql())
print(recipe.dataset.csv)
The results look like:
SELECT census.state AS state,
sum(CASE
WHEN (census.age BETWEEN 13 AND 19) THEN census.pop2000
END) AS teens
FROM census
GROUP BY census.state
state,teens,state_id
Alabama,451765,Alabama
Alaska,71655,Alaska
Arizona,516270,Arizona
Arkansas,276069,Arkansas
...
Metrics referencing other metric definitions¶
The following shelf has a Metric pct_teens
that divides one previously defined Metric
teens
by another total_pop
.
teens:
kind: Metric
field:
value: pop2000
condition:
field: age
between: [13,19]
total_pop:
kind: Metric
field: pop2000
pct_teens:
field: '@teens'
divide_by: '@total_pop'
state:
kind: Dimension
field: state
Using this shelf in a recipe.
recipe = Recipe(shelf=shelf, session=oven.Session())\
.dimensions('state')\
.metrics('pct_teens')
print(recipe.to_sql())
print(recipe.dataset.csv)
Here’s the results. Note that recipe performs safe division.
SELECT census.state AS state,
CAST(sum(CASE
WHEN (census.age BETWEEN 13 AND 19) THEN census.pop2000
END) AS FLOAT) / (coalesce(CAST(sum(census.pop2000) AS FLOAT), 0.0) + 1e-09) AS pct_teens
FROM census
GROUP BY census.state
state,pct_teens,state_id
Alabama,0.10178190714599038,Alabama
Alaska,0.11773975168751254,Alaska
Arizona,0.10036487658951877,Arizona
Arkansas,0.10330245760980436,Arkansas
...
Dimensions containing buckets¶
Dimensions may be created by bucketing a field.
total_pop:
kind: Metric
field: pop2000
age_buckets:
kind: Dimension
field: age
buckets:
- label: 'babies'
lt: 2
- label: 'children'
lt: 13
- label: 'teens'
lt: 20
buckets_default_label: 'oldsters'
mixed_buckets:
kind: Dimension
field: age
buckets:
- label: 'northeasterners'
in: ['Vermont', 'New Hampshire']
field: state
- label: 'babies'
lt: 2
- label: 'children'
lt: 13
- label: 'teens'
lt: 20
buckets_default_label: 'oldsters'
Using this shelf in a recipe.
recipe = Recipe(shelf=shelf, session=oven.Session())\
.dimensions('mixed_buckets')\
.metrics('total_pop')\
.order_by('mixed_buckets')
print(recipe.to_sql())
print(recipe.dataset.csv)
Here’s the results. Note this recipe orders by mixed_buckets
. The buckets are
ordered in the order they are defined.
SELECT CASE
WHEN (census.state IN ('Vermont',
'New Hampshire')) THEN 'northeasterners'
WHEN (census.age < 2) THEN 'babies'
WHEN (census.age < 13) THEN 'children'
WHEN (census.age < 20) THEN 'teens'
ELSE 'oldsters'
END AS mixed_buckets,
sum(census.pop2000) AS total_pop
FROM census
GROUP BY CASE
WHEN (census.state IN ('Vermont',
'New Hampshire')) THEN 'northeasterners'
WHEN (census.age < 2) THEN 'babies'
WHEN (census.age < 13) THEN 'children'
WHEN (census.age < 20) THEN 'teens'
ELSE 'oldsters'
END
ORDER BY CASE
WHEN (census.state IN ('Vermont',
'New Hampshire')) THEN 0
WHEN (census.age < 2) THEN 1
WHEN (census.age < 13) THEN 2
WHEN (census.age < 20) THEN 3
ELSE 9999
END
mixed_buckets,total_pop,mixed_buckets_id
northeasterners,1848787,northeasterners
babies,7613225,babies
children,44267889,children
teens,28041679,teens
oldsters,199155741,oldsters