Defining Shelves with Field Expressions¶
Shelves can be created using dictionaries containing keys and ingredient definitions. The shelf configuration can then be bound to a SQLAlchemy selectable. The best way of doing this is to use the field expression syntax. All the examples below use YAML to define a Python dictionary.
Defining Shelves using Field Expressions¶
A simple example looks like this.
total_population:
    kind: Metric
    field: sum(pop2000)
state:
    kind: Dimension
    field: state
See the Examples section at the end of this page for more Shelf examples.
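Since the YAML is only a way of writing a Python dictionary, it may help to see the same configuration as a dict literal. This is a minimal sketch; the variable name `shelf_config` is an assumption made for illustration and is not part of Recipe.

```python
# The YAML shelf above, written directly as a Python dictionary.
# `shelf_config` is a hypothetical name chosen for this sketch.
shelf_config = {
    "total_population": {"kind": "Metric", "field": "sum(pop2000)"},
    "state": {"kind": "Dimension", "field": "state"},
}

# Each top-level key is an ingredient id; each value is its definition.
for ingredient_id, definition in shelf_config.items():
    print(ingredient_id, definition["kind"], definition["field"])
```

Parsing the YAML text with a YAML loader should produce an equivalent dictionary.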
Defining Ingredients¶
Ingredients are defined using fields written in expression syntax.
Defining Metrics¶
Metrics apply a default aggregation of ‘sum’ to any field that does not already use an aggregation function. “Measure” can be used as a synonym for “Metric”.
kind: Metric
field: {field}
The field expression can use functions to perform aggregation. If no function is provided then the field will be summed by default.
Math and functions on fields¶
Fields can be added together and be wrapped in functions.
Type | Function | Description |
---|---|---|
field | {field}+{field} | Add two fields together |
field | {field}-{field} | Subtract a field from a field |
field | {field}*{field} | Multiply two fields |
field | {field}/{field} | Divide fields. Note: this is a SQL-safe division. Division by zero returns null. |
field | sum({field}) | Sum up the values of {field} |
field | min({field}) | Calculate the minimum value of {field} |
field | max({field}) | Calculate the maximum value of {field} |
field | avg({field}) | Calculate the average value of {field} |
field | median({field}) | Calculate the median value of {field}. Note: this aggregation is not available on all databases. |
field | percentile<n>({field}) | Calculate the nth percentile value of {field}, where <n> is one of 1, 5, 10, 25, 50, 75, 90, 95, 99. Note: this aggregation is not available on all databases. |
field | count({field}) | Count the number of values |
field | count_distinct({field}) | Count the number of distinct values of {field} |
field | month({date_field}) | Round to the nearest month for dates |
field | week({date_field}) | Round to the nearest week for dates |
field | year({date_field}) | Round to the nearest year for dates |
field | quarter({date_field}) | Round to the nearest quarter for dates |
field | age({date_field}) | Calculate current age in years for a date. |
These functions and math can be combined. Division will be performed safely to ensure that division by zero is not performed. Here’s an example:
avg_profit_per_facility:
    kind: Metric
    field: sum(sales - expenses) / count(facilities)
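To make the arithmetic concrete, here is a minimal plain-Python sketch of what this metric computes over a handful of rows. The helper names `safe_divide` and `avg_profit_per_facility` and the sample rows are assumptions made for illustration; this is not Recipe's implementation, which generates SQL instead.

```python
def safe_divide(numerator, denominator):
    """Return None (SQL null) instead of raising when the denominator is zero."""
    if not denominator:
        return None
    return numerator / denominator


def avg_profit_per_facility(rows):
    """Mimic sum(sales - expenses) / count(facilities) over a list of dicts."""
    total_profit = sum(row["sales"] - row["expenses"] for row in rows)
    # SQL count(column) counts non-null values only.
    facility_count = sum(1 for row in rows if row["facilities"] is not None)
    return safe_divide(total_profit, facility_count)


rows = [
    {"sales": 100.0, "expenses": 60.0, "facilities": "A"},
    {"sales": 80.0, "expenses": 50.0, "facilities": "B"},
]
print(avg_profit_per_facility(rows))  # 35.0
print(avg_profit_per_facility([]))    # None
```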
Defining constant values and lists of values¶
Values are numbers, strings or dates that can be used anywhere a field is.
Type | Examples | Description |
---|---|---|
Strings | "STRING"; "This is a string" | Strings are defined by double quoting. |
Numbers | 1; 1.525; 1234890 | Numbers can be integers or floating point values. |
Dates and times | "2016-02-20"; "December 2019"; "5 days ago" | Recipe uses dateparser to evaluate dates. Both absolute dates and relative dates can be defined. |
Lists of values | (1, 2, 3, 4, 5); (3.14, 2.72); ("apple", "peach") | Lists of values are comma separated within parentheses. All values should be the same type, but Recipe does not validate this. |
Values can be used in field math. Here are some examples:
avg_population:
    kind: Metric
    field: sum(population_in_2010 + population_in_2020) / 2.0
tax_paid:
    kind: Metric
    field: sum(sales)*0.0725
Defining true and false conditions¶
Conditions can be used to calculate true or false values.
Type | Function | Description |
---|---|---|
condition | {field} = {field}\|{value} | Is a field equal to a field or a value |
condition | {field} != {field}\|{value} | Is a field not equal to a field or a value |
condition | {field} > {field}\|{value} | Is a field greater than a field or a value |
condition | {field} >= {field}\|{value} | Is a field greater than or equal to a field or a value |
condition | {field} < {field}\|{value} | Is a field less than a field or a value |
condition | {field} <= {field}\|{value} | Is a field less than or equal to a field or a value |
condition | {field} IN ({list}) | Is a field in a comma-separated list of fields or values. |
condition | {field} NOT IN ({list}) | Is a field not in a comma-separated list of fields or values. |
condition | {field} BETWEEN {value} AND {value} | Is a field between two values. |
condition | {condition} AND {condition} | Are both conditions true. |
condition | {condition} OR {condition} | Is either condition true. |
Using conditions and fields with the IF function¶
The IF function lets you combine conditions.
if({condition}, {field}, {else_field})
If the condition is true, use {field}; otherwise use {else_field}.
More than one condition and field pair can be provided.
if({condition1}, {field1}, {condition2}, {field2}, {else_field})
Let’s look at an example. Here is how to sum up sales_dollars in the last week.
sales_in_last_week:
    kind: Metric
    field: sum(if(sales_date>"7 days ago",sales_dollars,0.0))
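The generated SQL shown in the Examples section suggests that conditions become CASE WHEN expressions. The sketch below runs a comparable query against an in-memory SQLite table to show the effect of the IF inside the sum. The table, the sample rows and the fixed `cutoff` date (standing in for "7 days ago") are assumptions made for illustration, and the SQL is hand-written rather than produced by Recipe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sales_date TEXT, sales_dollars REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01-01", 100.0), ("2024-01-09", 40.0), ("2024-01-10", 60.0)],
)

# Fixed stand-in for the relative date "7 days ago".
cutoff = "2024-01-03"

# Rows on or before the cutoff contribute 0.0; later rows contribute their sales.
(sales_in_last_week,) = conn.execute(
    "SELECT sum(CASE WHEN sales_date > ? THEN sales_dollars ELSE 0.0 END) FROM sales",
    (cutoff,),
).fetchone()
print(sales_in_last_week)  # 100.0
```

ISO-formatted date strings sort chronologically, which is why a plain text comparison is enough in this sketch.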
Metrics must aggregate¶
Metrics must define an aggregated field. If a Metric definition does not include an aggregation function, it will be wrapped in a sum().
Defining Dimensions¶
Dimensions are simple to define but include a number of optional features.
kind: Dimension
field: {field}
{role}_field: {field} (optional)
buckets: A list of labeled conditions (optional)
buckets_default_label: string (optional)
quickselects: A list of labeled conditions (optional)
Defining simple dimensions¶
Dimensions can use fields, expressions, conditions and even the IF function as long as they do not use aggregation functions. Here are some examples.
hospital:
    kind: Dimension
    field: hospital_name
student:
    kind: Dimension
    field: student_last_name
student_full_name:
    kind: Dimension
    field: student_first_name + " " + student_last_name
new_york_hospitals:
    kind: Dimension
    field: IF(state="New York",hospital_name,"Other")
Adding id and other roles to a Dimension¶
Dimensions can be defined with extra fields. The prefix before _field is the field’s role. The role is added as a suffix to the dimension’s key in each result row. Let’s look at an example.
hospital:
    field: hospital_name
    id_field: hospital_id
    latitude_field: hospital_lat
    longitude_field: hospital_lng
Each result row will include:
- hospital
- hospital_id: the field defined as id_field
- hospital_latitude: the field defined as latitude_field
- hospital_longitude: the field defined as longitude_field
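The naming rule can be sketched in a few lines of Python. The helper `result_keys` is hypothetical and only illustrates how the row keys listed above follow from the `{role}_field` entries; it is not a Recipe function.

```python
def result_keys(dimension_id, definition):
    """List the row keys implied by a dimension definition and its role fields."""
    keys = [dimension_id]
    for key in definition:
        # "id_field" has role "id", "latitude_field" has role "latitude", etc.
        if key.endswith("_field"):
            role = key[: -len("_field")]
            keys.append(f"{dimension_id}_{role}")
    return keys


hospital = {
    "field": "hospital_name",
    "id_field": "hospital_id",
    "latitude_field": "hospital_lat",
    "longitude_field": "hospital_lng",
}
print(result_keys("hospital", hospital))
# ['hospital', 'hospital_id', 'hospital_latitude', 'hospital_longitude']
```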
Defining buckets¶
Buckets let you group continuous values (like salaries or ages) into a dimension. Here’s an example:
groups:
    kind: Dimension
    field: age
    buckets:
        - label: 'northeasterners'
          field: state
          in: ['Vermont', 'New Hampshire']
        - label: 'babies'
          lt: 2
        - label: 'children'
          lt: 13
        - label: 'teens'
          lt: 20
    buckets_default_label: 'oldsters'
The conditions are evaluated in order. buckets_default_label is used for any values that didn’t match any condition.
For convenience, conditions defined in buckets will use the field from the Dimension unless a different field is defined in the condition. In the example above, the first bucket uses field: state explicitly while all the other conditions use field: age from the Dimension.
If you order_by a bucket dimension, the rows are sorted in the order in which the buckets were defined.
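The first-match rule and the ordering can be sketched in plain Python. The function `bucket_for` and the `OPERATORS` table are assumptions made for illustration and cover only the operators used in the example; Recipe itself builds a SQL CASE expression. The sort key 9999 for the default label mirrors the value that appears in the generated SQL in the Examples section.

```python
import operator

# Only the operators used in the example above are covered in this sketch.
OPERATORS = {
    "lt": operator.lt,
    "gt": operator.gt,
    "in": lambda value, options: value in options,
}

BUCKETS = [
    {"label": "northeasterners", "field": "state", "in": ["Vermont", "New Hampshire"]},
    {"label": "babies", "lt": 2},
    {"label": "children", "lt": 13},
    {"label": "teens", "lt": 20},
]


def bucket_for(row, buckets, default_field="age", default_label="oldsters"):
    """Return (sort_order, label) for the first bucket whose condition matches."""
    for order, bucket in enumerate(buckets):
        # A bucket falls back to the Dimension's field unless it names its own.
        field = bucket.get("field", default_field)
        for name, test in OPERATORS.items():
            if name in bucket and test(row[field], bucket[name]):
                return order, bucket["label"]
    return 9999, default_label


print(bucket_for({"state": "Vermont", "age": 40}, BUCKETS))  # (0, 'northeasterners')
print(bucket_for({"state": "Ohio", "age": 15}, BUCKETS))     # (3, 'teens')
print(bucket_for({"state": "Ohio", "age": 40}, BUCKETS))     # (9999, 'oldsters')
```

Because evaluation stops at the first match, a 40-year-old in Vermont is labeled 'northeasterners' even though no age bucket applies.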
Adding quickselects to a Dimension¶
quickselects are a way of associating conditions with a dimension.
region:
    kind: Dimension
    field: sales_region
total_sales:
    kind: Metric
    field: sales_dollars
date:
    kind: Dimension
    field: sales_date
    quickselects:
        - label: 'Last 90 days'
          between:
              - 90 days ago
              - tomorrow
        - label: 'Last 180 days'
          between:
              - 180 days ago
              - tomorrow
These conditions can then be accessed through Ingredient.build_filter.
The AutomaticFilters extension is an easy way to use this.
recipe = Recipe(shelf=shelf, session=oven.Session(), extension_classes=[AutomaticFilters]) \
    .dimensions('region') \
    .metrics('total_sales') \
    .automatic_filters({
        'date__quickselect': 'Last 90 days'
    })
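As a rough mental model, the filter key names a dimension and the value names one of its quickselect labels. The sketch below resolves that pair to the stored condition using plain dictionaries. The helper `resolve_quickselect` is hypothetical and does not reflect how AutomaticFilters is implemented.

```python
def resolve_quickselect(shelf_config, filter_key, label):
    """Return the condition stored under a quickselect label, or None if absent."""
    # 'date__quickselect' splits into the dimension id and the operator name.
    dimension_id, _, operator_name = filter_key.partition("__")
    if operator_name != "quickselect":
        return None
    for quickselect in shelf_config.get(dimension_id, {}).get("quickselects", []):
        if quickselect["label"] == label:
            # Everything except the label is the condition itself.
            return {k: v for k, v in quickselect.items() if k != "label"}
    return None


shelf_config = {
    "date": {
        "kind": "Dimension",
        "field": "sales_date",
        "quickselects": [
            {"label": "Last 90 days", "between": ["90 days ago", "tomorrow"]},
            {"label": "Last 180 days", "between": ["180 days ago", "tomorrow"]},
        ],
    },
}
condition = resolve_quickselect(shelf_config, "date__quickselect", "Last 90 days")
print(condition)  # {'between': ['90 days ago', 'tomorrow']}
```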
Examples¶
A simple shelf with conditions¶
This shelf is basic.
teens:
    kind: Metric
    field:
        value: pop2000
        condition:
            field: age
            between: [13,19]
state:
    kind: Dimension
    field: state
Using this shelf in a recipe.
recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('state')\
    .metrics('teens')
print(recipe.to_sql())
print(recipe.dataset.csv)
The results look like:
SELECT census.state AS state,
       sum(CASE
               WHEN (census.age BETWEEN 13 AND 19) THEN census.pop2000
           END) AS teens
FROM census
GROUP BY census.state

state,teens,state_id
Alabama,451765,Alabama
Alaska,71655,Alaska
Arizona,516270,Arizona
Arkansas,276069,Arkansas
...
Metrics referencing other metric definitions¶
The following shelf has a Metric pct_teens that divides one previously defined Metric teens by another total_pop.
teens:
    kind: Metric
    field:
        value: pop2000
        condition:
            field: age
            between: [13,19]
total_pop:
    kind: Metric
    field: pop2000
pct_teens:
    field: '@teens'
    divide_by: '@total_pop'
state:
    kind: Dimension
    field: state
Using this shelf in a recipe.
recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('state')\
    .metrics('pct_teens')
print(recipe.to_sql())
print(recipe.dataset.csv)
Here are the results. Note that Recipe performs safe division.
SELECT census.state AS state,
       CAST(sum(CASE
                    WHEN (census.age BETWEEN 13 AND 19) THEN census.pop2000
                END) AS FLOAT) / (coalesce(CAST(sum(census.pop2000) AS FLOAT), 0.0) + 1e-09) AS pct_teens
FROM census
GROUP BY census.state

state,pct_teens,state_id
Alabama,0.10178190714599038,Alabama
Alaska,0.11773975168751254,Alaska
Arizona,0.10036487658951877,Arizona
Arkansas,0.10330245760980436,Arkansas
...
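The `+ 1e-09` term in the generated SQL above keeps the denominator from ever being exactly zero. The arithmetic can be sketched in Python as follows. The function name `epsilon_divide` is an assumption made for illustration, and the sketch mirrors only the shape of the SQL expression, not Recipe's code.

```python
def epsilon_divide(numerator, denominator, epsilon=1e-09):
    """Divide after coalescing a missing denominator to 0.0 and adding epsilon."""
    safe_denominator = (0.0 if denominator is None else float(denominator)) + epsilon
    return float(numerator) / safe_denominator


# A normal ratio is changed only negligibly by the epsilon term.
print(epsilon_divide(1, 4))

# A zero numerator over a zero denominator gives 0.0 instead of raising an error.
print(epsilon_divide(0, 0))  # 0.0
```

With this form, a non-zero numerator over a zero denominator yields a very large number rather than a null, so results for such rows should be read with care.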
Dimensions containing buckets¶
Dimensions may be created by bucketing a field.
total_pop:
    kind: Metric
    field: pop2000
age_buckets:
    kind: Dimension
    field: age
    buckets:
        - label: 'babies'
          lt: 2
        - label: 'children'
          lt: 13
        - label: 'teens'
          lt: 20
    buckets_default_label: 'oldsters'
mixed_buckets:
    kind: Dimension
    field: age
    buckets:
        - label: 'northeasterners'
          in: ['Vermont', 'New Hampshire']
          field: state
        - label: 'babies'
          lt: 2
        - label: 'children'
          lt: 13
        - label: 'teens'
          lt: 20
    buckets_default_label: 'oldsters'
Using this shelf in a recipe.
recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('mixed_buckets')\
    .metrics('total_pop')\
    .order_by('mixed_buckets')
print(recipe.to_sql())
print(recipe.dataset.csv)
Here are the results. Note that this recipe orders by mixed_buckets. The buckets are ordered in the order they are defined.
SELECT CASE
           WHEN (census.state IN ('Vermont',
                                  'New Hampshire')) THEN 'northeasterners'
           WHEN (census.age < 2) THEN 'babies'
           WHEN (census.age < 13) THEN 'children'
           WHEN (census.age < 20) THEN 'teens'
           ELSE 'oldsters'
       END AS mixed_buckets,
       sum(census.pop2000) AS total_pop
FROM census
GROUP BY CASE
             WHEN (census.state IN ('Vermont',
                                    'New Hampshire')) THEN 'northeasterners'
             WHEN (census.age < 2) THEN 'babies'
             WHEN (census.age < 13) THEN 'children'
             WHEN (census.age < 20) THEN 'teens'
             ELSE 'oldsters'
         END
ORDER BY CASE
             WHEN (census.state IN ('Vermont',
                                    'New Hampshire')) THEN 0
             WHEN (census.age < 2) THEN 1
             WHEN (census.age < 13) THEN 2
             WHEN (census.age < 20) THEN 3
             ELSE 9999
         END

mixed_buckets,total_pop,mixed_buckets_id
northeasterners,1848787,northeasterners
babies,7613225,babies
children,44267889,children
teens,28041679,teens
oldsters,199155741,oldsters