Defining Shelves using configuration¶
Configuration lets you define fields and conditions using natural, SQL-like language.
Note
An older version of defining shelves from config can be found at Defining Shelves from configuration (the old way).
Defining Shelves¶
Shelves are defined in configuration as dictionaries with keys and values that are Ingredient configuration definitions. A simple example (configured in yaml) looks like this.
_version: "2"
total_population:
    kind: Metric
    field: pop2000
state:
    kind: Dimension
    field: state
The _version: "2" key is necessary to trigger the new shelf behavior.
See examples for more Shelf examples.
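As a rough illustration of the structure described above, the same configuration can be written as a Python dictionary. The helper check_shelf_config below is hypothetical and not part of Recipe; it only checks the two things this page states: the _version key must be "2", and every ingredient defines its field as a string.

```python
# The example shelf above, expressed as a plain Python dict.
shelf_config = {
    "_version": "2",
    "total_population": {"kind": "Metric", "field": "pop2000"},
    "state": {"kind": "Dimension", "field": "state"},
}


def check_shelf_config(config):
    """Hypothetical sanity check; not part of Recipe's API."""
    if config.get("_version") != "2":
        raise ValueError('expected _version: "2"')
    names = []
    for name, ingredient in config.items():
        if name.startswith("_"):
            continue  # skip metadata keys such as _version
        if not isinstance(ingredient.get("field"), str):
            raise ValueError(f"{name} must define field as a string")
        names.append(name)
    return sorted(names)


print(check_shelf_config(shelf_config))  # ['state', 'total_population']
```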
Fields¶
The equivalent of the SQLAlchemy expression used in Ingredients defined in Python is field. This is a string that will be parsed into a SQLAlchemy expression using a selectable (a table, recipe or subquery used to fetch data).
Fields are defined using strings.
When used in a Metric, the field may contain aggregations. If no aggregation is provided, the entire field string will be wrapped in a sum().
When used in a Dimension, fields must not contain aggregations. A BadIngredient exception will be raised if you define a field this way.
Here are some examples of non-aggregated fields that you could use in a Dimension.

Use the column student_name in your selectable.
    student:
        kind: Dimension
        field: student_name

Use the column student_name in your selectable as the value for the field and use the student_id column as the id.
    student:
        kind: Dimension
        field: student_name
        id_field: student_id

Concatenate the student first and last names as the value for the field and use the student_id column as the id.
    student:
        kind: Dimension
        field: 'student_first_name + " " + student_last_name'
        id_field: student_id
Here are some examples of aggregated fields that you could use in a Metric.

Count the number of rows in your data.
    count:
        kind: Metric
        field: count(*)

Count the number of distinct student names.
    student_cnt:
        kind: Metric
        field: count_distinct(student_name)

Sum the value in the sales column in your selectable.
    total_sales:
        kind: Metric
        field: sum(sales)

Sum the value in the sales column and subtract the sum of expenses in your selectable.
    profit:
        kind: Metric
        field: sum(sales) - sum(expenses)
Aggregations are written function-style like sum(sales). The following aggregations are available:
sum(<field>)
min(<field>)
max(<field>)
avg(<field>)
count(<field>)
count_distinct(<field>)
month(<field>) (round to the nearest month for dates)
week(<field>) (round to the nearest week for dates)
year(<field>) (round to the nearest year for dates)
quarter(<field>) (round to the nearest quarter for dates)
age(<field>) (calculate age based on a date and the current date)
none(<field>) (perform no aggregation)
median(<field>) (calculate the median value, note: this aggregation is not available on all databases).
percentile<n>(<field>), where <n> is one of 1, 5, 10, 25, 50, 75, 90, 95 or 99, for example percentile50(<field>) (calculate the nth percentile value where higher values correspond to higher percentiles, note: this aggregation is not available on all databases).
Defining if-then logic in fields¶
Fields can contain an if() function which contains one or more conditions. It looks like this.
if(<condition>, <field>, [<condition>, <field>,] [<else_field>])
Here are some examples:

Count alerts if a certain status_code is matched.
    alert_cnt:
        kind: Metric
        field: count_distinct(if(status_code=5, alert_id))

Discount sales based on codes, but sum without a discount when no discount code matches.
    discount_total:
        kind: Metric
        field: sum(if(discount_code=1,sales*0.9,discount_code=2,sales*0.8,sales))

Use the full name when last_name is present, otherwise use first_name alone.
    full_name:
        kind: Dimension
        field: 'if(last_name,first_name + " " + last_name,first_name)'
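To make the argument pairing in if() concrete, here is a minimal sketch that renders the arguments as a SQL CASE string, in the shape of the generated SQL shown in the examples on this page. The function if_to_case is hypothetical and works on plain strings; it is not Recipe's parser.

```python
def if_to_case(*args):
    """Render if(<condition>, <field>, ..., [<else_field>]) as a CASE string."""
    if len(args) < 2:
        raise ValueError("if() needs at least one condition and one field")
    # An odd number of arguments means the last one is the else value.
    pairs, else_value = args, None
    if len(args) % 2 == 1:
        pairs, else_value = args[:-1], args[-1]
    parts = ["CASE"]
    for condition, value in zip(pairs[0::2], pairs[1::2]):
        parts.append(f"WHEN ({condition}) THEN {value}")
    if else_value is not None:
        parts.append(f"ELSE {else_value}")
    parts.append("END")
    return " ".join(parts)


print(if_to_case("discount_code=1", "sales*0.9",
                 "discount_code=2", "sales*0.8", "sales"))
# CASE WHEN (discount_code=1) THEN sales*0.9 WHEN (discount_code=2) THEN sales*0.8 ELSE sales END
```

When no else value is given, the CASE has no ELSE branch, so rows that match no condition yield NULL and are skipped by aggregations such as sum() and count_distinct().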
Conditions¶
Conditions are expressions that evaluate as true or false.
Condition | Description |
---|---|
> | Find values that are greater than the value |
>= | Find values that are greater than or equal to the value |
< | Find values that are less than the value |
<= | Find values that are less than or equal to the value |
= | Find values that are equal to the value |
!= | Find values that are not equal to the value |
between <value> and <value> | Find values that are between the two values |
in (list of <values>) | Find values that are in the list of values |
not in | Find values that are not in the list of values |

For example:
    # Sales dollars are greater than 100.
    condition: sales_dollars>100
    # Last names that sort after "C".
    condition: last_name>"C"
    # Sales dollars are between 100 and 200.
    condition: sales between 100 and 200
    # Sales dates are between two weeks ago and tomorrow.
    condition: 'sales_date between "2 weeks ago" and "tomorrow"'
    # New England states in the USA
    condition: state_abbreviation in ("VT", "NH", "ME", "MA", "CT", "RI")
    # Sales codes other than 1, 5, 7 and 9.
    condition: sales_code not in (1,5,7,9)
Using ands and ors in conditions¶
Multiple conditions can be combined using and and or.
Here’s an example:
# Find sales between 100 and 1000
condition: sales_dollars > 100 and sales_dollars < 1000
You can also use parentheses to clearly express groupings.
# Find sales meeting multiple conditions
condition: (sales_dollars > 100 or sales_date > "1 month ago") and region = "North"
Date conditions¶
If the field is a date or datetime, absolute and relative dates can be defined in values using string syntax. Recipe uses the Dateparser library.
Here’s an example.
# Find sales that occurred within the last 90 days.
condition: 'sales_date between "90 days ago" and "tomorrow"'
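Recipe leaves the parsing of strings such as "90 days ago" to Dateparser. To make the window concrete without that dependency, here is a standard-library sketch of the range the example condition describes. The helper last_n_days_window is hypothetical, and it assumes the bounds are plain dates; how Recipe handles datetimes is not covered here.

```python
from datetime import date, timedelta


def last_n_days_window(n, today=None):
    """Return (start, end) for: between "<n> days ago" and "tomorrow"."""
    today = today or date.today()
    return today - timedelta(days=n), today + timedelta(days=1)


start, end = last_n_days_window(90, today=date(2024, 3, 31))
print(start, end)  # 2024-01-01 2024-04-01
```

Using "tomorrow" as the upper bound keeps all of today's rows inside the range.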
Partial conditions¶
While most conditions have to contain a field, condition and value (like sales_dollars>1000), in some contexts you can define a partial condition that contains just the condition and value (>1000). The field will be automatically prefixed to each partial condition.
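The prefixing rule can be sketched in a few lines. This is an illustration only: the function complete_condition and its regular expression are assumptions made for this example, not the way Recipe detects partial conditions internally.

```python
import re

# A partial condition starts with an operator or keyword instead of a field.
_PARTIAL = re.compile(
    r"^\s*(<=|>=|!=|<|>|=|between\b|in\b|not\s+in\b)", re.IGNORECASE
)


def complete_condition(field, condition):
    """Prefix the field when the condition is partial; else return it as is."""
    condition = condition.strip()
    if not _PARTIAL.match(condition):
        return condition  # already a full condition
    # Symbol operators attach directly; keyword operators need a space.
    separator = "" if condition[0] in "<>=!" else " "
    return f"{field}{separator}{condition}"


print(complete_condition("sales_dollars", ">1000"))
# sales_dollars>1000
print(complete_condition("sales_date", 'between "90 days ago" and "tomorrow"'))
# sales_date between "90 days ago" and "tomorrow"
```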
Extra features¶
Metric fields always apply an aggregation¶
Metrics will always apply a default aggregation of ‘sum’ to any fields used.
sales:
    kind: Metric
    field: sales_dollars
is the same as
sales:
    kind: Metric
    field: sum(sales_dollars)
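The default-aggregation rule for Metrics, and the matching restriction on Dimensions, can be sketched as follows. This is an approximation under stated assumptions: the regular expression only recognises a subset of the aggregation names listed earlier, the helper names are hypothetical, and ValueError stands in for Recipe's BadIngredient exception.

```python
import re

# Assumed subset of the aggregation names listed on this page.
AGGREGATIONS = ("sum", "min", "max", "avg", "count", "count_distinct", "median")
_AGG_CALL = re.compile(r"\b(?:%s)\s*\(" % "|".join(AGGREGATIONS))


def metric_field(field):
    """Wrap the whole field in sum() when it contains no aggregation."""
    if _AGG_CALL.search(field):
        return field
    return f"sum({field})"


def dimension_field(field):
    """Reject aggregations in a Dimension field."""
    if _AGG_CALL.search(field):
        raise ValueError(f"Dimension field must not aggregate: {field}")
    return field


print(metric_field("sales_dollars"))               # sum(sales_dollars)
print(metric_field("sum(sales) - sum(expenses)"))  # sum(sales) - sum(expenses)
```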
Defining extra roles in dimensions¶
Dimensions can contain extra groupings (see Adding additional groupings). In configuration you can define extra roles by creating extra keys that end with _field. For instance, id_field below defines an id role:
student:
    kind: Dimension
    field: 'student_first_name + " " + student_last_name'
    id_field: student_id
Defining bucket dimensions¶
A common need is to group values and treat those groupings as a dimension. For instance, you could group sales as small, medium or large.
Dimension allows you to define a list of labeled conditions that you can use to do exactly this. Let’s look at an example then break it down.
kind: Dimension
field: sales_dollars
buckets:
    - label: Small
      condition: '<1000'
    - label: Medium
      condition: '<20000'
    - label: Large
      condition: '>=20000'
buckets_default_label: Unknown
These conditions can be full or partial conditions (see Partial conditions). In this example the sales_dollars field would be prefixed to all conditions, making it identical to this.
kind: Dimension
field: sales_dollars
buckets:
    - label: Small
      condition: sales_dollars<1000
    - label: Medium
      condition: sales_dollars<20000
    - label: Large
      condition: sales_dollars>=20000
buckets_default_label: Unknown
The buckets_default_label is applied when none of the bucket conditions match (for instance, if sales_dollars was NULL in this example). A bucket Dimension will include an order_by that orders results in the order that the buckets were defined.
Note
Buckets create an if() function to create their groupings.
In our sample bucket code, we could accomplish the same thing with these fields (broken into separate lines for clarity).
kind: Dimension
field: 'if(sales_dollars<1000,"Small",
           sales_dollars<20000,"Medium",
           sales_dollars>=20000,"Large","Unknown")'
order_by_field: 'if(sales_dollars<1000,1,
                    sales_dollars<20000,2,
                    sales_dollars>=20000,3,9999)'
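The expansion from buckets to if() fields can be sketched mechanically. The function buckets_to_fields is hypothetical and only builds strings; the positions 1, 2, 3 and the fallback 9999 follow the sample above, and the numbers Recipe actually generates may differ.

```python
def buckets_to_fields(field, buckets, default_label):
    """Build the if() strings that a bucket Dimension is equivalent to."""
    labels, orders = [], []
    for position, bucket in enumerate(buckets, start=1):
        condition = bucket["condition"]
        # Partial conditions start with an operator, so prefix the field.
        if condition[0] in "<>=!":
            condition = field + condition
        labels.append(f'{condition},"{bucket["label"]}"')
        orders.append(f"{condition},{position}")
    value = "if(" + ",".join(labels) + f',"{default_label}")'
    order_by = "if(" + ",".join(orders) + ",9999)"
    return value, order_by


value, order_by = buckets_to_fields(
    "sales_dollars",
    [
        {"label": "Small", "condition": "<1000"},
        {"label": "Medium", "condition": "<20000"},
        {"label": "Large", "condition": ">=20000"},
    ],
    "Unknown",
)
print(value)
print(order_by)
```

Because if() returns the value for the first condition that matches, the Medium bucket effectively covers 1000 up to 20000 even though its condition is only <20000.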
Adding quickselects to a Dimension¶
quickselects are a way of associating named conditions with a Dimension. Like buckets, quickselects use partial conditions.
region:
    kind: Dimension
    field: sales_region
total_sales:
    kind: Metric
    field: sales_dollars
date:
    kind: Dimension
    field: sales_date
    quickselects:
        - name: 'Last 90 days'
          condition: 'between "90 days ago" and "tomorrow"'
        - name: 'Last 180 days'
          condition: 'between "180 days ago" and "tomorrow"'
These conditions can then be accessed through Ingredient.build_filter. The AutomaticFilters extension is an easy way to use this.
# shelf is assumed to be a Shelf built from the configuration above.
recipe = Recipe(shelf=shelf, session=oven.Session(), extension_classes=[AutomaticFilters]) \
    .dimensions('region') \
    .metrics('total_sales') \
    .automatic_filters({
        'date__quickselect': 'Last 90 days'
    })
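As a rough picture of what selecting a quickselect by name amounts to, the sketch below looks up the named partial condition and prefixes the Dimension's field. The function quickselect_condition is hypothetical; it does not reproduce how Ingredient.build_filter or AutomaticFilters are implemented.

```python
# Configuration for the date Dimension from the shelf above, as a Python dict.
date_config = {
    "kind": "Dimension",
    "field": "sales_date",
    "quickselects": [
        {"name": "Last 90 days",
         "condition": 'between "90 days ago" and "tomorrow"'},
        {"name": "Last 180 days",
         "condition": 'between "180 days ago" and "tomorrow"'},
    ],
}


def quickselect_condition(config, name):
    """Hypothetical lookup: return the full condition for a quickselect."""
    for quickselect in config["quickselects"]:
        if quickselect["name"] == name:
            # Quickselects are partial conditions, so prefix the field.
            return f'{config["field"]} {quickselect["condition"]}'
    raise KeyError(f"no quickselect named {name!r}")


print(quickselect_condition(date_config, "Last 90 days"))
# sales_date between "90 days ago" and "tomorrow"
```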
Examples¶
A simple shelf with conditions¶
This shelf is basic.
_version: "2"
teens:
    kind: Metric
    field: if(age between 13 and 19,pop2000)
state:
    kind: Dimension
    field: state
Using this shelf in a recipe.
recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('state')\
    .metrics('teens')
print(recipe.to_sql())
print(recipe.dataset.csv)
The results look like:
SELECT census.state AS state,
       sum(CASE
               WHEN (census.age BETWEEN 13 AND 19) THEN census.pop2000
           END) AS teens
FROM census
GROUP BY census.state

state,teens,state_id
Alabama,451765,Alabama
Alaska,71655,Alaska
Arizona,516270,Arizona
Arkansas,276069,Arkansas
...
Metrics referencing other metric definitions¶
The following shelf has a Metric pct_teens that divides one previously defined Metric teens by another total_pop.
teens:
    kind: Metric
    field:
        value: pop2000
        condition:
            field: age
            between: [13,19]
total_pop:
    kind: Metric
    field: pop2000
pct_teens:
    field: '@teens'
    divide_by: '@total_pop'
state:
    kind: Dimension
    field: state
Using this shelf in a recipe.
recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('state')\
    .metrics('pct_teens')
print(recipe.to_sql())
print(recipe.dataset.csv)
Here are the results. Note that Recipe performs safe division.
SELECT census.state AS state,
       CAST(sum(CASE
                    WHEN (census.age BETWEEN 13 AND 19) THEN census.pop2000
                END) AS FLOAT) / (coalesce(CAST(sum(census.pop2000) AS FLOAT), 0.0) + 1e-09) AS pct_teens
FROM census
GROUP BY census.state

state,pct_teens,state_id
Alabama,0.10178190714599038,Alabama
Alaska,0.11773975168751254,Alaska
Arizona,0.10036487658951877,Arizona
Arkansas,0.10330245760980436,Arkansas
...
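The safe division in the SQL above can be mirrored in Python to show what it does with a zero or missing denominator. The function safe_divide is a hypothetical stand-in for the generated expression numerator / (coalesce(denominator, 0.0) + 1e-09), with None playing the role of NULL.

```python
def safe_divide(numerator, denominator, epsilon=1e-09):
    """Mirror numerator / (coalesce(denominator, 0.0) + 1e-09)."""
    if numerator is None:
        return None  # a NULL numerator stays NULL
    denominator = 0.0 if denominator is None else float(denominator)
    return float(numerator) / (denominator + epsilon)


print(safe_divide(1, 4))     # very close to 0.25
print(safe_divide(0, 0))     # 0.0 instead of a division-by-zero error
print(safe_divide(5, None))  # a very large number, not an error
```

The small constant avoids a division-by-zero error, at the cost of a very large result when the denominator is zero and the numerator is not.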
Dimensions containing buckets¶
Dimensions may be created by bucketing a field.
total_pop:
    kind: Metric
    field: pop2000
age_buckets:
    kind: Dimension
    field: age
    buckets:
        - label: 'babies'
          lt: 2
        - label: 'children'
          lt: 13
        - label: 'teens'
          lt: 20
    buckets_default_label: 'oldsters'
mixed_buckets:
    kind: Dimension
    field: age
    buckets:
        - label: 'northeasterners'
          in: ['Vermont', 'New Hampshire']
          field: state
        - label: 'babies'
          lt: 2
        - label: 'children'
          lt: 13
        - label: 'teens'
          lt: 20
    buckets_default_label: 'oldsters'
Using this shelf in a recipe.
recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('mixed_buckets')\
    .metrics('total_pop')\
    .order_by('mixed_buckets')
print(recipe.to_sql())
print(recipe.dataset.csv)
Here are the results. Note this recipe orders by mixed_buckets. The buckets are ordered in the order they are defined.
SELECT CASE
           WHEN (census.state IN ('Vermont',
                                  'New Hampshire')) THEN 'northeasterners'
           WHEN (census.age < 2) THEN 'babies'
           WHEN (census.age < 13) THEN 'children'
           WHEN (census.age < 20) THEN 'teens'
           ELSE 'oldsters'
       END AS mixed_buckets,
       sum(census.pop2000) AS total_pop
FROM census
GROUP BY CASE
             WHEN (census.state IN ('Vermont',
                                    'New Hampshire')) THEN 'northeasterners'
             WHEN (census.age < 2) THEN 'babies'
             WHEN (census.age < 13) THEN 'children'
             WHEN (census.age < 20) THEN 'teens'
             ELSE 'oldsters'
         END
ORDER BY CASE
             WHEN (census.state IN ('Vermont',
                                    'New Hampshire')) THEN 0
             WHEN (census.age < 2) THEN 1
             WHEN (census.age < 13) THEN 2
             WHEN (census.age < 20) THEN 3
             ELSE 9999
         END

mixed_buckets,total_pop,mixed_buckets_id
northeasterners,1848787,northeasterners
babies,7613225,babies
children,44267889,children
teens,28041679,teens
oldsters,199155741,oldsters
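The CASE expression in the SQL above checks bucket conditions in the order they were defined, so the first match wins. A plain-Python equivalent makes the consequence visible; mixed_bucket is a hypothetical function written for this illustration, and the rows are made-up inputs rather than census data.

```python
def mixed_bucket(row):
    """Return the first matching label, like the CASE expression above."""
    if row["state"] in ("Vermont", "New Hampshire"):
        return "northeasterners"
    if row["age"] < 2:
        return "babies"
    if row["age"] < 13:
        return "children"
    if row["age"] < 20:
        return "teens"
    return "oldsters"


# A one-year-old in Vermont matches the state bucket before the age bucket.
print(mixed_bucket({"state": "Vermont", "age": 1}))  # northeasterners
print(mixed_bucket({"state": "Ohio", "age": 1}))     # babies
print(mixed_bucket({"state": "Ohio", "age": 45}))    # oldsters
```

This is why the northeasterners total includes residents of every age, and why the other buckets exclude Vermont and New Hampshire.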