Ingredients¶
Ingredients are the building block of recipe.
Ingredients can contain columns that are part of the SELECT
portion of a query,
filters that are part of a WHERE
clause of a query, group_bys that
contribute to a query’s GROUP BY
and havings which add HAVING
limits
ot a query.
Creating ingredients in python¶
Ingredients can be created either in python or via configuration. To created Ingredients in python, use one of the four convenience classes.
- Metric: Create an aggregated calculation using a column. This value appears only in the SELECT part of the SQL statement.
- Dimension: Create a non-aggregated value using a column. This value appears in the SELECT and GROUP BY parts of the SQL statement.
- Filter: Create a boolean expression. This value appears in the
WHERE part of the SQL statement. Filters can be created automatically
using the AutomaticFilters extension or by using a Dimension or Metric’sales
build_filter
method. - Having: Create a boolean expression with an aggregated ColumnElement. This value appears in the HAVING part of the SQL statement.
Metrics and Dimensions are commonly reused in working Recipe code, while filters are often created temporarily based on data.
Features of ingredients¶
Let’s explore some capabilities.
Formatters¶
Formatters are a list of python callables that take a single value. This
let you manipulate the results of an ingredient with python code. If you use
formatters, the original, unmodified value is available as {ingredient}_raw
.
shelf = Shelf({
'state': Dimension(Census.state),
'age': WtdAvgMetric(Census.age, Census.pop2000),
'gender': Dimension(Census.gender),
'population': Metric(func.sum(Census.pop2000), formatters=[
lambda value: int(round(value, -6) / 1000000)
])
})
recipe = Recipe(shelf=shelf, session=oven.Session())\
.dimensions('gender').metrics('population')
for row in recipe.all():
print('{} has {} people'.format(row.gender, row.population))
print('\tThe original value is: {}'.format(row.population_raw))
The results look like
Building filters¶
Ingredient.build_filter
Storing extra attributes in meta¶
Extra keyword arguments that get passed to ingredient initialization
get stored in the meta
object. This can be used to extend the
capabilities of ingredients and add extra features.
d = Dimension(Census.age, icon='cog')
print(d.meta.icon)
>>> 'cog'
Types of Ingredients¶
List of ingredients
Dimension¶
Dimensions are groupings that exist in your data. Dimension objects add the column to the select statement and the group by of the SQL query.
# A simple dimension
self.shelf['state'] = Dimension(Census.state)
Adding an id¶
Dimensions can use separate columns for ids and values. Consider a
table of employees with an employee_id
and a full_name
. If you had
two employees with the same name you need to be able to distinguish between
them.
# Support an id and a label
self.shelf['employee']: Dimension(Employee.full_name,
id_expression=Employee.id)
The id is accessible as employee_id
in each row and their full name is
available as employee
.
Using lookups¶
Lookup maps values in your data to descriptive names. The _id
property of your dimension contains the original value.
# Convert M/F into Male/Female
self.shelf['gender']: Dimension(Census.sex, lookup={'M': 'Male',
'F': 'Female'}, lookup_default='Unknown')
If you use the gender dimension, there will be a gender_id
in each row
that will be “M” or “F” and a gender
in each row that will be “Male” or
“Female”.
shelf = Shelf({
'state': Dimension(Census.state),
'gender_desc': Dimension(Census.gender, lookup={'M': 'Male',
'F': 'Female'}, lookup_default='Unknown'),
'age': WtdAvgMetric(Census.age, Census.pop2000),
'population': Metric(func.sum(Census.pop2000))
})
recipe = Recipe(shelf=shelf, session=oven.Session())\
.dimensions('gender_desc').metrics('population')
print(recipe.to_sql())
print(recipe.dataset.csv)
Lookups inject a formatter in the first position. Because a formatter
is used, recipe creates a gender_desc_raw
on the response that
contains the unformatted value then uses the lookup to create the gender_desc
property. All dimensions also generate an {ingredient}_id
property.
Here is the query and the results.
SELECT census.gender AS gender_desc_raw,
sum(census.pop2000) AS population
FROM census
GROUP BY census.gender
gender_desc_raw,population,gender_desc,gender_desc_id
F,143534804,Female,F
M,137392517,Male,M
Metric¶
Metrics are aggregations performed on your data. Here’s an example of a few Metrics.
shelf = Shelf({
'total_population': Metric(func.sum(Census.pop2000)),
'min_population': Metric(func.min(Census.pop2000)),
'max_population': Metric(func.max(Census.pop2000))
})
recipe = Recipe(shelf=shelf, session=oven.Session())\
.metrics('total_population', 'min_population', 'max_population')
print(recipe.to_sql())
print(recipe.dataset.csv)
The results of this recipe are:
DivideMetric¶
Division in SQL introduces the possibility of division by zero. DivideMetric guards against division by zero while giving you a quick way to divide one calculation by another.
shelf = Shelf({
'state': Dimension(Census.state),
'popgrowth': DivideMetric(func.sum(Census.pop2008-Census.pop2000), func.sum(Census.pop2000)),
})
recipe = Recipe(shelf=shelf, session=oven.Session())\
.dimensions('state').metrics('popgrowth')
This creates results like:
The denominator has a tiny value added to it to prevent division by zero.
WtdAvgMetric¶
WtdAvgMetric
generates a weighted average of a number using a weighting.
Warning
WtdAvgMetric
takes two ColumnElements as arguments. The first is the value
and the second is the weighting. Unlike other Metrics, these are not aggregated.
Here’s an example.
shelf = Shelf({
'state': Dimension(Census.state),
'avgage': WtdAvgMetric(Census.age, Census.pop2000),
})
recipe = Recipe(shelf=shelf, session=oven.Session())\
.dimensions('state').metrics('avgage')
print(recipe.to_sql())
print(recipe.dataset.csv)
This generates results that look like this:
Note: WtdAvgMetric uses safe division from DivideMetric
.
Filter¶
Filter objects add a condition to the where clause of your SQL query. Filter objects can be added to a Shelf.
shelf = Shelf({
'state': Dimension(Census.state),
'population': Metric(func.sum(Census.pop2000)),
'teens': Filter(Census.age.between(13,19)),
})
recipe = Recipe(shelf=shelf, session=oven.Session())\
.dimensions('state')\
.metrics('population')\
.filters('teens')
print(recipe.to_sql())
print(recipe.dataset.csv)
This results in output like:
Different ways of generating Filters¶
Recipe has several ways of filtering recipes.
Filter objects can be added to the shelf. They can be added to the recipe by name from a shelf. This is best when you have a filter that you want to use in many place.
shelf = Shelf({ 'age': Dimension(Census.age), 'state': Dimension(Census.state), 'population': Metric(func.sum(Census.pop2000)), 'teens': Filter(Census.age.between(13,19)), }) ... recipe = recipe.filters('teens')
Filter objects can be created dynamically and added to the recipe. This is best if the filtering needs to change dynamically.
recipe = recipe.filters(Filter(Census.age.between(13,19))
Ingredient.build_filter can be used to build filters that refer to the ingredient’s column.
age_filter = shelf['age'].build_filter([13,19], 'between') recipe = recipe.filters(age_filter)
This is best when you want to reuse a column definition defined in an ingredient.
AutomaticFilters: The AutomaticFilters extension adds filtering syntax directly to recipe.
recipe = recipe.automatic_filters({ 'age__between': [13,19] })
This is best when you want to add many filters consistently. AutomaticFilters uses
Ingredient.build_filter
behind the scenes.
Having¶
Having objects are binary expressions with an aggregated column value.
One easy way to generate Having
objects is to build_filter
using
a Metric
.
shelf = Shelf({
'age': Dimension(Census.age),
'avgage': WtdAvgMetric(Census.age, Census.pop2000),
'state': Dimension(Census.state),
'population': Metric(func.sum(Census.pop2000)),
})
# Find states with a population greater than 15 million
big_states = shelf['population'].build_filter(15000000, operator='gt')
recipe = Recipe(shelf=shelf, session=oven.Session())\
.dimensions('state')\
.metrics('population')\
.order_by('-population')\
.filters(big_states)
print(recipe.to_sql())
print(recipe.dataset.csv)
This generates the following results.