Defining Shelves from configuration

Shelves are defined as dictionaries containing keys and ingredient. All the examples below use YAML.

Defining Shelves

Shelves are defined in configuration as dictionaries with keys and values that are Ingredient configuration definitions. A simple example looks like this.

total_population:
  kind: Metric
  field: pop2000
state:
  kind: Dimension
  field: state

See examples for more Shelf examples.

Defining Ingredients

Ingredients are defined using fields (which may contain conditions). Those conditions may reference more fields in turn and so forth.

Metric

Metrics will always apply a default aggregation of ‘sum’ to any fields used.

divide_by is an optional denominator that field will be divided by safely.

Dimension

Metrics will always apply a default aggregation of ‘sum’ to their field.

Adding id and other roles to Dimension

Dimensions can be defined with extra fields. The prefix before _field is the field’s role. The role will be suffixed to each value in the recipe rows. Let’s look at an example.

Each result row will include

  • hospital
  • hospital_id The field defined as id_field
  • hospital_latitude The field defined as latitude_field
  • hospital_longitude The field defined as longitude_field

Defining buckets

Buckets let you group continuous values (like salaries or ages). Here’s an example:

groups:
    kind: Dimension
    field: age
    buckets:
    - label: 'northeasterners'
      field: state
      in: ['Vermont', 'New Hampshire']
    - label: 'babies'
      lt: 2
    - label: 'children'
      lt: 13
    - label: 'teens'
      lt: 20
    buckets_default_label: 'oldsters'

The conditions are evaluated in order. buckets_default_label is used for any values that didn’t match any condition.

For convenience, conditions defined in buckets will use the field from the Dimension unless a different field is defined in the condition. In the example above, the first bucket uses field: state explicitly while all the other conditions use field: age from the Dimension.

If you use order_by a bucket dimension, the order will be the order in which the buckets were defined.

Adding quickfilters to a Dimension

Quickfilters are a way of associating conditions with a dimension.

region:
    kind: Dimension
    field: sales_region
total_sales:
    kind: Metric
    field: sales_dollars
date:
    kind: Dimension
    field: sales_date
    quickfilters:
    - label: 'Last 90 days'
      between:
      - 90 days ago
      - tomorrow
    - label: 'Last 180 days'
      between:
      - 180 days ago
      - tomorrow

These conditions can then be accessed through Ingredient.build_filter. The AutomaticFilters extension is an easy way to use this.

recipe = Recipe(session=oven.Session(), extension_classes=[AutomaticFilters]). \
            .dimensions('region') \
            .metrics('total_sales') \
            .automatic_filters({
              'date__quickfilter': 'Last 90 days'
            })

Defining Fields

Fields can be defined with a short string syntax or a dictionary syntax. The string syntax always is normalized into the dictionary syntax.

Defining Fields with Dicts

Dictionaries provide access to all options when defining a field.

dictionary field options
Key Required Description
value required

string

What column to use.

aggregation optional

string

(default is ‘sum’ for Metric and ‘none’ for Dimension)

What aggregation to use, if any. Possible aggregations are:

  • ‘sum’
  • ‘min’
  • ‘max’
  • ‘avg’
  • ‘count’
  • ‘count_distinct’
  • ‘month’ (round to the nearest month for dates)
  • ‘week’ (round to the nearest week for dates)
  • ‘year’ (round to the nearest year for dates)
  • ‘quarter’ (round to the nearest quarter for dates)
  • ‘age’ (calculate age based on a date and the current date)
  • ‘none’ (perform no aggregation)
condition optional

A condition

Condition will limit what rows of data are aggregated for a field.

operators optional

A list of operator

Operators are fields combined with a math operator to the base field.

default optional

An integer, string, float, or boolean value (optional)

A value to use if the column is NULL.

Warning

The following two fields are for internal use.

internal dictionary field options
Key Required Description
ref optional

string

Replace this field with the field defined in the specified key in the shelf.

_use_raw_value optional

boolean

Don’t evaluate value as a column, treat it as a constant in the SQL expression.

Defining Fields with Strings

Fields can be defined using strings. When using strings, words are treated as column references. If the words are prefixed with an ‘@’ (like @sales), the field of the ingredient named sales in the shelf will be injected.

Aggregations can be called like functions to apply that aggregation to a column.

string field examples
Field string Description
revenue - expenses The sum of column revenue minus the sum of column expenses.
@sales / @student_count

Find the field definition of the field named ‘sales’ in the shelf.

Divide it by the field definition of the field named ‘student_count’.

count_distinct(student_id) Count the distinct values of column student_id.

Defining Field Operators

Operators lets you perform math with fields.

operator options
Key Required Description
operator required

string

One of ‘+’, ‘-‘, ‘*’, ‘/’

field required A field definition (either a string or a dictionary)

For instance, operators can be used like this:

# profit - taxes - interest
field:
  value: profit
  operators:
  - operator: '-'
    field: taxes
  - operator: '-'
    field: interest

Defining Conditions

Conditions can include a field and operator or a list of conditions and-ed or or-ed together.

Conditions consist of a field and exactly one operator.

condition options
Condition Value is… Description
gt A string, int, or float.

Find values that are greater than the value

For example:

gte (or ge) A string, int, or float. Find values that are greater than or equal to the value
lt A string, int, or float. Find values that are less than the value
lte (or le) A string, int, or float. Find values that are less than or equal to the value
eq A string, int, or float. Find values that are equal to the value
ne A string, int, or float. Find values that are not equal to the value
like A string

Find values that match the SQL LIKE expression

For example:

ilike A string Find values that match the SQL ILIKE (case insensitive like) expression.
between A list of two values Find values that are between the two values.
in A list of values Find values that are in the list of values
notin A list of values Find values that are not in the list of values

ands and ors in conditions

Conditions can and and or a list of conditions together.

Here’s an example:

# Find states that start with 'C' and end with 'a'
# Note the conditions in the list don't have to
# use the same field.
condition:
  and:
  - condition:
      field: state
      like: 'C%'
  - condition:
      field: state
      like: '%a'

Date conditions

If the field is a date or datetime, absolute and relative dates can be defined in values using string syntax. Recipe uses the Dateparser library.

Here’s an example.

# Find sales that occured within the last 90 days.
condition:
  field: sales_date
  between:
  - '90 days ago'
  - 'tomorrow'

Labeled conditions

Conditions may optionally be labeled by adding a label property.

Quickfilters are a feature of Dimension that are defined with a list of labeled conditions.

Examples

A simple shelf with conditions

This shelf is basic.

teens:
    kind: Metric
    field:
        value: pop2000
        condition:
            field: age
            between: [13,19]
state:
    kind: Dimension
    field: state

Using this shelf in a recipe.

recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('state')\
    .metrics('teens')
print(recipe.to_sql())
print(recipe.dataset.csv)

The results look like:

Metrics referencing other metric definitions

The following shelf has a Metric pct_teens that divides one previously defined Metric teens by another total_pop.

teens:
    kind: Metric
    field:
        value: pop2000
        condition:
            field: age
            between: [13,19]
total_pop:
    kind: Metric
    field: pop2000
pct_teens:
    field: '@teens'
    divide_by: '@total_pop'
state:
    kind: Dimension
    field: state

Using this shelf in a recipe.

recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('state')\
    .metrics('pct_teens')
print(recipe.to_sql())
print(recipe.dataset.csv)

Here’s the results. Note that recipe performs safe division.

Dimensions containing buckets

Dimensions may be created by bucketing a field.

total_pop:
    kind: Metric
    field: pop2000
age_buckets:
    kind: Dimension
    field: age
    buckets:
    - label: 'babies'
      lt: 2
    - label: 'children'
      lt: 13
    - label: 'teens'
      lt: 20
    buckets_default_label: 'oldsters'
mixed_buckets:
    kind: Dimension
    field: age
    buckets:
    - label: 'northeasterners'
      in: ['Vermont', 'New Hampshire']
      field: state
    - label: 'babies'
      lt: 2
    - label: 'children'
      lt: 13
    - label: 'teens'
      lt: 20
    buckets_default_label: 'oldsters'

Using this shelf in a recipe.

recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('mixed_buckets')\
    .metrics('total_pop')\
    .order_by('mixed_buckets')
print(recipe.to_sql())
print(recipe.dataset.csv)

Here’s the results. Note this recipe orders by mixed_buckets. The buckets are ordered in the order they are defined.