Three approaches

LLMs and data
Author

Cody Peterson

Published

October 13, 2023

Introduction

The thought of using natural language to transform and analyze data is appealing. This post assumes familiarity with Marvin and Ibis – read the previous post in the series for a quick overview.

Approaches

In discussions at Voltron Data, we identified three distinct approaches to applying LLMs to data analytics that can be implemented today:

  1. LLM writes analytic code
  2. LLM writes an analytic subroutine
  3. Use LLM in an analytic subroutine

While these three approaches are not an exhaustive list of how LLMs can be applied to data, they are easy to understand and can be implemented with Ibis and Marvin in a few lines of code. With these two open-source tools, we can build a natural language interface for data analytics that supports 18+ backends.

But first, let’s demonstrate the three approaches.

Approach 1: LLM writes analytic code

State-of-the-art (SoTA) LLMs are decent at generating SQL out of the box. We could add handling for errors, retries, and more, but in its simplest form:

Code
import ibis
import marvin

from rich import print
from time import sleep
from dotenv import load_dotenv

load_dotenv()

con = ibis.connect("duckdb://penguins.ddb")
t = ibis.examples.penguins.fetch()
t = con.create_table("penguins", t.to_pyarrow(), overwrite=True)
  1. Import the libraries we need.
  2. Load the environment variables to set up Marvin to call our OpenAI account.
  3. Set up the demo data in an Ibis backend.
import ibis
import marvin

from ibis.expr.schema import Schema
from ibis.expr.types.relations import Table


ibis.options.interactive = True
marvin.settings.llm_model = "openai/gpt-4"
  1. Import Ibis and Marvin.
  2. Configure Ibis and Marvin.
@marvin.ai_fn
def _generate_sql_select(
    text: str, table_name: str, table_schema: Schema
) -> str:
    """Generate SQL SELECT from text."""


def sql_from_text(text: str, t: Table) -> Table:
    """Run SQL from text."""
    return t.sql(_generate_sql_select(text, t.get_name(), t.schema()).strip(";"))
  1. A non-deterministic, LLM-powered AI function.
  2. A deterministic, human-authored function that calls the AI function.
t2 = sql_from_text("the unique combination of species and islands", t)
t2
┏━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ species   ┃ island    ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━┩
│ string    │ string    │
├───────────┼───────────┤
│ Adelie    │ Torgersen │
│ Adelie    │ Biscoe    │
│ Adelie    │ Dream     │
│ Gentoo    │ Biscoe    │
│ Chinstrap │ Dream     │
└───────────┴───────────┘
Code
sleep(3)
  1. Avoid rate-limiting by waiting.
t3 = sql_from_text(
    "the unique combination of species and islands, with their counts, ordered from highest to lowest, and name that column just 'count'",
    t,
)
t3
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species   ┃ island    ┃ count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ string    │ string    │ int64 │
├───────────┼───────────┼───────┤
│ Gentoo    │ Biscoe    │   124 │
│ Chinstrap │ Dream     │    68 │
│ Adelie    │ Dream     │    56 │
│ Adelie    │ Torgersen │    52 │
│ Adelie    │ Biscoe    │    44 │
└───────────┴───────────┴───────┘
Code
sleep(3)
  1. Avoid rate-limiting by waiting.

This works well enough for simple cases and can be expanded to handle more complex ones. In many scenarios, it may be easier to express a query in English (or another natural language) than to write it in SQL, especially when working across multiple SQL dialects.
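One natural extension is to validate the generated SQL and retry on failure. Below is a minimal sketch of that idea, reusing the _generate_sql_select AI function defined above; the retry loop and error-feedback prompt are illustrative, not part of Marvin or Ibis:

def sql_from_text_with_retry(text: str, t: Table, max_retries: int = 3) -> Table:
    """Run SQL from text, retrying with error feedback on failure (illustrative sketch)."""
    prompt = text
    for _ in range(max_retries):
        sql = _generate_sql_select(prompt, t.get_name(), t.schema()).strip(";")
        try:
            # Ibis raises here if it can't make sense of the generated SQL
            return t.sql(sql)
        except Exception as e:
            # feed the failing query and its error back into the next attempt
            prompt = f"{text}\n\nThe previous query failed.\nQuery: {sql}\nError: {e}"
    raise ValueError(f"no valid SQL generated after {max_retries} attempts")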

SQL isn’t standardized in practice: there are many dialects across data platforms. Ibis works around this by providing a standard Python API for analytic code, but it must make compromises to support many data platforms, often by emitting SQL in their native dialects. Substrait is a newer project that aims to solve this problem by providing a standard, portable, and extensible intermediate representation (IR) for data transformation code that Ibis and data platforms could all standardize on. Substrait is still in the early stages of development, but it’s worth keeping an eye on; Ibis will adopt it once it is supported across many data platforms.

For now, we’ll focus on generating SQL and Python analytical code with LLMs.

Approach 2: LLM writes an analytic subroutine

If more complex logic needs to be expressed, SoTA LLMs are also decent at writing Python and a number of other programming languages used in analytic subroutines. Many data platforms support user-defined functions (UDFs) in Python or some other language. We’ll stick to scalar Python UDFs via DuckDB to demonstrate the concept; for reference, a hand-written Ibis scalar UDF is sketched below, and then we’ll have the LLM write one for us.
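A hand-written scalar Python UDF in Ibis looks like this (the shout function here is purely illustrative):

import ibis

@ibis.udf.scalar.python
def shout(s: str) -> str:
    """Uppercase a string and add an exclamation mark."""
    return s.upper() + "!"

Now let’s have the LLM generate a function in this same shape: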

@marvin.ai_fn
def _generate_python_function(text: str) -> str:
    """Generate a simple, typed, correct Python function from text."""


def create_udf_from_text(text: str) -> str:
    """Create a UDF from text."""
    return f"""
import ibis

@ibis.udf.scalar.python
{_generate_python_function(text)}
""".strip()
  1. A non-deterministic, LLM-powered AI function.
  2. A deterministic, human-authored function that calls the AI function.
udf = create_udf_from_text(
    "a function named count_vowels that given an input string, returns an int w/ the number of vowels (y_included as a boolean option defaulted to False)"
)
print(udf)
exec(udf)
import ibis

@ibis.udf.scalar.python
def count_vowels(input_string: str, y_included: bool = False) -> int:
    """Given an input string, it returns the number of vowels. y can be included as a vowel."""
    vowels = 'aeiou'
    if y_included:
        vowels += 'y'
    return sum(1 for char in input_string if char in vowels)
Code
sleep(3)
  1. Avoid rate-limiting by waiting.
t4 = t3.mutate(
    species_vowel_count=count_vowels(t3.species),
    island_vowel_count=count_vowels(t3.island),
)
t4
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ species   ┃ island    ┃ count ┃ species_vowel_count ┃ island_vowel_count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ string    │ string    │ int64 │ int64               │ int64              │
├───────────┼───────────┼───────┼─────────────────────┼────────────────────┤
│ Gentoo    │ Biscoe    │   124 │                   3 │                  3 │
│ Chinstrap │ Dream     │    68 │                   2 │                  2 │
│ Adelie    │ Dream     │    56 │                   3 │                  2 │
│ Adelie    │ Torgersen │    52 │                   3 │                  3 │
│ Adelie    │ Biscoe    │    44 │                   3 │                  3 │
└───────────┴───────────┴───────┴─────────────────────┴────────────────────┘
Code
sleep(3)
  1. Avoid rate-limiting by waiting.

In this case, there’s no reason not to have a human in the loop review the generated code before committing it for production use. This could be useful for quick prototyping or, given a box of tools in the form of UDFs, for working through a natural language interface.
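As a minimal sketch of that human-in-the-loop step (the review_and_exec helper is illustrative, assumes interactive use, and is not part of Marvin or Ibis):

def review_and_exec(code: str) -> None:
    """Show generated code and only exec it after explicit human approval."""
    print(code)
    if input("Run this code? [y/N] ").strip().lower() == "y":
        exec(code)
    else:
        print("skipped")

With this in place, review_and_exec(udf) would replace the bare exec(udf) call above.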

Approach 3: Use LLM in an analytic subroutine

We can also call the LLM once per row in the table via a subroutine. For variety, we’ll use an AI model instead of an AI function:

from pydantic import BaseModel, Field

# decrease cost
marvin.settings.llm_model = "openai/gpt-3.5-turbo-16k"


@marvin.ai_model
class VowelCounter(BaseModel):
    """Count vowels in a string."""

    include_y: bool = Field(False, description="Include 'y' as a vowel.")
    # num_a: int = Field(..., description="The number of 'a' vowels.")
    # num_e: int = Field(..., description="The number of 'e' vowels.")
    # num_i: int = Field(..., description="The number of 'i' vowels.")
    # num_o: int = Field(..., description="The number of 'o' vowels.")
    # num_u: int = Field(..., description="The number of 'u' vowels.")
    # num_y: int = Field(..., description="The number of 'y' vowels.")
    num_total: int = Field(..., description="The total number of vowels.")


VowelCounter("hello world")
  1. Additional imports for Pydantic.
  2. Configure Marvin to use a cheaper model.
  3. A non-deterministic, LLM-powered AI model.
  4. Call the AI model on some text.
VowelCounter(include_y=False, num_total=3)

Then we’ll have the LLM write the UDF that calls the LLM, just to be fancy:

udf = create_udf_from_text(
    "a function named count_vowels_ai that given an input string, calls VowelCounter on it and returns the num_total attribute of that result"
)
print(udf)
exec(udf)
import ibis

@ibis.udf.scalar.python
def count_vowels_ai(input_string: str) -> int:
    result = VowelCounter(input_string)
    return result.num_total
Code
sleep(3)
  1. Avoid rate-limiting by waiting.
t5 = t3.mutate(
    species_vowel_count=count_vowels_ai(t3.species),
    island_vowel_count=count_vowels_ai(t3.island),
)
t5
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ species   ┃ island    ┃ count ┃ species_vowel_count ┃ island_vowel_count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ string    │ string    │ int64 │ int64               │ int64              │
├───────────┼───────────┼───────┼─────────────────────┼────────────────────┤
│ Gentoo    │ Biscoe    │   124 │                   3 │                  3 │
│ Chinstrap │ Dream     │    68 │                   2 │                  1 │
│ Adelie    │ Dream     │    56 │                   3 │                  2 │
│ Adelie    │ Torgersen │    52 │                   3 │                  3 │
│ Adelie    │ Biscoe    │    44 │                   3 │                  3 │
└───────────┴───────────┴───────┴─────────────────────┴────────────────────┘

Notice that in this UDF, unlike in the previous example, an LLM is called (possibly several times) for each row in the table. This is a very expensive operation, and we’ll need to be careful about how we use it in practice.
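One simple way to keep costs down is to cache results so the LLM is only called once per distinct value. A minimal sketch, reusing the VowelCounter model defined above (the cached wrapper is illustrative, not part of Marvin or Ibis):

from functools import lru_cache

@lru_cache(maxsize=None)
def _count_vowels_cached(input_string: str) -> int:
    # only reach the LLM the first time we see a given string
    return VowelCounter(input_string).num_total

@ibis.udf.scalar.python
def count_vowels_ai_cached(input_string: str) -> int:
    return _count_vowels_cached(input_string)

For a column like species, with only a handful of distinct values, this reduces hundreds of potential LLM calls to a few. Another option is to compute on the distinct values and join the results back to the table.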

Code
sleep(3)
  1. Avoid rate-limiting by waiting.

Summary

To summarize this post:

from rich import print

with open("index.qmd", "r") as f:
    self_text = f.read()

# increase accuracy
marvin.settings.llm_model = "openai/gpt-4"

@marvin.ai_model
class Summary(BaseModel):
    """Summary of text."""

    summary_line: str = Field(..., description="The one-line summary of the text.")
    summary_paragraph: str = Field(
        ..., description="The one-paragraph summary of the text."
    )
    conclusion: str = Field(
        ..., description="The conclusion the reader should draw from the text."
    )
    key_points: list[str] = Field(..., description="The key points of the text.")
    critiques: list[str] = Field(
        ..., description="Professional, fair critiques of the text."
    )
    suggested_improvements: list[str] = Field(
        ..., description="Suggested improvements for the text."
    )
    sentiment: float = Field(..., description="The sentiment of the text.")
    sentiment_label: str = Field(..., description="The sentiment label of the text.")
    author_bias: str = Field(..., description="The author bias of the text.")


print(Summary(self_text))
Summary(
    summary_line="The blog post titled 'Three approaches' by Cody Peterson, dated 2023-10-13, discusses three 
distinct approaches to applying Language Learning Models (LLMs) to data analytics using open-source tools Ibis and 
Marvin.",
    summary_paragraph='The author explains three approaches to applying LLMs to data analytics. The first approach 
is letting the LLM write an analytic code, specifically generating SQL. The second approach involves letting LLM 
write an analytic subroutine, demonstrated by writing Python code for user-defined functions. The third approach 
uses LLM in an analytic subroutine, illustrated by calling the LLM once-per-row in a table via a subroutine. Each 
method has potential applications and limitations, and the choice of approach will depend on the specific 
requirements of the data analytics task.',
    conclusion='LLMs can be used to transform and analyze data in various ways. By using these approaches in 
combination with open-source tools like Ibis and Marvin, it is possible to build a natural language interface for 
data analytics that supports multiple backends. The author concludes by introducing an open-source data & AI 
project for building next-generation natural language interfaces to data, Ibis Birdbrain.',
    key_points=[
        'The LLM can write an analytic code, particularly generating SQL.',
        'The LLM can also write an analytic subroutine, exemplified by writing Python code for user-defined 
functions.',
        'The LLM can be used in an analytic subroutine, shown by calling the LLM once-per-row in a table via a 
subroutine.',
        'The author suggests that each approach has its potential applications and limitations.',
        'The choice of approach will depend on the specific requirements of the data analytics task.'
    ],
    critiques=[
        'The author could have provided more concrete examples or case studies to illustrate the application of 
these approaches.',
        'A comparison of the advantages and disadvantages of each approach would have been helpful.'
    ],
    suggested_improvements=[
        'The author could include real-world applications or examples of each approach.',
        'A detailed comparison of the pros and cons of each approach would provide better guidance for readers.'
    ],
    sentiment=0.25,
    sentiment_label='Neutral',
    author_bias='The author shows a positive bias towards the use of open-source tools, Ibis and Marvin, for data 
analytics.'
)

Next steps

You can get involved with Ibis Birdbrain, our open-source data & AI project for building next-generation natural language interfaces to data.

Read the next post in this series.
