In discussions at Voltron Data, we identified three distinct approaches to applying LLMs to data analytics that can be implemented today:
LLM writes analytic code
LLM writes an analytic subroutine
Use an LLM in an analytic subroutine
While these three approaches are not an exhaustive list of how LLMs can be applied to data, they are easy to understand and can be implemented with Ibis and Marvin in a few lines of code. With these two open-source tools, we can build a natural language interface for data analytics that supports 18+ backends.
But first, let’s demonstrate the three approaches.
Approach 1: LLM writes analytic code
State of the art (SoTA) LLMs are decent at generating SQL out of the box. We can add cleverness to handle errors, retries, and more, but in its simplest form:
```python
t3 = sql_from_text(
    "the unique combination of species and islands, with their counts, "
    "ordered from highest to lowest, and name that column just 'count'",
    t,
)
t3
```
This works well enough for simple cases and can be expanded to handle complex ones. In many scenarios, it may be easier to express a query in English or another language than to write it in SQL, especially when working across multiple SQL dialects.
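The `sql_from_text` helper isn't defined in this excerpt; a minimal sketch of how such a helper might work, with the LLM call injected as a plain callable (stubbed below) instead of Marvin's actual API. The function name, parameters, and prompt wording here are all assumptions for illustration:

```python
def sql_from_text(question, table_name, schema, ask_llm):
    """Build a prompt from the question and the table schema, then ask for SQL.

    `ask_llm` is any callable mapping a prompt string to a SQL string;
    in practice it would be an LLM call, here it is stubbed for testing.
    """
    prompt = (
        f"Table `{table_name}` has columns: {', '.join(schema)}.\n"
        f"Write a single SQL SELECT statement that answers: {question}"
    )
    # normalize the response: strip whitespace and any trailing semicolon
    return ask_llm(prompt).strip().rstrip(";")


# stub "LLM" that returns a canned answer, just to show the flow
sql = sql_from_text(
    "count rows per species",
    "penguins",
    ["species", "island"],
    lambda prompt: "SELECT species, COUNT(*) AS count FROM penguins GROUP BY species;",
)
print(sql)
```

The real version would also pass the table schema automatically and validate that the response parses as SQL before executing it.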
SQL isn’t standard, with many dialects across data platforms. Ibis works around this by providing a standard Python API for analytic code but must make compromises to support many data platforms, often via SQL in their native dialect. Substrait is a newer project that aims to solve this problem by providing a standard, portable, and extensible intermediary representation (IR) for data transformation code that Ibis and data platforms could all standardize on. Substrait is still in the early stages of development, but it’s worth keeping an eye on and will be adopted in Ibis once supported across many data platforms.
For now, we’ll focus on generating SQL and Python analytical code with LLMs.
Approach 2: LLM writes an analytic subroutine
If more complex logic needs to be expressed, SoTA LLMs are also decent at writing Python and a number of other programming languages that are used in analytical subroutines. Many data platforms support user-defined functions (UDFs) in Python or some other language. We’ll stick to scalar Python UDFs via DuckDB to demonstrate the concept:
```python
@marvin.ai_fn
def _generate_python_function(text: str) -> str:  # <1>
    """Generate a simple, typed, correct Python function from text."""


def create_udf_from_text(text: str) -> str:  # <2>
    """Create a UDF from text."""
    return f"""
import ibis

@ibis.udf.scalar.python
{_generate_python_function(text)}
""".strip()
```

1. A non-deterministic, LLM-powered AI function.
2. A deterministic, human-authored function that calls the AI function.
```python
udf = create_udf_from_text(
    "a function named count_vowels that given an input string, returns an int "
    "w/ the number of vowels (y_included as a boolean option defaulted to False)"
)
print(udf)
exec(udf)
```
```python
import ibis

@ibis.udf.scalar.python
def count_vowels(input_string: str, y_included: bool = False) -> int:
    """Given an input string, it returns the number of vowels. y can be included as a vowel."""
    vowels = 'aeiou'
    if y_included:
        vowels += 'y'
    return sum(1 for char in input_string if char in vowels)
```
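Stripped of the Ibis decorator, the generated function is plain Python, so its behavior is easy to sanity-check before committing it:

```python
def count_vowels(input_string: str, y_included: bool = False) -> int:
    """Count vowels in a string, optionally treating 'y' as a vowel."""
    vowels = 'aeiou'
    if y_included:
        vowels += 'y'
    return sum(1 for char in input_string if char in vowels)


print(count_vowels("hello world"))              # 3
print(count_vowels("rhythm"))                   # 0
print(count_vowels("rhythm", y_included=True))  # 1
```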
In this case, there’s no reason not to have a human in the loop reviewing the output code and committing it for production use. This could be useful for quick prototyping or, given a box of tools in the form of UDFs, for working through a natural language interface.
Approach 3: Use an LLM in an analytic subroutine
We can also call the LLM once per row of the table via a subroutine. For variety, we’ll use an AI model instead of an AI function:
```python
from pydantic import BaseModel, Field  # <1>

# decrease cost
marvin.settings.llm_model = "openai/gpt-3.5-turbo-16k"  # <2>


@marvin.ai_model  # <3>
class VowelCounter(BaseModel):
    """Count vowels in a string."""

    include_y: bool = Field(False, description="Include 'y' as a vowel.")
    # num_a: int = Field(..., description="The number of 'a' vowels.")
    # num_e: int = Field(..., description="The number of 'e' vowels.")
    # num_i: int = Field(..., description="The number of 'i' vowels.")
    # num_o: int = Field(..., description="The number of 'o' vowels.")
    # num_u: int = Field(..., description="The number of 'u' vowels.")
    # num_y: int = Field(..., description="The number of 'y' vowels.")
    num_total: int = Field(..., description="The total number of vowels.")


VowelCounter("hello world")  # <4>
```

1. Additional imports for Pydantic.
2. Configure Marvin to use a cheaper model.
3. A non-deterministic, LLM-powered AI model.
4. Call the AI model on some text.
```
VowelCounter(include_y=False, num_total=3)
```
Then we’ll have the LLM write the UDF that calls the LLM, just to be fancy:
```python
udf = create_udf_from_text(
    "a function named count_vowels_ai that given an input string, calls "
    "VowelCounter on it and returns the num_total attribute of that result"
)
print(udf)
exec(udf)
```
Notice that in this UDF, unlike in the previous example, an LLM is being called (possibly several times) for each row in the table. This is a very expensive operation, and we’ll need to be careful about how we use it in practice.
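One way to rein in that cost is to memoize the call so each distinct input value hits the LLM only once. A sketch with the LLM stubbed out by the deterministic vowel count — `count_vowels_ai_cached` is a hypothetical name, not a function from the post:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_vowels_ai_cached(input_string: str) -> int:
    # In the post this body would call VowelCounter (an LLM call);
    # stubbed here with deterministic logic so the sketch is runnable.
    return sum(1 for ch in input_string if ch in "aeiou")


rows = ["hello world", "hello world", "penguin"]
results = [count_vowels_ai_cached(r) for r in rows]
print(results)                                    # [3, 3, 3]
print(count_vowels_ai_cached.cache_info().hits)   # 1 (the repeated row)
```

For real tables, deduplicating the column before applying the UDF achieves the same effect inside the database.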
```python
sleep(3)  # <1>
```

1. Avoid rate-limiting by waiting.
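A fixed `sleep` works, but retrying with exponential backoff on failure is a common alternative. A minimal sketch with the delay function injected so it can be tested without real waiting — the helper name and signature are assumptions, not a Marvin or Ibis API:

```python
import time


def call_with_backoff(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry `fn` on any exception, doubling the delay each attempt."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: re-raise the last error
            sleep(base_delay * 2 ** attempt)


# a stub that fails twice (like a rate-limited API) then succeeds
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("rate limited")
    return "ok"

result = call_with_backoff(flaky, sleep=lambda s: None)
print(result)  # ok, after two retries
```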
Summary
To summarize this post:
```python
from rich import print  # <1>

with open("index.qmd", "r") as f:
    self_text = f.read()

# increase accuracy
marvin.settings.llm_model = "openai/gpt-4"  # <2>


@marvin.ai_model
class Summary(BaseModel):
    """Summary of text."""

    summary_line: str = Field(..., description="The one-line summary of the text.")
    summary_paragraph: str = Field(
        ..., description="The one-paragraph summary of the text."
    )
    conclusion: str = Field(
        ..., description="The conclusion the reader should draw from the text."
    )
    key_points: list[str] = Field(..., description="The key points of the text.")
    critiques: list[str] = Field(
        ..., description="Professional, fair critiques of the text."
    )
    suggested_improvements: list[str] = Field(
        ..., description="Suggested improvements for the text."
    )
    sentiment: float = Field(..., description="The sentiment of the text.")
    sentiment_label: str = Field(..., description="The sentiment label of the text.")
    author_bias: str = Field(..., description="The author bias of the text.")


print(Summary(self_text))
```

1. Additional import for pretty printing.
2. Configure Marvin to use a more accurate model.
```
Summary(
    summary_line="The blog post titled 'Three approaches' by Cody Peterson, dated 2023-10-13, discusses three distinct approaches to applying Language Learning Models (LLMs) to data analytics using open-source tools Ibis and Marvin.",
    summary_paragraph='The author explains three approaches to applying LLMs to data analytics. The first approach is letting the LLM write an analytic code, specifically generating SQL. The second approach involves letting LLM write an analytic subroutine, demonstrated by writing Python code for user-defined functions. The third approach uses LLM in an analytic subroutine, illustrated by calling the LLM once-per-row in a table via a subroutine. Each method has potential applications and limitations, and the choice of approach will depend on the specific requirements of the data analytics task.',
    conclusion='LLMs can be used to transform and analyze data in various ways. By using these approaches in combination with open-source tools like Ibis and Marvin, it is possible to build a natural language interface for data analytics that supports multiple backends. The author concludes by introducing an open-source data & AI project for building next-generation natural language interfaces to data, Ibis Birdbrain.',
    key_points=[
        'The LLM can write an analytic code, particularly generating SQL.',
        'The LLM can also write an analytic subroutine, exemplified by writing Python code for user-defined functions.',
        'The LLM can be used in an analytic subroutine, shown by calling the LLM once-per-row in a table via a subroutine.',
        'The author suggests that each approach has its potential applications and limitations.',
        'The choice of approach will depend on the specific requirements of the data analytics task.',
    ],
    critiques=[
        'The author could have provided more concrete examples or case studies to illustrate the application of these approaches.',
        'A comparison of the advantages and disadvantages of each approach would have been helpful.',
    ],
    suggested_improvements=[
        'The author could include real-world applications or examples of each approach.',
        'A detailed comparison of the pros and cons of each approach would provide better guidance for readers.',
    ],
    sentiment=0.25,
    sentiment_label='Neutral',
    author_bias='The author shows a positive bias towards the use of open-source tools, Ibis and Marvin, for data analytics.',
)
```
Next steps
You can get involved with Ibis Birdbrain, our open-source data & AI project for building next-generation natural language interfaces to data.