In this “LLMs and data” series, we’ll explore how to apply large language models (LLMs) to data analytics. We’ll walk through the steps to build Ibis Birdbrain.
Throughout the series, we’ll be using Marvin and Ibis. A brief introduction to each is provided below.
Marvin
Marvin is an AI engineering framework that makes it easy to build up to an interactive conversational application.
Marvin makes calls to an AI platform, typically authenticated with an API key set as an environment variable. In this case, we’ll load a .env file that contains secrets for the AI platform Marvin will use. We also set the large language model to use.
```python
import marvin
from rich import print
from time import sleep
from dotenv import load_dotenv

load_dotenv()

# increase accuracy
marvin.settings.llm_model = "openai/gpt-4"
# decrease cost
# marvin.settings.llm_model = "openai/gpt-3.5-turbo"

test_str = "working with data and LLMs on 18+ data platforms is easy!"
test_str
```

1. Import the libraries we need.
2. Load the environment variable to set up Marvin to call our OpenAI account.
3. Configure the LLM model to use.
4. Some text to test on.

```
'working with data and LLMs on 18+ data platforms is easy!'
```
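The .env file only needs to hold the platform’s API key. A minimal sketch (`OPENAI_API_KEY` is the variable the OpenAI client reads; the value is a placeholder):

```
# .env
OPENAI_API_KEY="sk-..."
```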
Functions
AI functions are one of the building blocks in Marvin. They let you specify a typed Python function with no code, only a docstring, to achieve a wide variety of tasks.

We’ll demonstrate this with an AI function that translates text:
```python
@marvin.ai_fn
def translate(text: str, from_: str = "English", to: str = "Spanish") -> str:
    """translates the text"""


translate(test_str)
```

```
'¡Trabajar con datos y LLM en más de 18 plataformas de datos es fácil!'
```

(In English: “Working with data and LLMs on more than 18 data platforms is easy!”)
```python
sleep(3)
```

1. Avoid rate-limiting by waiting.
Models
AI models are another building block, generating instances of Python classes from input text. They’re a great way to build structured data from unstructured data, customized to your needs.

We’ll demonstrate this with an AI model that extracts the parts of a sentence:
```python
from pydantic import BaseModel, Field

# decrease cost
marvin.settings.llm_model = "openai/gpt-3.5-turbo"


@marvin.ai_model
class ExtractParts(BaseModel):
    """Extracts parts of a sentence"""

    subject: str = Field(..., description="The subject of the sentence.")
    predicate: str = Field(..., description="The predicate of the sentence.")
    objects: list[str] = Field(..., description="The objects of the sentence.")
    modifiers: list[str] = Field(..., description="The modifiers of the sentence.")


ExtractParts(test_str)
```

```
ExtractParts(subject='working with data and LLMs', predicate='is', objects=['easy'], modifiers=['on 18+ data platforms'])
```
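Because the decorated class is still an ordinary Pydantic model, the parsed result behaves like any other model instance. A sketch using plain Pydantic with no LLM call (the class name `SentenceParts` is a stand-in, and the field values are copied from the output above):

```python
from pydantic import BaseModel


class SentenceParts(BaseModel):
    """Plain-Pydantic stand-in with the same fields as ExtractParts"""

    subject: str
    predicate: str
    objects: list[str]
    modifiers: list[str]


# construct an instance directly, as if the LLM had filled in the fields
parts = SentenceParts(
    subject="working with data and LLMs",
    predicate="is",
    objects=["easy"],
    modifiers=["on 18+ data platforms"],
)
parts.objects  # ['easy']
```

From here you can validate, serialize, or pass the instance to downstream code like any structured record.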
```python
sleep(1)
```

1. Avoid rate-limiting by waiting.
Classifiers
AI classifiers are another building block, classifying input text as one member of a Python Enum. They’re the most time- and cost-efficient way to apply LLMs, since a call produces only a single output token that selects an option from the specified Enum.
We’ll demonstrate this by classifying the language of some text:
```python
from enum import Enum

# increase accuracy
marvin.settings.llm_model = "openai/gpt-4"


@marvin.ai_classifier
class IdentifyLanguage(Enum):
    """Identifies the language of the text"""

    english = "English"
    spanish = "Spanish"


IdentifyLanguage(test_str).value
```

```
'English'
```
```python
sleep(1)
```

1. Avoid rate-limiting by waiting.
```python
IdentifyLanguage(translate(test_str)).value
```

```
'Spanish'
```
```python
sleep(3)
```

1. Avoid rate-limiting by waiting.
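The single-output-token idea is what makes classifiers cheap: the model only has to choose among the Enum members. A toy sketch without an LLM (the character-marker heuristic below is purely illustrative and not how Marvin works internally):

```python
from enum import Enum


class Language(Enum):
    english = "English"
    spanish = "Spanish"


def classify(text: str) -> Language:
    """Toy stand-in for an LLM classifier: always return exactly one Enum member."""
    spanish_markers = set("¡¿áéíóúñ")
    if any(ch in spanish_markers for ch in text):
        return Language.spanish
    return Language.english


classify("¡Trabajar con datos es fácil!").value  # 'Spanish'
```

Whatever the selection mechanism, the contract is the same as Marvin’s: the caller always gets back a valid Enum member, never free-form text.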
Ibis
Ibis is the portable Python dataframe library that enables Ibis Birdbrain to work on many data platforms.
Ibis makes calls to a data platform, providing a consistent dataframe API while pushing compute down to (local or remote) query engines and storage. DuckDB is the default backend and we’ll typically use it for demo purposes. You can work with an in-memory instance, but we’ll often create a database file from example data:
Connect to the data and load a table into a variable.
Backend
A backend provides the connection and basic management of the data platform. Above, we created the con variable that is an instance of a DuckDB backend:
```python
con
```

```
<ibis.backends.duckdb.Backend at 0x130ee91d0>
```
It usually contains some tables:
```python
con.list_tables()
```

```
['penguins']
```
We can access some internals of Ibis to see what backends are available:
Tip
Don’t rely on accessing internals of Ibis in production.
```python
backends = [entrypoint.name for entrypoint in ibis.util.backend_entry_points()]
backends
```
When working with many tables, you should name them descriptively.
Schema
A table has a schema that Ibis maps to the data platform’s data types:
```python
t.schema()
```

```
ibis.Schema {
  species            string
  island             string
  bill_length_mm     float64
  bill_depth_mm      float64
  flipper_length_mm  int64
  body_mass_g        int64
  sex                string
  year               int64
}
```
LLMs and data: Marvin and Ibis
You can use Marvin and Ibis together to easily apply LLMs to data.
```python
from ibis.expr.schema import Schema
from ibis.expr.types.relations import Table


@marvin.ai_fn
def sql_select(
    text: str, table_name: str = t.get_name(), schema: Schema = t.schema()
) -> str:
    """writes the SQL SELECT statement to query the table according to the text"""


query = "the unique combination of species and islands"
sql = sql_select(query).strip(";")
sql
```