Enable Safe Chat with Your Databases with Malloy and PandasAI
Data Exploration in Natural Language
Introduction - Implications of connecting a database directly to an LLM
When it comes to reaping value from data, one obstacle has persistently stood in the way for most organisations: the limited ability of the non-technical workforce to perform queries and data exploration tasks on databases and other data sources. This is because most data analysis tasks still require skill in SQL or SQL-like languages.
One way to broaden access to data services for easy data exploration and querying is to integrate data sources (DBMSs, data lakes, files, etc.) with an LLM or generative AI service. Enabling natural language (NL from here on) queries for data sources broadens access to, and the utility of, data assets both open and proprietary. At the same time, the business is able to tap new sources of domain expertise for high-impact value innovation (e.g. internal line-of-business users, customers, partners).
While feasible, connecting an LLM directly to databases has its own pitfalls. Malicious code, mistakes from inexperienced users and other operational errors can directly affect database content and objects, or trigger an unintended workflow by accident. In a multi-cloud environment, the cost of API calls for code execution can also run uncontrollably high when usage spikes. Whitelisting modules and data sources alone may not be sufficient to safeguard against such mishaps.
In this blog we discuss a safe and cost-effective approach to natural language querying for databases. The proposed architecture employs a semantic data model (Malloy models) and a dataframe wrapper that interfaces with LLMs (PandasAI).
2.0. Brief Introduction to Key Technologies
Over the years, enabling NL queries for DBMSs and data warehouses has become easier with the availability of various tools and solutions. The following technologies speed the development and delivery of such capabilities for businesses. In addition, the architecture employs a semantic data model (as illustrated in diagram 1) to preserve the integrity of data sources and execute queries locally. Together, these technologies cut API costs by running queries locally, prevent malicious code from reaching data sources and increase the value of investments in data artefacts.
PandasAI. PandasAI is a Python library that integrates generative AI capabilities into pandas and Polars dataframes to make them conversational. In essence, PandasAI is a wrapper service that connects various types of data tables to LLMs/generative AI over an API connection to enable NL queries. PandasAI classes come with ready-made features for NL querying, visualisation, cleansing and featurisation of dataframes. Users can choose from a variety of LLMs such as BambooLLM (the default), OpenAI, Google PaLM, HuggingFace, LangChain, Azure OpenAI and AWS Bedrock. Models can be retrained with vectorised datasets to polish results, and extended with new skills to enhance query responses. PandasAI offers a free tier for experimentation and a paid tier for production. A minimal example follows below.
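As a quick taste of the library, the sketch below wraps a small pandas dataframe and asks it a question in plain English. This is a minimal sketch: the data and question are invented, and the default BambooLLM expects a PANDASAI_API_KEY environment variable.

```python
import pandas as pd
from pandasai import SmartDataframe

sales = pd.DataFrame({
    "country": ["Malaysia", "Singapore", "Thailand"],
    "revenue": [5000, 7000, 4500],
})

# With no explicit config, PandasAI falls back to its default LLM (BambooLLM)
sdf = SmartDataframe(sales)
print(sdf.chat("Which country has the highest revenue?"))
```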
Malloy. Malloy is an analytic language that runs on relational databases and supports DuckDB, PostgreSQL and BigQuery. Malloy compiles into SQL, but coders can embed SQL statements within Malloy code when needed. Malloy is distinct from standard SQL in that: i) it separates data models from queries; ii) all Malloy models and named queries are reusable in an analytic development environment; and iii) it adopts a network data model to support nested structures natively and efficiently (e.g. joins, calculations). Furthermore, Malloy models generate in-memory objects which fetch fresh data at each run, keeping query results consistent with the source. This makes Malloy a great language both for analytics development and for semantic data modelling that delivers efficient locally run queries and frees up traffic to data sources.
DuckDB. Both Malloy and PandasAI have built-in integration with DuckDB to load data and sustain in-memory data objects. DuckDB is a modern, in-process SQL OLAP database management system with rich features. DuckDB uses a vectorised query processing engine and provides fast responses, including direct querying of Apache Parquet files. DuckDB comes with built-in support for various file formats and sources such as CSV, Parquet, Excel, JSON, Iceberg tables, Delta Lake and various DBMSs.
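To illustrate the in-process, file-friendly style of DuckDB, the snippet below queries a Parquet file straight from SQL via the Python client; the file path and column names are placeholders.

```python
import duckdb

con = duckdb.connect()  # in-memory database, nothing to install or serve

# Query a Parquet file directly; no separate loading step is required
top = con.execute("""
    SELECT country, SUM(revenue) AS total_revenue
    FROM 'data/sales.parquet'
    GROUP BY country
    ORDER BY total_revenue DESC
    LIMIT 5
""").df()  # fetch the result as a pandas dataframe
print(top)
```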
3.0. Architectural Overview for Enabling Conversations with Databases
To lower the risk posed by a direct connection between LLMs and databases, the proposed architecture places a semantic data layer between the data sources and the LLMs. The architecture is divided into three key layers: i) data ingestion and pre-processing; ii) a semantic data layer for data modelling and transformation into dataframes; and iii) the dataframe wrapper and LLM integration.
Diagram 1 - High Level Solution Architecture for adding NL Querying Capabilities
3.1. Ingesting data into Malloy Models.
Source data including relational tables, CSV, Parquet, JSON and Excel files are ingested into Malloy models through DuckDB. DuckDB has direct extensions for PostgreSQL, MySQL, SQLite, Delta Lake and Iceberg, which allow it to read and write data from a running database, as sketched below. In addition, DuckDB can import CSV, Parquet, JSON and Excel files to query directly or store as tables. Other relational sources such as Oracle or SQL Server require additional steps to migrate via third-party data movement or database replication systems such as Airbyte.
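As a hedged sketch of that ingestion step, the snippet below attaches a running PostgreSQL database through DuckDB's postgres extension and imports a CSV file alongside it; the connection string, schema and table names are placeholders.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres")
con.execute("LOAD postgres")

# Attach a running PostgreSQL database read-only (connection string is a placeholder)
con.execute(
    "ATTACH 'dbname=shop user=analyst host=localhost' AS pg (TYPE postgres, READ_ONLY)"
)

# Materialise a local copy of a table for the semantic layer to model
con.execute("CREATE TABLE orders AS SELECT * FROM pg.public.orders")

# Files can be imported just as directly
con.execute("CREATE TABLE customers AS SELECT * FROM 'data/customers.csv'")
```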
3.2. Malloy Semantic Data Layer.
The semantic data layer is powered by Malloy and a code repository with rights and version management. In this layer, users build data models, queries and views, and store the finished Malloy models in the code repository. All generated Malloy models are reusable across the dev environment, and teams can enhance the data models and queries in the repository collaboratively. Malloy uses DuckDB to load raw data sources into in-memory data models for query execution.
A newly created data source in Malloy is enhanced with new dimensions, measures and table joins, as shown in diagram 2 and the sketch below. Furthermore, unwanted columns can be dropped using Malloy statements such as 'accept' and 'except' to keep models lean. Malloy comes with rich features for shaping data from one or multiple sources to meet various types of needs.
Diagram 2 - Sample data model construction in Malloy
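In the spirit of diagram 2, a Malloy model of this kind might look roughly like the following, embedded here as a Python string so it can be written out and loaded later. The source files, columns, joins and the 'internal_audit_flag' column dropped via 'except' are all invented for the sketch.

```python
# A hypothetical Malloy model for an orders dataset, saved as orders.malloy
MODEL = """
source: orders is duckdb.table('data/orders.parquet') extend {
  // drop a column the models never need
  except: internal_audit_flag

  dimension: is_large_order is amount > 100
  measure: order_count is count()
  measure: total_revenue is amount.sum()

  // join a second source on its key
  join_one: customers is duckdb.table('data/customers.parquet')
    on customer_id = customers.customer_id
}
"""

with open("orders.malloy", "w") as f:
    f.write(MODEL)
```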
Malloy's Python integration allows query results to be transformed into pandas dataframes, ready for passing to PandasAI, as in the sketch after diagram 3 below. As mentioned earlier, both the data models and the query results are by default in-memory objects, though users are free to materialise output tables when a use case benefits from such an approach.
Diagram 3 - Sample queries in Malloy
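For completeness, here is a hedged sketch of running a query like those in diagram 3 from Python with the malloy-py package and converting the result for PandasAI. The query text is illustrative, and the exact malloy-py API surface (Runtime, DuckDbConnection, load_file/run, df()) reflects recent releases and may differ between versions.

```python
import asyncio

import malloy
from malloy.data.duckdb import DuckDbConnection

# An ad hoc query against the orders.malloy model sketched above
QUERY = """
run: orders -> {
  group_by: customers.country
  aggregate: total_revenue
}
"""

async def main():
    with malloy.Runtime() as runtime:
        runtime.add_connection(DuckDbConnection(home_dir="."))
        data = await runtime.load_file("orders.malloy").run(query=QUERY)
        return data.df()  # a pandas dataframe, ready for PandasAI

df = asyncio.run(main())
print(df)
```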
The semantic data model powered by Malloy, as shown in diagram 1 above, separates the data sources from the LLMs. This means that when a NL query is processed, the LLM generates Python code for execution against the dataframe, not the source databases. This limits exposure to the risks mentioned earlier, reduces API calls to the source, and helps the team streamline the data layer and manage data models in an orderly fashion.
3.3. PandasAI
PandasAI is a dataframe wrapper service that integrates pandas and Polars dataframes with self-hosted or managed LLMs via an API connection. The LLMs respond to NL queries with strings, dataframes or plots. PandasAI offers three classes to power NL queries over data sources: i) SmartDataframe; ii) SmartDatalake; and iii) Agent.
SmartDataframe. The PandasAI SmartDataframe calls the connected LLM/generative AI model via an API connection to generate Python code from natural language for execution on dataframes. PandasAI uses BambooLLM by default, and users are free to leverage other models of their choice or implement a local LLM.
Diagram 4 - Sample implementation of PandasAI SmartDataframe
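A sketch of how such an implementation might look; the OpenAI API key, dataframe and question are placeholders, and the output_type hint is optional.

```python
import pandas as pd
from pandasai import SmartDataframe
from pandasai.llm import OpenAI

df = pd.DataFrame({"country": ["MY", "SG"], "total_revenue": [9500, 7000]})

llm = OpenAI(api_token="YOUR_API_KEY")  # placeholder key
sdf = SmartDataframe(df, config={"llm": llm})

# The expected response type can be hinted via output_type
answer = sdf.chat("Which country has the highest total revenue?", output_type="string")
print(answer)
```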
The SmartDataframe accepts a single dataframe along with a NL query and hints on the type of response expected (e.g. string, number, plot, dataframe), and returns a response accordingly.
SmartDatalake. The PandasAI SmartDatalake works similarly to SmartDataframe but accepts multiple dataframes in its pipeline for executing NL queries.
Diagram 5 - Sample implementation of PandasAI SmartDatalake
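A comparable sketch for SmartDatalake, again with invented employee and salary tables and a placeholder API key.

```python
import pandas as pd
from pandasai import SmartDatalake
from pandasai.llm import OpenAI

employees = pd.DataFrame({"id": [1, 2], "name": ["Aina", "Wei"], "dept": ["Sales", "Ops"]})
salaries = pd.DataFrame({"id": [1, 2], "salary": [5000, 6200]})

# All dataframes in the list are exposed to the generated code
lake = SmartDatalake([employees, salaries],
                     config={"llm": OpenAI(api_token="YOUR_API_KEY")})
print(lake.chat("Who earns the most, and in which department?"))
```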
SmartDatalake internally makes the full set of input dataframes available to the generated code, so they can be joined or unioned during query execution. This is extremely useful when users are experimenting with new sets of data not yet shaped or available in the semantic layer.
Agent. The PandasAI Agent has additional features to retain context across conversational queries, ask clarification questions and explain the responses it generates.
Diagram 6 - Sample implementation of PandasAI Agent
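A possible shape for an Agent-based implementation; the memory size, questions and data are illustrative.

```python
import pandas as pd
from pandasai import Agent
from pandasai.llm import OpenAI

salaries = pd.DataFrame({"name": ["Aina", "Wei"], "dept": ["Sales", "Ops"],
                         "salary": [5000, 6200]})

agent = Agent([salaries],
              config={"llm": OpenAI(api_token="YOUR_API_KEY")},
              memory_size=10)  # retain the last 10 exchanges as context

print(agent.chat("Which department pays the highest salary?"))
print(agent.clarification_questions("And how does it compare?"))  # suggested clarifications
print(agent.explain())  # explanation of how the last answer was produced
```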
All three classes allow users to add new skills, retrain the model, and whitelist sources and modules for LLMs as required.
4.0. Retraining PandasAI
In addition to adding new skills, all three classes can be retrained with business rules, best practices, question-and-answer pairs, exception handling, and the values and virtues of the business using the 'train' method. By default, training data is stored in the 'BambooVectorStore'; alternatively, users can opt for other vector stores such as 'Qdrant' or 'Pinecone'.
Diagram 7 - Sample implementation of new skills for PandasAI Agent
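The sketch below combines a custom skill (via the @skill decorator) with the 'train' method. The currency-conversion skill, its fixed rate, and the query/code training pair are all invented for illustration, and training assumes a configured vector store (BambooVectorStore by default, via a PandasAI API key).

```python
import pandas as pd
from pandasai import Agent
from pandasai.skills import skill

@skill
def usd_to_myr(amount_usd: float) -> float:
    """Converts a USD amount to Malaysian Ringgit.

    Args:
        amount_usd (float): amount in US dollars

    Returns:
        float: the amount in MYR at an assumed static rate
    """
    return amount_usd * 4.7  # fixed rate, purely for the sketch

salaries = pd.DataFrame({"name": ["Aina", "Wei"], "salary_usd": [5000, 6200]})
agent = Agent([salaries])
agent.add_skills(usd_to_myr)

# Teach the agent a business rule as a query/code pair
agent.train(
    queries=["Who are the top earners?"],
    codes=["result = {'type': 'dataframe', 'value': salaries.nlargest(2, 'salary_usd')}"],
)
```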
PandasAI also offers advanced agents for JSON code generation, pipelines and security validation, which can be used in conjunction with a normal agent to refine the conversational experience and efficiency.
Monitoring data on query performance and accuracy can be extremely useful in identifying the types of training required to improve agents over time.
5.0. Strategies to Limit Risk Exposure and Cost
Malloy Semantic Data Layer. The Malloy semantic data layer in this architecture contributes to fewer calls to the source database and, in the process, directs PandasAI to execute queries locally in the dataframe. This is especially beneficial when the data warehouse sits in a different cloud from the LLM agent (fewer API calls and less data egress). Malloy models also prevent stale data: every time a model is executed, it re-populates with fresh data from the source and keeps it in memory for query execution. In this context, Malloy offers data professionals immense flexibility to enhance the design of their semantic data layer: managing in-house data models, data policies and contributors, limiting exposure to malicious code, authenticating access and managing rights of use.
LLM / GenAI hosting. Another important consideration is the chosen LLM service itself. Connecting to a hosted LLM provider is extremely helpful for speeding up experimentation, development and service delivery, but as the volume of queries increases, the cost of API calls will quickly become prohibitive for businesses with tighter budgets. For use cases with larger production and query volumes, consider hosting your own LLMs: review the use cases, find suitable models (e.g. via Ollama), and learn the ropes of managing LLMs in-house to support large-scale production and move away from the per-call API pricing model for better cost control.
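As one possible setup, PandasAI's LocalLLM wrapper can point at an OpenAI-compatible endpoint such as the one Ollama serves locally; the URL, port and model name below are assumptions.

```python
import pandas as pd
from pandasai import SmartDataframe
from pandasai.llm.local_llm import LocalLLM

# Ollama exposes an OpenAI-compatible endpoint on port 11434 by default;
# the model name assumes `ollama pull llama3` has been run beforehand.
llm = LocalLLM(api_base="http://localhost:11434/v1", model="llama3")

df = pd.DataFrame({"country": ["MY", "SG"], "revenue": [9500, 7000]})
sdf = SmartDataframe(df, config={"llm": llm})
print(sdf.chat("Summarise revenue by country"))
```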
Conclusions - Balancing 'benefit-cost-risk'
The capability to execute queries in natural language can open up unique opportunities for businesses. It taps the participation of the non-technical workforce in shaping business data and insights for an unprecedented level of value innovation.
Nevertheless, bridging the gaps between the 'benefit-cost-risk' elements of NL exploration capabilities for data sources is crucial for widespread adoption and implementation. The Malloy semantic data model complements the powerful LLMs behind PandasAI by keeping them in context for high accuracy and fewer hallucinations. At the same time, the combination minimises calls to data sources, keeps data fresh and executes queries locally to reduce overall compute cost and minimise exposure to malicious code.
Other References
Building Semantic Data Models in Malloy
Add conversational capabilities to DBMS with PandasAI
Data migration with Airbyte