Analytics with GenAI Agents (Text2SQL, CrewAI, LangChain, GPT-4o)

Kapil Raina
6 min read · Jun 3, 2024


Using GenAI for data analytics, to enable a human-like conversation for information needs, is now a realistic and apt use case. Give an LLM a database schema as context, write your information needs in plain language, and let the model figure out the right SQL queries to get to that information. Even better, use agents to orchestrate complex analytics tasks that provide not only raw data, but inferencing to derive meaning from the data, and then appropriate visual representations of it: a complete and powerful analytics use case. This pattern is already embedded in commercial BI tools such as Microsoft Power BI, but it can be set up for a bespoke application database/warehouse using GenAI frameworks such as:

LlamaIndex: for Text2SQL via its NLSQL* services (e.g. NLSQLTableQueryEngine); a minimal setup sketch follows this list.

CrewAI: the agentic framework that coordinates the different analytics tasks.

LangChain: to provide model access and agent tools.
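
To make the Text2SQL piece concrete, here is a minimal sketch of how LlamaIndex's NLSQLTableQueryEngine can be pointed at a schema. The connection string, table name, and question are illustrative assumptions, not taken from the repo:

# Minimal LlamaIndex Text2SQL sketch; connection details and table are illustrative
from sqlalchemy import create_engine
from llama_index.core import SQLDatabase
from llama_index.core.query_engine import NLSQLTableQueryEngine
from llama_index.llms.openai import OpenAI

engine = create_engine("postgresql://user:pass@localhost:5432/salesdb")  # hypothetical DSN
sql_database = SQLDatabase(engine, include_tables=["orders"])  # expose only the tables we need

query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    llm=OpenAI(model="gpt-4o"),
)
response = query_engine.query("What were the top 5 products by order value last month?")
print(response)  # natural-language answer
print(response.metadata["sql_query"])  # the SQL the model generated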

And this forms the premise for this post. Here I explore the various patterns that can be used to build this agentic analytics capability. Specifically, the goals to achieve are:

  • Use a natural-language input to trigger an analytics workflow. Here I use Streamlit for bot and UX creation, but it can be anything else (a small rendering sketch follows this list).
  • The tasks that need to be accomplished are: 1) detect whether the request is for inferencing, data retrieval, or both; 2) invoke tasks that abstract away the database interactions to get the data; 3) determine which visualisations are possible on the returned data and create them.
  • Trigger an Agent Crew (a CrewAI construct) that orchestrates the agents, hands off intermediate data between them, and finally returns the results.
  • Use a charting framework to interpret the chart recommendation results and use its native features to display the charts. Here I use Streamlit charts in the conversational bot, but it could be anything else, such as matplotlib.
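
As a rough idea of the client side, a Streamlit loop that hands the prompt to a crew and renders the results might look like the sketch below. The import path and the CSV assumption are illustrative guesses, not taken from the repo:

# Hypothetical Streamlit bot loop; the import path and data shape are assumptions
import io
import pandas as pd
import streamlit as st
from crews import DbexecutionCrew  # assumed module path for the repo's crew class

st.title("Analytics Bot")
prompt = st.chat_input("Ask a question about your data")
if prompt:
    with st.chat_message("user"):
        st.write(prompt)
    data = DbexecutionCrew("sales", prompt).run()  # CSV text, per the post
    df = pd.read_csv(io.StringIO(data))
    with st.chat_message("assistant"):
        st.dataframe(df)  # raw tabular answer
        st.bar_chart(df, x=df.columns[0], y=df.columns[1])  # one possible visualisation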

The supporting code for this post is available here: https://github.com/kapilraina/sqldataagents

A simple PostgreSQL “orders” schema has been used to test this (the schema definition is available in the repository).

Let's begin!

Based on the requirements, we need :

DBAgent: delegates to the data retrieval and inferencing tasks.

Charting Agent: delegates to the charting task, which reads chart definitions (here, Streamlit's) via a URL and creates the input data for those visualisations.

LLM model as the inferencing engine: the newly released GPT-4o has been used.
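
As a minimal sketch of how these pieces might be wired up with CrewAI and LangChain (the roles, goals, and task descriptions below are illustrative paraphrases, not the repo's exact definitions):

# Illustrative agent wiring; roles, goals and task text are paraphrased assumptions
from crewai import Agent, Task
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

db_agent = Agent(
    role="Database Analyst",
    goal="Retrieve data for the user's question and infer meaning from it",
    backstory="Expert at translating natural language into SQL over the given schema",
    llm=llm,
)
charting_agent = Agent(
    role="Charting Expert",
    goal="Recommend and prepare visualisations for the retrieved data",
    backstory="Knows which Streamlit chart types fit which data shapes",
    llm=llm,
)

retrieval_task = Task(
    description="Answer the user's question: {prompt}. Return the data as CSV.",
    expected_output="CSV data plus a short inference summary",
    agent=db_agent,
)
charting_task = Task(
    description="Recommend chart definitions for the retrieved data.",
    expected_output="JSON chart definitions",
    agent=charting_agent,
)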

Let's explore the design choices for putting these agents together, and the pros and cons of each choice.

  1. Unified Agents Crew

In this approach, we create a single crew that orchestrates between agents and then between tasks. The crew takes a user's information need and eventually returns the final output, which can consist of the data collection, inference text, and potential visualisations. The client code just uses the returned data to display it appropriately. Ref: run_unifiedagents.py

# Run the unified analytics crew
crew = AnalyticsCrew("sales", prompt)
data = crew.run()
# ...
rawdata = data["rawdata"]
charts = data["charts"]
# ...
# rendering code

The visual below depicts this behaviour:

The crew runs the agents until it is able to give the final results back. Using one crew means that all agent orchestration is done inline, with multiple database and LLM calls, and the final result is expected to be directly consumable.
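
A hypothetical sketch of what the AnalyticsCrew wrapper used above might look like (the real implementation is in the repo's run_unifiedagents.py; the class internals here are assumptions, reusing the agents and tasks sketched earlier):

# Hypothetical wrapper; the real version lives in run_unifiedagents.py
import json
from crewai import Crew

class AnalyticsCrew:
    def __init__(self, schema: str, prompt: str):
        self.schema = schema  # database schema to query against
        self.prompt = prompt
        # One crew owns both agents and both tasks, so all handoffs happen inside it
        self.crew = Crew(
            agents=[db_agent, charting_agent],
            tasks=[retrieval_task, charting_task],
        )

    def run(self) -> dict:
        result = self.crew.kickoff(inputs={"prompt": self.prompt})
        # Assume the final task emits JSON with "rawdata" and "charts" keys
        return json.loads(str(result))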

Pros:

  • The GenAI model is used for all the heavy lifting, so the development work is simplified.
  • The data is passed between agents and tasks without the need to maintain intermediate state.

Cons:

  • Due to the multiple agent orchestrations in the crew, the final output takes extra time to generate.
  • The extended orchestration means more calls to the LLM, and thus higher cost.
  • Since the entire dataset is used in the input context of the charting task, it results in more token consumption and thus higher cost.
  • All the data gets sent to an LLM, which may not be desirable in some cases.

2. Split Crews


To simplify the orchestration issues of the previous approach, this pattern creates two different crews: one for data retrieval and inferencing, and another for charting and visualisations. With this split, the user sees part of the results sooner and does not need to wait until the visualisation results are back; the visualisation crew is triggered after the data/inferencing output is displayed. Intermediate client code to handle the in-between data display and crew invocations needs to be put in place. This puts part of the solution in the client code, but results in less orchestration time and, compared to the previous approach, a lower overall completion time. Ref: run_splitagents.py

# Run the DB crew first
crew = DbexecutionCrew("sales", prompt)
data = crew.run()
# ...
# render data
# ...
# Run the visualisation crew next
vcrew = DataVisualizationCrew(data)
chartdata = vcrew.run()
# ...
# render charts

The visual below depicts this behaviour:

Pros:

  • Better user experience, since the user is able to see the data results immediately.
  • Works well in scenarios where the retrieved data is either inference-based or has no meaningful visualisations, so the user does not need to wait.
  • Less orchestration, since inter-agent orchestration is eliminated. That means fewer calls to the LLM and less token usage, which saves cost.

Cons:

  • Since the entire data output from the first crew is used in the input context of the charting task, it still results in more token consumption and thus higher cost.
  • All data gets sent to an LLM, which may not be desirable in some cases.

3. Optimised Split Crews


It turns out that an LLM can do a decent job of recommending chart visualisations based only on the structure of the DB Crew's output (which is CSV based), i.e. just the column structure, not the actual data. In this optimised approach, the Charting Crew receives only the data structure derived from the DB Crew output, and the client code then combines the recommended charting definitions with the actual DB Crew data to assemble the chart input before displaying it. This means significantly more developer work, since parts of the execution that were handled by agents and the LLM in the earlier approaches are now done in code. In return, we get leaner agents, fewer LLM calls, and cost savings owing to fewer LLM API calls and lower token usage. Ref: run_splitagents_optimized.py

# Run the DB crew first
crew = DbexecutionCrew("sales", prompt)
data = crew.run()
# ...
# render data (df below is the DataFrame built from the DB crew's CSV output)
# ...
columnson = generate_header_attributes_json(data)  # extract only the column attributes
# Run the visualisation crew with only the attributes, not the data
vcrew = DataVisualizationCrewSA(columnson, prompt)
chartdata = vcrew.run()
# ...
# merge the chart definitions with the actual data
chardef = chartdata["definition"]["columns"]
_cols_ = [chardef["x"], chardef["y"]]
_x_ = df[_cols_[0]]
_y_ = df[_cols_[1]]
tuples = list(zip(_x_, _y_))
cdf = pd.DataFrame(tuples, columns=_cols_)
# ...
# render the cdf charts
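
The generate_header_attributes_json helper is not shown in the post; a minimal sketch of what it might do, assuming the DB crew output is CSV text (the repo's actual helper may differ):

# Illustrative header-attribute extractor; the repo's actual helper may differ
import io
import json
import pandas as pd

def generate_header_attributes_json(csv_text: str) -> str:
    """Return only column names and inferred dtypes as JSON, never the row data."""
    df = pd.read_csv(io.StringIO(csv_text))
    attributes = [{"name": col, "dtype": str(dtype)} for col, dtype in df.dtypes.items()]
    return json.dumps({"columns": attributes})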

The visual below depicts this behaviour:

Pros:

  • The data doesn't get sent to the LLM.
  • LLM calls are reduced, owing to less need for inter-agent orchestration.
  • Token usage is minimal, since the input to the LLM is either a prompt or a data structure definition.
  • Renders the data fast and the visualisations faster than the other approaches, since the data processing for the charting definitions is done in local code.

Cons:

  • Needs relatively more code to run.

Some important aspects to consider:

  1. This example uses Streamlit for the conversational and charting experience, but these can be anything else based on the context of use.
  2. The crew and agents make multiple calls to the LLM during the reasoning process. This takes a toll on your API charges. If you can run this with a mid-sized model locally, that may save on API usage charges.
  3. The Chart Agent has been defined to support a single 2D chart (i.e. one dataset of X & Y). To support comparison prompts that overlay two datasets on the same chart, the charting agent may be extended.
  4. In this example the schema is small, so for the DB task the entire schema is sent as part of the LLM inference API call. With large schemas this can consume tokens unnecessarily, or even overflow the context window. LlamaIndex provides an option to put the schema in a vector database and retrieve only the relevant table definitions at run time (see the sketch after this list).
  5. The “sales” parameter in the code is the schema to be used. Since it is a variable, you can point this tool at any schema!
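
For point 4, here is a minimal sketch of that LlamaIndex option: table schemas are indexed in a vector store, and only the top-k relevant definitions reach the LLM at query time (the connection string is an illustrative assumption).

# Sketch of schema retrieval for large schemas; the DSN is hypothetical
from sqlalchemy import create_engine
from llama_index.core import SQLDatabase, VectorStoreIndex
from llama_index.core.objects import ObjectIndex, SQLTableNodeMapping, SQLTableSchema
from llama_index.core.indices.struct_store.sql_query import SQLTableRetrieverQueryEngine

engine = create_engine("postgresql://user:pass@localhost:5432/salesdb")  # hypothetical DSN
sql_database = SQLDatabase(engine)

# Index every table's schema as a vector-searchable object
table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = [
    SQLTableSchema(table_name=t) for t in sql_database.get_usable_table_names()
]
obj_index = ObjectIndex.from_objects(table_schema_objs, table_node_mapping, VectorStoreIndex)

# At query time, only the most relevant table definitions are sent to the LLM
query_engine = SQLTableRetrieverQueryEngine(
    sql_database, obj_index.as_retriever(similarity_top_k=3)
)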
