End-to-End LLM Monitoring, Observability, Evaluation and Calibration — Part 1
Hands-on demo illustration with building LLM-powered applications & inspecting it
The accompanying code for the app is here
In this blog post, I will demonstrate how to build a LLM-powered solution, and walk through on using Langfuse for LLM Observability and Evaluation. For more information on Langfuse & how to get started — checkout my previous tutorial here. In very simple layman terms — “Langfuse is an open-source LLM engineering platform that helps you to analyze, and iterate over your LLM applications.”
This tutorial will guide you practically on several aspects of LLM Observability with demonstrations split in as Part One, Part Two of a two-series post. In the first part of this blog series, I will show you how to build a LLM-based chatbot with Llama and analyze it with Langfuse. And, extend it further where I will show how our general chatbot can be restricted to filter non-travel related queries.
In the second part of this blog series, I will showcase how the built simple travel chatbot can now be tailored for more specific use-case of answering user’s travel related queries and recommending holiday packages as a Travel Planner AI chatbot. In continuation, I will also illustrate how to calibrate our built Travel Planner AI Assistant chatbot. So, stay tuned!!!
Note:- The prerequisites needed for you to try this demo hands-on are:-
A) Setting up Llama B) Setting up Langfuse [self-host/cloud], which I have already covered it in detail, in my previous post. So, pls do check it out as I won’t repeat it again for this tutorial.
Part One:
Building general CHATBOT with open-source LLM and monitoring with open-source LLM engineering platform
We will use Llama via ollama and langfuse (cloud version) for experiment tracking to test the chatbot built locally using open-source LLM and introspect its effectiveness before piloting it into pre-production/productional settings. For setting up ollama & langfuse — follow this post.
The steps mentioned below, provides a high-level overview of how we are going to proceed with building our application.
Steps Overview:-
- Build a chatbot using Gradio
Chatbot
- Add Langfuse Tracing to the chatbot
- Implement additional Langfuse tracing features used frequently in chat applications: chat sessions, user feedback
So, let’s get started —
First let’s create a virtual environment using conda
conda create -n langfuse_venv python=3.10
or using the Virtualenv
py -3.10 -m venv langfuse_venv
followed by activating our created Virtual Environment
Next, clone the git repo
git clone https://github.com/ajayarunachalam/LLM_Observability_Demo
Once the repository is cloned, navigate to part_one_blog_code folder, and install the necessary libraries as
pip install -r requirements.txt
In the cloned repo — update your Langfuse API keys in the file “.env” that we got from while setting up Langfuse (refer snippet as illustrated below)
Note:- According to your region, comment/uncomment accordingly. Once you have setup your Langfuse API keys, then we can proceed to build our chatbot interface with Gradio.
Import libraries
Implementation of Chat functions
Sessions/Threads
Each chat message belongs to a thread in the Gradio Chatbot which can be reset using clear
. We implement the following method that creates a session_id
that is used globally and can be reset via the set_new_session_id
method. This session_id will be used for Langfuse Sessions.
Response handler
When implementing the respond
method, we use the Langfuse @observe()
decorator to automatically log each response to Langfuse Tracing. And, get completion via ollama.
User feedback handler
We implement user feedback tracking in Langfuse via the like
event for the Gradio chatbot. This method reuses the current trace id available in the global state of this application. We also additionally capture the user comment, satisfaction rating, and any additional comments to Langfuse for the current trace id.
Retries
We also allow to retry a completion via the Gradio Chatbot retry
event (docs). Note:- This is not specific to the integration with Langfuse.
Implementing Gradio Chatbot
After implementing all methods above, we can now put everything together with the Gradio Chatbot inclusive of all our components.
Run Gradio Chatbot
Now time to test our built chatbot and launch it as illustrated below.
Navigate to the provided local URL and open it in your browser which should launch our bot application as seen below.
Now let’s test our chatbot with a user query — “Hello, who are you?”. It provides a response message as seen below — “I’m an AI chatbot…”
Now let’s provide a user-feedback for this response — with a “like”, providing a “comment”, providing a “satisfaction rating” and also providing “additional comments”
Once we record the human feedback, you can notice as seen below that for our current trace ID: “ff08bc29-….” all the feedback is captured and associated to Langfuse with our current trace ID.
Explore data in Langfuse
When interacting with the chatbot, you should see traces, sessions, and feedback scores in your Langfuse project as shown below. Navigate, to dashboard where can see all important metrics to monitor your LLM application that includes overall volume usage by model or token types, cost breakdowns by user, latency distributions and quality metrics tracing.
We can inspect each individual traces that reflect all the interactions with our chatbot.
Sessions are a way to group these traces together and replay interactions between users and our LLM based application. Below, you can an example session with user-feedback score via Scores that serves as objects for storing evaluation metrics in Langfuse.
The specific code for this part can be found here.
Part Two:
Building Travel Chatbot
Now that we saw how to build a LLM-based chatbot, let’s see how we can customize our general chatbot to only answer any queries related to travel. For that we will tweak our previous bot application to respond only if the user queries includes any of the travel keywords. Our example travel keywords list includes — “travel”, “visit”, “trip”, “flight”, “hotel”, “vacation”, “booking”, “destination”, “itinerary”, “holiday”, “plan”. The few changes that we need to make is firstly with the system prompt and secondly modifying the respond function to filter any non-travel related user queries.
Rest everything remains exactly the same. Now let’s run our travel bot as demonstrated below and run the APP with the provided local URL in our browser.
Once you launch the APP on your browser, you chat see our Travel bot interface as below. Now let’s test our bot with a user query — “Hello, who are you?”. As noticed, it provides a response message mentioning that it can only assist with travel related queries. That’s a great start!
Now, let’s test our travel bot with some specific travel related questions. We ask the query —“ I want to travel to Spain. Tell me the must visit places for a two day stay in Barcelona” as seen below with the response received from our travel bot. We then provide user-feedback for the received response by clicking the like handler, adding a comment, providing the rating, and adding any additional comments.
All the interactions with our travel bot are captured. One can see the traces along with user-feedback score as highlighted below. For other common evaluation methods refer this documentation.
The specific code for this part can be found here.
That’s the end of our first part of this tutorial series. In the second part of this blog series — I will show how given a dataset containing information about holiday packages, how we will customize our Travel chatbot as Travel Planner AI Assistant chatbot consuming the dataset and aim to provide accurate holiday recommendations based on the user requirements. So, stay tuned!!!.
Thank you for reading. I hope you learned something today. If you liked it encourage me to publish more contents with your support & love as a clap 👏
Connect with me
You can reach me at ajay.arunachalam08@gmail.com or connect me through Linkedin
“Knowledge is Power” — So, always constantly keep updating your knowledge & upskill by learning and sharing!!!
Check my GIT REPO — here
About Me
I am a Cloud Solution Architect & Machine Learning Specialist. In the past, I have worked in Telecom, Retail, Banking and Finance, Healthcare, Media, Marketing, Education, Agriculture, and Manufacturing sectors. I have 7+ years of experience in delivering Data Science & Analytic solutions of which 6+ years of experience is client facing. I have Lead & Managed, several large teams of Data Scientists, Data engineers, ML engineers, Data analysts & Business analysts. Also, I am experienced with Technical/Management skills in the area of business intelligence, data warehousing, reporting and analytics holding Microsoft Certified Power BI Associate Certifications. I have worked on several key strategic & data-monetization initiatives in the past. Being a certified Scrum Master, I deeply practice agile principles while focusing on collaboration, customer, continuous improvement, and rapid delivery.
Acknowledgments
- I take this opportunity to acknowledge & thank —
Langfuse
,Langchain
,Meta
,Gradio
,PromptCloud
,DataStock
and others. Special mention to “Tarun Mamidi”