End-to-End LLM Monitoring, Observability, Evaluation and Calibration — Part 1

8 min read6 days ago

Hands-on demo illustration with building LLM-powered applications & inspecting it

The accompanying code for the app is here

In this blog post, I will demonstrate how to build a LLM-powered solution, and walk through on using Langfuse for LLM Observability and Evaluation. For more information on Langfuse & how to get started — checkout my previous tutorial here. In very simple layman terms — “Langfuse is an open-source LLM engineering platform that helps you to analyze, and iterate over your LLM applications.”

This tutorial will guide you practically on several aspects of LLM Observability with demonstrations split in as Part One, Part Two of a two-series post. In the first part of this blog series, I will show you how to build a LLM-based chatbot with Llama and analyze it with Langfuse. And, extend it further where I will show how our general chatbot can be restricted to filter non-travel related queries.

In the second part of this blog series, I will showcase how the built simple travel chatbot can now be tailored for more specific use-case of answering user’s travel related queries and recommending holiday packages as a Travel Planner AI chatbot. In continuation, I will also illustrate how to calibrate our built Travel Planner AI Assistant chatbot. So, stay tuned!!!

Note:- The prerequisites needed for you to try this demo hands-on are:-

A) Setting up Llama B) Setting up Langfuse [self-host/cloud], which I have already covered it in detail, in my previous post. So, pls do check it out as I won’t repeat it again for this tutorial.

Part One:

Building general CHATBOT with open-source LLM and monitoring with open-source LLM engineering platform

We will use Llama via ollama and langfuse (cloud version) for experiment tracking to test the chatbot built locally using open-source LLM and introspect its effectiveness before piloting it into pre-production/productional settings. For setting up ollama & langfuse — follow this post.

The steps mentioned below, provides a high-level overview of how we are going to proceed with building our application.

Steps Overview:-

Build a chatbot using Gradio Chatbot
Add Langfuse Tracing to the chatbot
Implement additional Langfuse tracing features used frequently in chat applications: chat sessions, user feedback

So, let’s get started —

First let’s create a virtual environment using conda

conda create -n langfuse_venv python=3.10

or using the Virtualenv

py -3.10 -m venv langfuse_venv

followed by activating our created Virtual Environment

Next, clone the git repo

git clone https://github.com/ajayarunachalam/LLM_Observability_Demo

Once the repository is cloned, navigate to part_one_blog_code folder, and install the necessary libraries as

pip install -r requirements.txt

In the cloned repo — update your Langfuse API keys in the file “.env” that we got from while setting up Langfuse (refer snippet as illustrated below)

Note:- According to your region, comment/uncomment accordingly. Once you have setup your Langfuse API keys, then we can proceed to build our chatbot interface with Gradio.

Import libraries

Implementation of Chat functions

Sessions/Threads

Each chat message belongs to a thread in the Gradio Chatbot which can be reset using clear. We implement the following method that creates a session_id that is used globally and can be reset via the set_new_session_id method. This session_id will be used for Langfuse Sessions.

Response handler

When implementing the respond method, we use the Langfuse @observe() decorator to automatically log each response to Langfuse Tracing. And, get completion via ollama.

User feedback handler

We implement user feedback tracking in Langfuse via the like event for the Gradio chatbot. This method reuses the current trace id available in the global state of this application. We also additionally capture the user comment, satisfaction rating, and any additional comments to Langfuse for the current trace id.

Retries

We also allow to retry a completion via the Gradio Chatbot retry event (docs). Note:- This is not specific to the integration with Langfuse.

Implementing Gradio Chatbot

After implementing all methods above, we can now put everything together with the Gradio Chatbot inclusive of all our components.

Run Gradio Chatbot

Now time to test our built chatbot and launch it as illustrated below.

Image by author: Running general QA chatbot

Navigate to the provided local URL and open it in your browser which should launch our bot application as seen below.

Image by author: Launching our built Chatbot

Now let’s test our chatbot with a user query — “Hello, who are you?”. It provides a response message as seen below — “I’m an AI chatbot…”

Now let’s provide a user-feedback for this response — with a “like”, providing a “comment”, providing a “satisfaction rating” and also providing “additional comments”

Image by author: Interacting with the chatbot & capturing human feedback

Once we record the human feedback, you can notice as seen below that for our current trace ID: “ff08bc29-….” all the feedback is captured and associated to Langfuse with our current trace ID.

Image by author: Capturing events with current trace ID

Explore data in Langfuse

When interacting with the chatbot, you should see traces, sessions, and feedback scores in your Langfuse project as shown below. Navigate, to dashboard where can see all important metrics to monitor your LLM application that includes overall volume usage by model or token types, cost breakdowns by user, latency distributions and quality metrics tracing.

Image by author: Langfuse Dashboard reflecting traces of our chatbot interaction

We can inspect each individual traces that reflect all the interactions with our chatbot.

Image by author: Inspecting all the traces tracked

Sessions are a way to group these traces together and replay interactions between users and our LLM based application. Below, you can an example session with user-feedback score via Scores that serves as objects for storing evaluation metrics in Langfuse.

Image by author: Inspecting a particular session with user evaluation.

The specific code for this part can be found here.

Part Two:

Building Travel Chatbot

Now that we saw how to build a LLM-based chatbot, let’s see how we can customize our general chatbot to only answer any queries related to travel. For that we will tweak our previous bot application to respond only if the user queries includes any of the travel keywords. Our example travel keywords list includes — “travel”, “visit”, “trip”, “flight”, “hotel”, “vacation”, “booking”, “destination”, “itinerary”, “holiday”, “plan”. The few changes that we need to make is firstly with the system prompt and secondly modifying the respond function to filter any non-travel related user queries.

Rest everything remains exactly the same. Now let’s run our travel bot as demonstrated below and run the APP with the provided local URL in our browser.

Once you launch the APP on your browser, you chat see our Travel bot interface as below. Now let’s test our bot with a user query — “Hello, who are you?”. As noticed, it provides a response message mentioning that it can only assist with travel related queries. That’s a great start!

Image by author: Launched Travel Chatbot

Now, let’s test our travel bot with some specific travel related questions. We ask the query —“ I want to travel to Spain. Tell me the must visit places for a two day stay in Barcelona” as seen below with the response received from our travel bot. We then provide user-feedback for the received response by clicking the like handler, adding a comment, providing the rating, and adding any additional comments.

Image by author: Launched Travel Chatbot with specific query related to travel

All the interactions with our travel bot are captured. One can see the traces along with user-feedback score as highlighted below. For other common evaluation methods refer this documentation.

Image by author: Inspecting trace, session, and user feedback score of Travel bot in Langfuse

The specific code for this part can be found here.

That’s the end of our first part of this tutorial series. In the second part of this blog series — I will show how given a dataset containing information about holiday packages, how we will customize our Travel chatbot as Travel Planner AI Assistant chatbot consuming the dataset and aim to provide accurate holiday recommendations based on the user requirements. So, stay tuned!!!.

Thank you for reading. I hope you learned something today. If you liked it encourage me to publish more contents with your support & love as a clap 👏

Connect with me

You can reach me at ajay.arunachalam08@gmail.com or connect me through Linkedin

“Knowledge is Power” — So, always constantly keep updating your knowledge & upskill by learning and sharing!!!

Check my GIT REPO — here

About Me

I am a Cloud Solution Architect & Machine Learning Specialist. In the past, I have worked in Telecom, Retail, Banking and Finance, Healthcare, Media, Marketing, Education, Agriculture, and Manufacturing sectors. I have 7+ years of experience in delivering Data Science & Analytic solutions of which 6+ years of experience is client facing. I have Lead & Managed, several large teams of Data Scientists, Data engineers, ML engineers, Data analysts & Business analysts. Also, I am experienced with Technical/Management skills in the area of business intelligence, data warehousing, reporting and analytics holding Microsoft Certified Power BI Associate Certifications. I have worked on several key strategic & data-monetization initiatives in the past. Being a certified Scrum Master, I deeply practice agile principles while focusing on collaboration, customer, continuous improvement, and rapid delivery.

Acknowledgments

I take this opportunity to acknowledge & thank — Langfuse,Langchain, Meta, Gradio,PromptCloud,DataStock and others. Special mention to “Tarun Mamidi”

References

Langfuse

Langfuse is an open source LLM engineering platform. It includes observability, analytics, and experimentation…

langfuse.com

LLM Chatbot Evaluation Explained: Top Metrics and Testing Techniques - Confident AI

In this article, I'll share how to evaluate LLM chatbots using the latest LLM conversational metrics.

www.confident-ai.com

GitHub - langfuse/langfuse: 🪢 Open source LLM engineering platform: LLM Observability, metrics…

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets…

github.com