End-to-End LLM Monitoring, Observability, Evaluation and Calibration — Part 2
Hands-on demo illustration with building LLM-powered applications & inspecting it
The accompanying notebook and code for the app are here
Refer to the first part of this blog series here. In this second part, I will demonstrate how the simple travel chatbot we built can be tailored to the more specific use case of answering users' travel-related queries and recommending holiday packages as a Travel Planner AI Assistant. I will then illustrate how to calibrate this Travel Planner AI Assistant chatbot.
Note:- The prerequisites needed to try this demo hands-on are:
A) Setting up Ollama B) Setting up Langfuse [self-host/cloud], both of which I have already covered in detail in my previous post, so please do check it out.
Part Three:
Building Travel Assist Planner AI Chatbot using Ollama + Gradio + Langfuse
Here comes the more challenging and fascinating part! In the first part of this blog series, we saw how to build a chatbot application with an LLM and evaluate it with user feedback, scores, etc. Now we will focus on building a domain-specific chatbot for a travel company as an illustrative example. So, how can we turn this chatbot into a travel assist planner? There are several ways to address this problem. You can frame it as a RAG problem, depending on data availability (for a crisp overview of RAG in simple layman's terms, refer to this post). You can craft effective prompts using templates to build a travel-specific planner bot. You can also fine-tune the model, including standard fine-tuning, prompted fine-tuning, and so on.
The travel assist planner chatbot that we aim to build here will leverage an LLM to converse with the user, identify their specific needs, and then recommend holiday packages. For this, we will use a dataset containing information about holiday packages (package name, destination, number of days, sightseeing, start city, places covered, price, hotel details, rules, etc.) and supplement it with a rule-based approach to recommend the best-matching holiday package for the user's requirements.
Dataset overview:-
- Total Records Count: 4,57,541
- Domain Name: makemytrip.com
- Date Range: 01 Sep 2019 to 30 Sep 2019
- File Extension: csv
- Available Fields: Uniq Id, Crawl Timestamp, Package Name, Page Url, Package Type, Company, Destination, Itinerary, Places Covered, Travel Date, Hotel Details, Start City, Airline, Flight Stops, Onwards Return Flight Time, Meals, Price Per Two Persons, Per Person Price, Sightseeing Places Covered, Initial Payment For Booking, Cancellation Rules, Date Change Rules
High-level Workflow:-
- The chatbot initiates the conversation with the user.
- The chatbot keeps asking questions until the user's requirements are identified and captured.
- The chatbot performs matching and rule-based extraction.
- The chatbot recommends the most relevant holiday package and summarizes it.
So, let’s get started —
Travel Planner AI Assistant — System Design
Before we start thinking about the Gradio chatbot implementation directly, let's take a step back to work out the workflow, design it, develop it, and test it. I will develop it with Google Colab and test it, and once we finalize it, we can bundle it as an app.
We will first start with the initial conversation, where we use prompt engineering and chain-of-thought reasoning to enable the chatbot to keep asking questions until the user's requirements have been captured in a dictionary format. We also include few-shot prompting (a sample conversation between the user and the assistant) to align the model on user and assistant responses at each step.
The initialize_conversation() function initializes the variable conversation with the system message.
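For illustration, here is a minimal sketch of what this function might look like; the actual system prompt and few-shot examples in the notebook are more detailed, so treat the wording below as an assumption:

def initialize_conversation():
    # Illustrative system message; the real prompt in the notebook is richer
    system_message = (
        "You are a Travel Planner AI Assistant. Keep asking the user questions "
        "until you have captured these requirements: 'Destination', 'Package', "
        "'Origin City', 'Duration', 'Budget'. Then output them as a Python dictionary."
    )
    # A short few-shot exchange to align the model on the expected turn-taking
    conversation = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": "Hi, I want to plan a holiday."},
        {"role": "assistant", "content": "Great! Which destination do you have in mind?"},
    ]
    return conversation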
Building our next set of functions —
get_chat_model_completions(): This takes the ongoing conversation as input and returns the assistant's response.
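Since the model is served locally through Ollama, a minimal sketch could look like this (the model name matches the one used later in this post):

import ollama

def get_chat_model_completions(conversation):
    # Send the full message history to the locally served Llama model
    response = ollama.chat(model="llama3.2:latest", messages=conversation)
    return response.message.content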
Typically, whenever the chatbot is interacting with the user, all conversations and responses should be moderated to identify any inappropriate content. Let's look at a function that can help with this.
moderation_check(): This checks whether the user's or the assistant's message is inappropriate. If either is inappropriate, one can add a break statement to end the conversation.
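Since this is a fully local setup with no hosted moderation endpoint, one simple approach, sketched below under that assumption, is to ask the same local model to act as a lightweight moderator (the actual notebook implementation may differ):

import ollama

def moderation_check(message):
    # Ask the local model to classify the message as 'Flagged' or 'Not Flagged'
    prompt = (
        "Reply with exactly one word, 'Flagged' or 'Not Flagged'. "
        f"Is the following message inappropriate or unsafe?\n\n{message}"
    )
    response = ollama.chat(model="llama3.2:latest",
                           messages=[{"role": "user", "content": prompt}])
    verdict = response.message.content.strip().lower()
    return "Flagged" if verdict.startswith("flagged") else "Not Flagged"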
intent_confirmation_layer(): This function takes the assistant's response and evaluates whether the chatbot has captured the user's profile clearly. Specifically, it checks whether the following properties have been captured: [Destination; Package; Origin City; Duration; Budget].
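A hedged sketch of such an evaluation layer, again delegating the check to the local model (the prompt wording here is illustrative):

import ollama

def intent_confirmation_layer(assistant_response):
    # Ask the model to verify that all required profile keys have values
    required_keys = "'Destination', 'Package', 'Origin City', 'Duration', 'Budget'"
    prompt = (
        f"Reply 'Yes' only if the text below contains values for all of these keys: "
        f"{required_keys}. Otherwise reply 'No'.\n\n{assistant_response}"
    )
    response = ollama.chat(model="llama3.2:latest",
                           messages=[{"role": "user", "content": prompt}])
    return response.message.content.strip()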
dictionary_present(): This function checks whether the final understanding of the user's profile returned by the chatbot is a Python dictionary. We also use another function that helps with information extraction, extract_dictionary_from_string(): it takes the response output and extracts the user requirements dictionary.
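One common way to implement this extraction step is with a regular expression plus ast.literal_eval, as in this sketch (the real helper may parse differently):

import ast
import re

def extract_dictionary_from_string(text):
    # Grab the first {...} block in the response and parse it as a Python dict
    match = re.search(r"\{[^{}]*\}", text)
    if match:
        try:
            return ast.literal_eval(match.group(0))
        except (ValueError, SyntaxError):
            return None
    return None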
Then, we take the output of the previous layers, i.e., the user requirements in the form of a Python dictionary, and extract the relevant holiday recommendations. To do so, we use the function compare_holiday_with_user(), which compares the user's profile with the holiday packages in the dataset and provides the top recommendations. It performs the following steps (see the sketch after this list):
- It takes the user requirements dictionary as input.
- It filters the holidays based on the criteria.
- It matches the user requirements against the holiday dataset.
- It returns the matching holidays.
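A simplified pandas-based sketch of this matching step is shown below; the column names follow the dataset fields listed earlier, while the budget parsing and ranking logic are illustrative assumptions:

import pandas as pd

def compare_holiday_with_user(user_req, df):
    # Example: {'Destination': 'Cochin', 'Package': 'Standard',
    #           'Origin City': 'Mumbai', 'Duration': '7', 'Budget': '36000 INR'}
    budget = int("".join(ch for ch in user_req["Budget"] if ch.isdigit()))
    price = pd.to_numeric(df["Per Person Price"], errors="coerce")
    mask = (
        df["Places Covered"].str.contains(user_req["Destination"], case=False, na=False)
        & df["Package Type"].str.contains(user_req["Package"], case=False, na=False)
        & df["Start City"].str.contains(user_req["Origin City"], case=False, na=False)
        & (price <= budget)
    )
    # Return the cheapest matching packages as the top recommendations
    return df[mask].assign(_price=price[mask]).sort_values("_price").head(3)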
Finally, we come to the product recommendation layer, which uses the function initialize_conv_reco(). It takes the output of the compare_holiday_with_user function from the previous layer, formats the holiday products into a readable summary with the function format_holiday_summary, and provides the recommendations to the user.
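A sketch of such a formatting helper (the field names are taken from the dataset overview above; the exact output format in the repo may differ):

def format_holiday_summary(holiday_df):
    # Turn each matching package row into a short, readable recommendation line
    lines = []
    for i, (_, row) in enumerate(holiday_df.iterrows(), start=1):
        lines.append(
            f"{i}. {row['Package Name']}: visit {row['Sightseeing Places Covered']}, "
            f"₹{row['Per Person Price']} per person"
        )
    return "\n".join(lines)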
Note:- I have covered only a few key functions here to convey the high-level workflow.
Refer to the complete implementation of the Travel Planner here.
Run Travel Planner AI Assistant Chatbot
After implementing all the methods and testing them in a Jupyter Notebook, we can now put everything together as a Gradio chatbot application and launch it.
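Here is a stripped-down sketch of how the pieces could be wired into a Gradio chat interface; the real app in the linked repo also includes the moderation, intent-confirmation, recommendation, and Langfuse tracing layers:

import gradio as gr

conversation = initialize_conversation()

def respond(user_message, chat_history):
    # Append the user turn, query the model, and append the assistant turn
    conversation.append({"role": "user", "content": user_message})
    reply = get_chat_model_completions(conversation)
    conversation.append({"role": "assistant", "content": reply})
    chat_history.append((user_message, reply))
    return "", chat_history

with gr.Blocks() as demo:
    chatbot = gr.Chatbot(label="Travel Planner AI Assistant")
    msg = gr.Textbox(label="Your message")
    msg.submit(respond, [msg, chatbot], [msg, chatbot])

demo.launch()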
Open the provided local URL in your browser, which should launch our Travel Planner AI Assistant bot application as seen below.
Now let’s test our travel planner with a user query —
“Hi, My name is Ajay. I want to travel to Cochin. My max budget is 36000 INR and I wish to stay for 7 nights. I am looking for some Standard holiday package and my origin travel place would be from Mumbai”.
It provides a response message as seen below — “Based on your preferences, here are the recommended holiday packages: 1. A Relaxing holiday to Kerala — Free Speed Boat Ride: Visit Cochin, Munnar, Thekaddy, Allepey, Kovalam and Poovar, ₹16432 per person”
Now let's provide user feedback for this response: a "like", a "comment", a "satisfaction rating", and some "additional comments", as shown below.
Once we record the human feedback, you can see below that all the feedback is captured and associated in Langfuse with our current trace ID: "8f35f035-…".
All the interactions with our travel planner bot are captured. One can see the traces along with the user-feedback scores, as seen below.
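Under the hood, recording such feedback amounts to attaching scores to the current trace, along these lines (the score names and values here are illustrative assumptions, not the exact ones used in the app):

from langfuse import Langfuse

langfuse = Langfuse()
trace_id = "8f35f035-..."  # the trace ID of the rated interaction (truncated here)

langfuse.score(trace_id=trace_id, name="user-feedback", value=1)  # like = 1, dislike = 0
langfuse.score(trace_id=trace_id, name="satisfaction-rating", value=4)
langfuse.score(trace_id=trace_id, name="additional-comments", value=1,
               comment="Good Kerala package recommendation")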
The specific code for the APP part can be found here.
Part Four:
Calibrating Travel Planner AI Assistant chatbot with Langfuse Datasets & Experiments
Now, for the Travel Planner AI Assistant chatbot we built, we'll iterate on system prompts with the goal of getting the expected output (i.e., respond with a travel itinerary that includes the recommended Holiday Package Name, Attractions Covered, and Price per person) given the inputs (which include 'Destination', 'Package', 'Origin', 'Duration', and 'Budget'). For this, we use Langfuse datasets to store a list of example inputs and expected outputs. For simple illustration purposes, I have taken just a couple of examples. For a more precise overview of "Datasets & Experiments", one of the core features of Langfuse, refer to this tutorial.
We will run experiments for our application and trace them with the Langfuse SDKs.
To run experiments, we first need to create a dataset, add items to it, upload them to Langfuse, run the experiment, and evaluate it with the Langfuse decorator.
Create a dataset
from langfuse import Langfuse

langfuse = Langfuse()  # assumes Langfuse credentials are set as environment variables
langfuse.create_dataset(name="trip_planning_dataset")
Add Items
Add local items into our created Langfuse dataset — “trip_planning_dataset”.
Note:- Alternatively, you can also add items via the Langfuse UI.
local_items = [
    {
        "input": {
            "Destination": "Manali",
            "Package": "Deluxe",
            "Origin": "New Delhi",
            "Duration": "4",
            "Budget": "15000 INR"
        },
        "expected_output": {
            "Holiday Name": "Experiential Manali from Chandigarh (Candid Photography)",
            "Major attractions": "Vashishth Kund | Hadimba Temple | Tibetan Monastery | Personal Photoshoot in Manali | Solang Valley",
            "Price": "6023"
        }
    },
    {
        "input": {
            "Destination": "Jaipur",
            "Package": "Luxury",
            "Origin": "New Delhi",
            "Duration": "2",
            "Budget": "12000 INR"
        },
        "expected_output": {
            "Holiday Name": "Exotic Jaipur",
            "Major attractions": "City Palace | Hawa Mahal | Jantar Mantar | Amer Fort | Mehrangarh Fort",
            "Price": "10023"
        }
    }
]
Upload to Langfuse
import json

for item in local_items:
    langfuse.create_dataset_item(
        dataset_name="trip_planning_dataset",
        input=json.dumps(item["input"]),  # convert the input dictionary to a JSON string
        expected_output=json.dumps(item["expected_output"])  # convert expected_output to a JSON string
    )
Evaluating using Langfuse decorator — @observe()
The @observe() decorator makes it easy to trace our application.
# Langfuse decorator
import json

import ollama
from langfuse.decorators import observe, langfuse_context

@observe()
def run_my_custom_llm_app(input_data, system_prompt):
    # Accept either a JSON string or an already-parsed dictionary
    if isinstance(input_data, str):
        input_dict = json.loads(input_data)
    else:
        input_dict = input_data
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": json.dumps(input_dict)}  # ensure the input is correctly formatted
    ]
    completion = ollama.chat(model="llama3.2:latest", messages=messages).message.content
    return completion
Experiment runner
The experiment runner runs our application on each item in the dataset and evaluates the output.
# we use a very simple example eval here
def simple_evaluation(output, expected_output):
    return output == expected_output

def run_experiment(experiment_name, system_prompt):
    dataset = langfuse.get_dataset("trip_planning_dataset")
    for item in dataset.items:
        # item.observe() returns a trace_id that can be used to add custom evaluations later
        # it also automatically links the trace to the experiment run
        with item.observe(run_name=experiment_name) as trace_id:
            # run the application, passing the input and the system prompt
            output = run_my_custom_llm_app(item.input, system_prompt)
            # optional: add custom evaluation results to the experiment trace
            # we use the previously created example evaluation function
            langfuse.score(
                trace_id=trace_id,
                name="exact_match",
                value=simple_evaluation(output, item.expected_output)
            )
Run experiments
Now, we can easily run experiments with different configurations to explore which yields the best results.
from langfuse.decorators import langfuse_context

run_experiment(
    "directly_ask",
    "The user will input their travel requirements, which include 'Destination', 'Package', 'Origin', 'Duration', and 'Budget'. Respond with the travel itinerary that includes the Holiday Package Name, Attractions Covered, and Price per person."
)
run_experiment(
    "asking_specifically",
    "Hi, I want to travel to a destination"
)
run_experiment(
    "asking_specifically_1st_try",
    "The user will input where they want to travel"
)
run_experiment(
    "asking_specifically_2nd_try",
    "The user will input their travel requirements"
)
run_experiment(
    "asking_specifically_3rd_try",
    "You are a travel company AI assistant. Only respond to travel-related queries. The user will input their travel requirements"
)

# Ensure all events were sent to the Langfuse API
langfuse_context.flush()
langfuse.flush()
One can verify the average scores per experiment run, browse each run for an individual item, look at traces for debugging issues, etc. As shown below, we can inspect the created dataset by navigating to the Datasets option. You can see the traces of the "directly_ask" and "asking_specifically_3rd_try" experiment runs.
We can further inspect an experiment trace in detail to introspect how our application ran on each item in the dataset and how its output was evaluated, as seen below.
Note:- We have used a very basic evaluation, "exact match", which may not be the most appropriate evaluation for our use case, and we have added only a few examples to our dataset; hence you can observe that the exact_match score returned is 0.0. We can further calibrate by providing more few-shot examples in our datasets and running the experiments with different configurations, to test the performance of our travel chatbot and find the optimal one. One possible refinement is sketched below.
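As one such refinement (a sketch, not part of the original notebook), a field-level evaluation parses both outputs as JSON and scores the fraction of expected fields that match, which is more forgiving than a strict string comparison:

import json

def field_level_evaluation(output, expected_output):
    # Score the fraction of expected fields whose values match the output
    try:
        out = json.loads(output)
        exp = json.loads(expected_output)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not isinstance(out, dict) or not exp:
        return 0.0
    matched = sum(
        1 for key, val in exp.items()
        if str(out.get(key, "")).strip() == str(val).strip()
    )
    return matched / len(exp)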
If you go to your project on the Langfuse dashboard, you can see all traces and the important metrics to monitor and evaluate your LLM application, including overall usage volume by model or token type, cost breakdowns by user, latency distributions, quality metrics, etc., as seen below.
The specific code for this part can be found here.
This concludes our final part of the tutorial series. Thank you for reading.
I hope you learned something today. If you liked it, encourage me to publish more content with your support & love as a clap 👏
Connect with me
You can reach me at ajay.arunachalam08@gmail.com or connect with me through LinkedIn
“Knowledge is Power” — So, always constantly keep updating your knowledge & upskill by learning and sharing!!!
Check my GIT REPO — here
About Me
I am a Cloud Solution Architect & Machine Learning Specialist. In the past, I have worked in the Telecom, Retail, Banking and Finance, Healthcare, Media, Marketing, Education, Agriculture, and Manufacturing sectors. I have 7+ years of experience delivering Data Science & Analytics solutions, of which 6+ years were client-facing. I have led and managed several large teams of Data Scientists, Data Engineers, ML Engineers, Data Analysts & Business Analysts. I am also experienced with technical/management skills in the areas of business intelligence, data warehousing, reporting, and analytics, holding the Microsoft Certified Power BI Associate certification. I have worked on several key strategic & data-monetization initiatives in the past. Being a certified Scrum Master, I deeply practice agile principles while focusing on collaboration, customers, continuous improvement, and rapid delivery.
Acknowledgments
- I take this opportunity to acknowledge and thank Langfuse, Langchain, Meta, Gradio, PromptCloud, DataStock, and others. Special mention to "Varun Sehrawat".