Identifying the Root Cause of LLM Performance Issues with Embedding Analysis

Ajay Arunachalam


Using Meta’s Llama LLM to generate summaries and Arize AI’s Phoenix to analyze {article} → {generated-summary} pairs and evaluate them against ground truth.

What is Phoenix?

Arize AI’s Phoenix is an open-source AI observability platform for experimenting with, evaluating, and debugging your LLM-powered applications. Phoenix can run practically anywhere, be it your own self-hosted environment or a cloud environment, and it can also be launched directly within a notebook environment.
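
A minimal sketch of launching Phoenix from a notebook (assuming the arize-phoenix package is installed):

```python
# pip install arize-phoenix
import phoenix as px

# Launch the Phoenix app locally; the returned session exposes the UI URL,
# and the UI can also be rendered inline in a notebook.
session = px.launch_app()
print(session.url)
```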

What are Embeddings?

In a nutshell, embeddings are a way to turn complex things (like text, images, video, or audio) into lists of numbers (vectors) that capture their meaning and relationships. They help machines understand and organize information in a way that’s similar to how humans think.
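
As a rough illustration of that idea (using the sentence-transformers library, which is just an assumption for this sketch and not necessarily the embedding model used later in the article), semantically related sentences end up with nearby vectors:

```python
# pip install sentence-transformers  (assumed dependency for this illustration)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

sentences = [
    "The cat sat on the mat.",
    "A kitten is resting on a rug.",
    "Quarterly revenue grew by 12 percent.",
]
embeddings = model.encode(sentences)  # one fixed-length vector per sentence

# Related sentences (0 and 1) score higher than unrelated ones (0 and 2).
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))
```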

Why Are Embeddings Useful?

Embeddings help computers understand and work with things like words, images, or even user preferences in a way that makes sense.

For example:

  • Search Engines: When you type a query, embeddings help the search engine find results that are related to your words, even if they don’t match exactly.
  • Recommendation Systems: If you’ve ever shopped online and seen “You might also like…”, embeddings are often behind those suggestions. They figure out what’s similar to what you’ve already liked or bought.
  • Language Translation: Embeddings help translate words and sentences by understanding their meaning, not just replacing one word with another.

Evaluations using NLP Metrics

Phoenix allows you to use traditional NLP metrics such as ROUGE, BLEU, METEOR, and their variants on the fly. Put simply, these metrics compare the similarity between a predicted text and a reference text. For a better understanding of these evaluation metrics, refer to ref1 and ref2.

For our evaluation, we will compute the above metrics to validate the quality of the generated summaries (predictions) against the human-written summaries (references).
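
A minimal sketch of computing such scores (here using Hugging Face’s evaluate package, which is one of several options and an assumption of this sketch, not necessarily what the notebook uses):

```python
# pip install evaluate rouge_score nltk  (assumed dependencies for this sketch)
import evaluate

predictions = ["the cat sat on the mat"]        # generated summaries
references = ["a cat was sitting on the mat"]   # human-written reference summaries

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")

print(rouge.compute(predictions=predictions, references=references))   # includes rougeL
print(bleu.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))
```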

Embedding Analysis using Phoenix — Overview

Phoenix does support embedding visualization. It takes the high-dimensional embeddings, reduces them to a low-dimensional space, and then clusters them into similar groups. The visualization includes features such as clustering, embedding drift over time, and a UMAP point cloud. Using these views, one can inspect the grey areas where the LLM application is performing poorly. For further details, check the Embedding Analysis docs.
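
Phoenix handles the reduction and clustering for you, but as a conceptual sketch of the pipeline (assuming the umap-learn and hdbscan packages; this is an illustration, not Phoenix’s actual internals):

```python
# pip install umap-learn hdbscan  (assumed dependencies for this conceptual sketch)
import hdbscan
import numpy as np
import umap

# Pretend these are 768-dimensional embeddings for 1,000 documents.
high_dim_embeddings = np.random.rand(1000, 768)

# Reduce to 3 dimensions for a point-cloud view.
low_dim = umap.UMAP(n_components=3, random_state=42).fit_transform(high_dim_embeddings)

# Group the low-dimensional points into clusters of similar documents.
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(low_dim)
print(set(labels))  # -1 denotes noise points that belong to no cluster
```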

Use-case Illustration: Generating summaries with OLLAMA + Evaluation & Inspection using Phoenix

The steps below provide a high-level overview of how we will proceed with this use case.

Steps Overview:

  1. Loading the curated example dataset.
  2. Generating summaries for our example dataset using Meta’s open-source LLM (a minimal sketch follows this list).
  3. Computing LLM evaluation metrics to estimate summary quality.
  4. Computing embeddings for the articles and the generated summaries.
  5. Creating the schema and Phoenix datasets, and launching the platform (see the sketch after the dataset columns below).
  6. Using the platform’s embedding visualization features to inspect the LLM’s performance and issues.

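As referenced in step 2, a minimal sketch of generating the summaries locally (assuming the ollama Python client is installed and the llama3.2 model has been pulled) might look like this:

```python
# pip install ollama  (and run `ollama pull llama3.2` beforehand; both are assumptions of this sketch)
import ollama

PROMPT_TEMPLATE = "Please summarize the following document in English: {article}"

def summarize(article: str) -> str:
    # Ask the local Llama 3.2 model to summarize one article.
    result = ollama.generate(model="llama3.2", prompt=PROMPT_TEMPLATE.format(article=article))
    return result["response"]

# Example usage on a toy article:
print(summarize("The city council approved a new transport budget on Tuesday..."))
```
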
The notebook provided below will guide you through each of the above steps. ⚠️ Add your Phoenix API key. Check out this quick start guide to create your API key.

The original example dataset can be found here.

The columns of our modified dataset are:

  • prediction_timestamp: timestamp of original LLM-generated summary
  • article: the news article to be summarized
  • summary: the LLM-generated summary created using the prompt template: “Please summarize the following document in English: {article}”
  • ollama_summary: the Llama 3.2-generated summary created using the same prompt template: “Please summarize the following document in English: {article}”
  • reference_summary: the reference summary written by a human, used to compute the ROUGE, BLEU, and METEOR scores, where higher scores indicate greater closeness to the reference
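
Given these columns, here is a hedged sketch of steps 4 and 5: computing the embeddings and describing the dataframe to Phoenix via a schema. The embedding names article_vector and summary_vector match what the UI shows later; the file path, embedding model, and score column names other than rougeL_score and meteor_score are assumptions of this sketch.

```python
# pip install arize-phoenix sentence-transformers pandas  (assumed dependencies)
import pandas as pd
import phoenix as px
from sentence_transformers import SentenceTransformer

df = pd.read_parquet("summarization_dataset.parquet")  # hypothetical path to the modified dataset

# Step 4: compute embeddings for the articles and the generated summaries.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
df["article_vector"] = list(encoder.encode(df["article"].tolist()))
df["summary_vector"] = list(encoder.encode(df["ollama_summary"].tolist()))

# Step 5: describe the dataframe to Phoenix and launch the app.
schema = px.Schema(
    timestamp_column_name="prediction_timestamp",
    prompt_column_names=px.EmbeddingColumnNames(
        vector_column_name="article_vector",
        raw_data_column_name="article",
    ),
    response_column_names=px.EmbeddingColumnNames(
        vector_column_name="summary_vector",
        raw_data_column_name="ollama_summary",
    ),
    tag_column_names=["rougeL_score", "bleu_score", "meteor_score", "reference_summary"],
)

session = px.launch_app(primary=px.Inferences(dataframe=df, schema=schema, name="llama-summaries"))
print(session.url)
```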

Inspecting the root cause

We will use Phoenix to find the root cause of the LLM’s performance issues. Once it is launched, you can open the Phoenix UI in the notebook or in a new browser tab, which will show the UI as seen below.

Image by author: the Phoenix UI launched within the notebook or opened separately in a browser

The picture above shows the Euclidean distance of the “article_vector” and “summary_vector” embeddings. We can now explore these lower-dimensional representations of our embedding data to identify clusters of high drift and performance degradation. The view also displays the different NLP scores computed. Further, it allows us to analyze drift and conduct A/B comparisons of our inference set against our baseline reference set.

You can click on “article_vector” to go to the embeddings view for your prompts (the input news articles). Select a period of high drift. Color your data by the “rougeL_score” dimension. The problematic clusters, with low ROUGE-L scores, appear in blue; the well-performing clusters, with high ROUGE-L scores, appear in green. Use the lasso to select part of your data and inspect the prompt-response pairs. Select each cluster in the left panel and look at its prompt-response pairs. Notice that the LLM does a good job summarizing the English articles in the green cluster (high ROUGE-L score) but struggles to summarize the Dutch articles in the blue cluster (low ROUGE-L score). In other words, our LLM-powered solution is struggling to summarize Dutch news articles.

Similarly, we can repeat the inspection above for the other score dimensions, such as “bleu_score” and “meteor_score”.

Similarly, you can click on “summary_vector” to go to the embeddings view for your responses (the Llama-generated summaries).

Let’s inspect a single instance of both an English and a Dutch article.

Below is an instance of an English article showing the article (prompt), the generated summary (response), the reference human summary, and the computed ROUGE-L score.

Image by author: Displaying a clicked view of an English article instance within Phoenix UI

On inspecting a particular instance of a Dutch article, as seen below, you can notice that the LLM struggles to summarize it well, as reflected by a lower ROUGE-L score.

Image by author: Displaying a clicked view of a Dutch article instance within the Phoenix UI

We can also peek into the inference set dataframe, which shows our estimated NLP metrics, as seen below.

Image by author: Example showing records from Inference set with the estimated NLP metric scores
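
Reusing the dataframe from the earlier sketch, that peek can be as simple as (column names assumed, as noted above):

```python
# A few records of the inference set alongside the estimated NLP metric scores.
print(df[["article", "ollama_summary", "rougeL_score", "bleu_score", "meteor_score"]].head())

# Summary statistics give a quick feel for overall summary quality.
print(df[["rougeL_score", "bleu_score", "meteor_score"]].describe())
```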

Complete Demonstration — Colab Notebook

This notebook provides a comprehensive, step-by-step walkthrough for you to easily get started.

Scope for Improvement

There are many things that can be tried to further improve the model’s performance, including but not limited to:

  1. Modifying our simple base prompt to generate better responses, e.g., trying out few-shot prompting, tuned prompts, etc. (a small sketch follows this list).
  2. Trying out different summarization strategies.
  3. Fine-tuning the model with a custom dataset.
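
As an example of item 1, a hedged sketch of what a few-shot version of the base prompt could look like (the example article/summary pairs are placeholders, not taken from the dataset):

```python
# A hypothetical few-shot prompt: show the model a couple of
# article -> summary examples before asking for the new summary.
FEW_SHOT_TEMPLATE = """You are a news summarization assistant.

Article: {example_article_1}
Summary: {example_summary_1}

Article: {example_article_2}
Summary: {example_summary_2}

Please summarize the following document in English:
Article: {article}
Summary:"""

prompt = FEW_SHOT_TEMPLATE.format(
    example_article_1="(short example article 1)",
    example_summary_1="(its human-written summary)",
    example_article_2="(short example article 2)",
    example_summary_2="(its human-written summary)",
    article="(the new article to summarize)",
)
print(prompt)
```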

That’s all, folks! If you liked the article, please clap for it (multiple times 👏), write a comment, and follow me on Medium and LinkedIn.

Acknowledgments

  • Credits to Arize AI

