Exploring “Langfuse” — An open-source LLM Engineering Platform
Traces, evals, prompt management, playground, metrics and many more cool features to debug and improve your LLM application
In this blog post, I demonstrate on building a Content Generation & Summarization Service Application using Llama + Langfuse and further guide on how to inspect traces, carry evaluations, capture human-feedback and calibrate our LLM application.
Access this complete tutorial code from here
What is Langfuse — “It is an open-source engineering platform for observability, evaluations, prompt management, etc. ”
Core features of LANGFUSE
- Langfuse Dashboard
With langfuse dashboard you can see all important metrics to monitor your LLM application that includes overall volume usage by model or token types, cost breakdowns by user, latency distributions and quality metrics tracing is at the very core of platform.
- Tracing
It will capture all the interactions with your LLM-application. For e.g. in a typical RAG application you will see traces of the interaction and length view which highlights which documents actually went into producing the answer as they were fetched from our corpus, how the embedding workflow worked and then the context and to understand how it reached the response.
- Evaluation
Langfuse interface evaluation is super important to understand how the LLM-powered application actually works. You can use different evaluation methods for production applications, you can capture user-feedback and any comments via the SDKs. You can configure Langfuse to run element, judge evaluators on all new created traces, these evaluators can run on the whole trace input and output or also evaluate a section of the trace, etc.
- Prompt Management
In simple layman terms — Langfuse prompt management is a Prompt Content Management System. Langfuse prompt management helps to version edit prompt to production and instantly roll back prompts, thereby everyone on the team can contribute and see which prompts are used without touching the code- base. Langfuse prompts are linked to traces as we can see which prompt was used when a specific good or bad trace has happened going back to the
prompt.
- Datasets and Experiments
Simply, datasets can be used in development to test your LLM applications. A data set is a test set of example inputs and outputs. Langfuse datasets & experiments can be used to test which of our prompt versions is actually better to make a good decision before moving them to production. Say if there are some new ideas that you want to test every week and you need to do prompt iteration, a new model checkpoint or a new agent strategy or a new retrieval method or change configuration etc this is where experimentation comes into play.
- Playground
With the LLM playground, you can now test and iterate your prompts directly in Langfuse. Either start from scratch or jump into the playground
Note:- Langfuse can be either self-hosted or alternatively you can also use their cloud version “hobby free plan” with certain limits but which still should be sufficient for doing any Proof-Of-Concept or pre-pilot testing's, rapid prototyping, etc.
The free plan includes —
- All platform features (with limits)
- 50k observations / month
- 30 days data access
- 2 users
- Community support (Discord & GitHub)
Note:- In terms of Security aspect they claim that “Langfuse Cloud is SOC 2 Type II and ISO 27001 certified and GDPR compliant.”
Enough of talking talking as — “Talk is cheap. Show me the code.” — Linus Torvalds.
Let’s deep-dive into our demo illustration.
Llama + Langfuse — Trace, evaluate, iterate your local LLM experiments (Walk through)
We will use Llama via ollama and langfuse (cloud version) for experiment tracking to test the LLM application built locally using open-source LLM and introspect its effectiveness before piloting it into pre-production/productional settings. A representational workflow is provided below, where a user sends query to the “LLM” on our local computer and Langfuse platform that captures all the interactions — i.e., all the different traces, the different spans, the different components, etc.
Steps Overview:-
a) Setup Langfuse account [cloud or self-host]
b) Setup Llama
c) Running the LLM-powered application
A] Setting up Langfuse
For sake of quick prototyping and testing, I used the Langfuse Cloud version that is hosted & fully maintained by their team, but alternatively one can self-host the platform locally using Docker Compose. For setting Langfuse locally, follow this documentation.
To use the Cloud version, navigate to the url — https://cloud.langfuse.com/ which will load the sign-up page as shown below where we need to create our account. If “US” region, create account accordingly.
Once you create your account, we need to Setup our account by going through 1–4 steps, where we provide organization name, organization members info, creating project and setting up traces.
Note:- The key above can only be viewed once. You can always create new keys in the project settings as shown below.
Navigate to API Keys, to setup Langfuse API key as illustrated below.
Finally, if you go back to your home page that should now list your created project as seen below.
B] Setting up Llama
Llama 3.2 is available to run using Ollama which is an open-source tool that lets you run Large Language Models (LLMs) locally on your computer.
To get started, Download Ollama, according to your OS. In Windows, open the PowerShell and use the following command, to download & run the model.
ollama run llama3.2
Since, I have already downloaded it on my machine, I skip the above command and check the available downloaded models on my local machine with the command “ollama list” as shown below.
C] Creating Content Generation & Summarization Demo Application
First let’s create a virtual environment using conda
conda create -n langfuse_venv python=3.10
or using the Virtualenv
py -3.10 -m venv langfuse_venv
followed by activating the created Virtual Environment, and next install all the libraries as highlighted in yellow below [ollama; python-dotenv; langfuse]
Next, git clone the repo
In the cloned repo — update your Langfuse API keys in the file “.env” that we got from while setting up Langfuse (refer snippet as illustrated below)
Note:- According to your region, comment/uncomment accordingly. Once you setup your API keys, then we are ready to test our built LLM-application. From the cloned repo, run the python script & enter a topic for content generation & summarization as shown below.
Once you provide a topic (eg: Generative AI) the user input is send to “LLM” (Llama3.2) configured on our local machine via Ollama (Refer the example snippet below)
Langfuse captures all the interactions with our LLM-powered application as shown below. Navigate to your Langfuse account via https://cloud.langfuse.com/ and in your project on the langfuse dashboard you can see all traces, important metrics to monitor your LLM application that includes — overall volume usage by model or token types, cost breakdowns by user, latency distributions, quality metrics, etc.
Next, we can further explore all the traces of from our LLM-application that went into content generation, content summarization, content evaluation as seen below.
Note:- Currently, only OpenAI models are supported, so you will require OpenAI API key for setting up evaluators/templates. You can create custom evaluators, or use existing evaluators, create new templates or use any existing template configured as shown by their respective Langfuse UI interfaces below as per one’s use-case and evaluation metrics requirements by clicking “New template”/ “New evaluator” options.
Below, you can see an example of the content generated & it’s summarization provided for the user-inputted topic, along with the capturing of the human-feedback.
You can further inspect the trace in Langfuse for more details which shows the user captured feedback highlighted in red as seen below.
There are many many more things that I wish to illustrate with Langfuse but for the sake of the time & length, I will windup here. For more detailed in-depth know-how of evaluations, refer this document. For more on calibration of your LLM-application, refer this docs.
You can find the complete code on GitHub.
Summary
Languse is a open-source “LLMOps” product that can be used for tracing & evaluations of your LLM-powered applications. It is a scalable advanced observability platform that also provides provision of incorporating human-feedback, examine different testing strategies, calibrating your application, etc. for better performance and reliability which is very instrumental for shipping a successful & robust LLM-powered applications in your productional settings.
Thank you a lot for reading this article. I hope this article was insightful for you. If you do like this blog post, pls encourage me to publish more contents with your support & love by a clap 👏
All the images are produced by me unless otherwise stated.
Contact Me
You can reach me at ajay.arunachalam08@gmail.com or connect me through Linkedin
“Knowledge is Power ” — So, always constantly keep updating yourself & developing by learning and sharing!!! Check my Git Repo here
About Me
I am a Cloud Solution Architect & Machine Learning Specialist. In the past, I have worked in Telecom, Retail, Banking and Finance, Healthcare, Media, Marketing, Education, Agriculture, and Manufacturing sectors. I have 7+ years of experience in delivering Data Science & Analytic solutions of which 6+ years of experience is client facing. I have Lead & Managed several large team of Data engineers, ML engineers, Data Scientists, Data analysts & Business analysts. Also, I am experienced with Technical/Management skills in the area of business intelligence, data warehousing, reporting and analytics holding Microsoft Certified Power BI Associate Certifications. I have worked on several key strategic & data-monetization initiatives in the past. Being a certified Scrum Master, I deeply practice agile principles while focusing on collaboration, customer, continuous improvement, and rapid delivery.