Journey to Deployment: Navigating the LLM Development Process
A Primer across Ideation, Evaluation, Optimization, Monitoring & Beyond
The path from the idea of using an LLM module in your product's workflow to actually deploying it is full of opportunities, but it is also rife with potholes and bumps. More importantly, this road is still under construction for everyone. New startups, tools and libraries keep emerging, iterating and evolving rapidly.
This post first takes each major step of the LLM development process as it stands today and discusses the purpose it serves. Along the way, we dissect the status quo and lay out a developer's approach to building an end-to-end deployable LLM system. We break this end-to-end LLM workflow development into four major phases:
1. The Proof of Concept Driveway - We want to get something up and running quickly and determine that a Large Language Model is a good choice for our problem statement, because believe it or not: LLMs are not all you need.
2. The Pavement of Evaluations - At this stage, we intuitively know that an LLM is a good approach for our problem statement. We now want more rigor in how we define success, and that has to be twofold: business metrics and machine learning metrics. We need to fetch or annotate, and have ready, a set of samples with ground truths on which we can run these metrics. Before we head out onto the road to improve our prototype, we need to quantify what an improved version should eventually accomplish.
3. The Long Winding Road to Optimal Performance - Now that we're out on the road, we need to take the prototype we created in the driveway and raise it to our desired evaluation metrics. We try out different prompts, model providers, data sources and modeling techniques - nothing is out of bounds. We want to achieve our desired evaluation scores.
4. The Monitoring & Feedback Highway - We are cruising on the freeway now. The Language Model pipeline works in an offline static setting. We are confident by virtue of our evals, metrics and extensive prompt and model tuning. It is about time we put it out in the real world. For that, we need to ensure monitoring, set up feedback mechanisms and track real world performance.
On the highway, we are also prepared to iterate and return to any of the earlier phases - we might need to go back to the pavement and adjust our evaluations if real-world performance does not align with our offline evals, or take a U-turn onto the long winding road and improve performance further if it is not quite there. This is not the highway to hell; it is the highway to continuous iteration, the backbone of every thoughtfully designed, reliable ML system.
With the context of the journey now set up, let us look at what tooling and frameworks exist today to help us along the path.
The Proof of Concept Driveway
LangChain is one of the most popular open-source frameworks for getting started with LLM development, letting developers try out different models and prompt templates out of the box. It has built significant layers of abstraction so the end user does not have to worry about many of the nitty-gritty details, and it is under very active development with new features and integrations added regularly.
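As a concrete illustration, a minimal LangChain prototype might look like the sketch below. Treat it as a starting point: import paths change across LangChain versions, and the model name is just a placeholder.

```python
# Minimal proof-of-concept sketch with LangChain-style prompt templates.
# Import paths and class names vary across LangChain versions; adjust as needed.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a support assistant that classifies tickets."),
    ("human", "Classify this ticket as billing, bug, or feature request:\n{ticket}"),
])

# Swap the model or provider freely while prototyping.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

chain = prompt | llm
result = chain.invoke({"ticket": "I was charged twice for my subscription."})
print(result.content)
```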
Depending on your application, there are also more specialized libraries. For example, if you have decided on a RAG-based system for your PoC, you might want to try LlamaIndex, which is geared towards exactly that. There are also prompt playground tools that let you tinker with prompts and try different providers in a UI for faster iteration cycles. One that I love is empirical.run, which lets you compare models given the same or different prompts and input samples side by side, and share that view with other stakeholders.
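If RAG is the direction you choose, a LlamaIndex prototype can be roughly this small. The module paths below follow recent llama-index releases and may differ in yours, and ./docs is a placeholder directory.

```python
# Rough RAG proof-of-concept sketch with LlamaIndex.
# Module paths follow recent llama-index releases; "./docs" is a placeholder directory.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./docs").load_data()

# Build an in-memory vector index over the documents and query it.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("What is our refund policy?")
print(response)
```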
The Pavement of Evaluations
This pavement is messy. There are incumbents such as Weights & Biases, who have integrated LLM evaluations into their product. Then there is LangSmith by LangChain, which is more LLM-focused and includes a Datasets feature that lets you host a set of evaluation samples on their platform and run experiments against it.
Open-source platforms such as MLflow and TruLens are also good options for tracking your metrics. You could also treat your evaluation samples as test cases for your LLM system and use pytest, which works well too.
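For instance, a pytest setup that treats evaluation samples as parametrized test cases might look like the sketch below, where run_pipeline is a stand-in for your own LLM workflow entry point.

```python
# test_evals.py - treating evaluation samples as pytest test cases.
# `run_pipeline` is a placeholder for your own LLM workflow entry point.
import pytest

EVAL_SAMPLES = [
    {"input": "I was charged twice this month.", "expected_label": "billing"},
    {"input": "The export button crashes the app.", "expected_label": "bug"},
]

def run_pipeline(text: str) -> str:
    # Replace with a call into your LLM chain that returns a predicted label.
    raise NotImplementedError

@pytest.mark.parametrize("sample", EVAL_SAMPLES, ids=lambda s: s["expected_label"])
def test_pipeline_matches_ground_truth(sample):
    prediction = run_pipeline(sample["input"])
    assert prediction == sample["expected_label"]
```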
There are a few important aspects to consider when deciding which tool to pick:
- Does it allow easily incorporating custom metrics into evaluations? (See the sketch after this list.)
- Does it natively parallelize evaluation runs?
- Does it report both individual sample-level metrics and aggregate metrics?
- Does it have a nice user interface for visually debugging what happened in an evaluation run?
- Cost
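To make the custom-metric and sample-versus-aggregate points concrete, here is a rough sketch of a custom metric computed per sample and then aggregated. The metric itself (keyword overlap) is just an illustrative stand-in for whatever your application actually needs.

```python
# Sketch of a custom evaluation metric: per-sample scores plus an aggregate.
# Keyword overlap is only a stand-in; plug in whatever metric fits your task.
from statistics import mean

def keyword_overlap(prediction: str, reference: str) -> float:
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    return len(pred_tokens & ref_tokens) / max(len(ref_tokens), 1)

def evaluate(predictions: list, references: list) -> dict:
    per_sample = [keyword_overlap(p, r) for p, r in zip(predictions, references)]
    return {"per_sample": per_sample, "aggregate": mean(per_sample)}

print(evaluate(["refund issued to customer"], ["customer refund was issued"]))
```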
Another aspect of creating good evaluations is having good samples that make up those evaluations. There are free, open-source data annotation platforms, such as Doccano, that help you do this annotation faster and more efficiently. You might also use paid products such as Scale AI if the annotation is more involved. You should also run data quality checks on the samples you use and ensure that these examples are a good representation of what you will run into at test time in production.
Amid all the excitement and noise around LLMs, you do not want to forget the basics of ML, such as splitting your data into train, validation and test sets. And never, ever look at the test set if you really want unbiased and reliable results.
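As a quick reminder of what that looks like in practice, here is a minimal sketch using scikit-learn's train_test_split (assuming scikit-learn is installed) on a toy list of labeled samples.

```python
# Split labeled samples into train / validation / test once, then leave the test set alone.
from sklearn.model_selection import train_test_split

samples = [{"input": f"example {i}", "label": i % 2} for i in range(100)]  # toy data

train, holdout = train_test_split(samples, test_size=0.3, random_state=42)
val, test = train_test_split(holdout, test_size=0.5, random_state=42)

print(len(train), len(val), len(test))  # 70 / 15 / 15
```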
The Long Winding Road to Optimal Performance
This is the fun part, and the phase with the most unknown unknowns. Prompt tuning is tricky, fun and frustrating - all at the same time. Be sure to try all the relevant LLM providers to see if you can reach your desired metrics. You can also try prompt-learning libraries such as DSPy or Prompt Learner (self plug), which let you define your task along with training and evaluation samples and then try to find the optimal prompt for you.
You should try different prompting techniques in this phase - Chain of Thought, Tree of Thought, and so on. Focus on adding relevant examples inside the prompt, with good rationales and explanations for them. Ask the LLM to "think step by step" before giving an answer. Consistently run evals with modified prompts, models, etc., and log your results along with the configurations that produced them.
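One lightweight, tool-agnostic way to keep track of those runs is to log every configuration together with its score, as in the sketch below; evaluate_config is a placeholder for your own evaluation harness.

```python
# Minimal experiment log: record each prompt/model configuration with its eval score.
# `evaluate_config` is a placeholder for your own evaluation harness.
import json
import time

def evaluate_config(prompt_template: str, model: str) -> float:
    # Replace with code that runs your eval set through the pipeline and returns an aggregate score.
    raise NotImplementedError

configs = [
    {"prompt_template": "Answer step by step:\n{question}", "model": "gpt-4o-mini"},
    {"prompt_template": "Answer concisely:\n{question}", "model": "gpt-4o-mini"},
]

with open("experiments.jsonl", "a") as log:
    for cfg in configs:
        score = evaluate_config(**cfg)
        log.write(json.dumps({"timestamp": time.time(), **cfg, "score": score}) + "\n")
```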
You can also look at other data elements that could help the Language Model arrive at the right answer, and inject them either directly or via a retrieval step. Maybe your use case needs some fine-tuning. There are a number of startups and open-source libraries in this space that can help, such as Predibase and OpenPipe, along with resources for fine-tuning open-source LLMs. Even OpenAI provides a relatively cheap fine-tuning platform. With a few hundred labeled samples, you might achieve the desired performance through a fine-tuned model.
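As one example, kicking off a fine-tuning job on OpenAI's platform is roughly this much code. Model names and SDK details change over time, so treat this as a sketch and check the current docs; train.jsonl is assumed to hold your labeled chat-format samples.

```python
# Rough sketch of launching an OpenAI fine-tuning job from a JSONL file of chat samples.
# Model names and SDK details change over time; verify against the current documentation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the training data (each line: {"messages": [{"role": ..., "content": ...}, ...]}).
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder base model
)
print(job.id, job.status)
```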
The Monitoring & Feedback Highway
Our system is about to go live. We probably want a phased rollout - try feature flagging through LaunchDarkly or Statsig to roll out this workflow gradually. Have concrete fallbacks for when the model cannot produce a valid prediction. Set up and continuously track dashboards that monitor both business and ML metrics. Spend some time looking at the live samples hitting your language model.
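The shape of that gating-plus-fallback logic is roughly the following. Here flags.is_enabled stands in for whatever feature-flag SDK you use, and call_llm and rule_based_answer are hypothetical functions for your pipeline and your fallback.

```python
# Sketch of a feature-flag gated LLM call with a deterministic fallback.
# `flags`, `call_llm` and `rule_based_answer` are hypothetical stand-ins for your
# feature-flag client, LLM pipeline and non-LLM fallback respectively.

def handle_request(user_id: str, query: str, flags, call_llm, rule_based_answer) -> str:
    if not flags.is_enabled("llm_workflow", user_id):
        return rule_based_answer(query)   # user not in the rollout yet
    try:
        answer = call_llm(query)
        if not answer:                    # model gave no usable prediction
            return rule_based_answer(query)
        return answer
    except Exception:
        return rule_based_answer(query)   # hard failure: fall back, never crash
```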
Furthermore, set up rigorous logging to enable detailed analysis of the system's performance in production. If the model affects or adds something user-facing, let users provide feedback on the quality of the generated result. Datadog, Weights & Biases, WhyLabs, Arthur and Deepchecks all provide continuous monitoring of deployed systems and can be extended to your LLM workflow.
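A minimal version of that logging might capture the prompt, the response, latency and any user feedback as structured records; the field names below are illustrative.

```python
# Minimal structured logging of each LLM interaction plus optional user feedback.
# Field names are illustrative; adapt them to your monitoring stack.
import json
import logging
import time

logger = logging.getLogger("llm_workflow")
logging.basicConfig(level=logging.INFO)

def log_interaction(prompt: str, response: str, latency_s: float, user_rating=None):
    record = {
        "ts": time.time(),
        "prompt": prompt,
        "response": response,
        "latency_s": round(latency_s, 3),
        "user_rating": user_rating,  # e.g. thumbs up/down mapped to 1/0, None if not given
    }
    logger.info(json.dumps(record))

log_interaction("Summarize this ticket...", "The customer reports...", 1.27, user_rating=1)
```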
It is also critical to enable high developer velocity in case a newer, more capable model comes out (say, GPT-6?). Our system should be resilient and benchmarked consistently to allow confidence in upgrading model endpoints. This includes ensuring that the output of the language model is parsed into a structured format through function calling, Pydantic models or both. This can be done in parallel while the system is running - it is a refactor of the existing pipeline.
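One way to enforce that structure is a Pydantic model that validates whatever the model returns; the schema below is just an example, and the raw JSON string stands in for your model's response.

```python
# Validate LLM output into a structured, typed object with Pydantic (v2-style API).
# The schema is an example; `raw_output` stands in for your model's response.
from pydantic import BaseModel, ValidationError

class TicketClassification(BaseModel):
    category: str
    confidence: float

raw_output = '{"category": "billing", "confidence": 0.92}'  # e.g. from function calling

try:
    parsed = TicketClassification.model_validate_json(raw_output)
    print(parsed.category, parsed.confidence)
except ValidationError as err:
    # Fall back or retry with a corrective prompt when the output does not match the schema.
    print("Invalid model output:", err)
```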
We will not stay on the highway forever. Data drift, new models, stale prompts and new use cases will bring us back to the driveway, the pavement or the long winding road. This is good, and this is how it is supposed to be - iterative and fast. The principles remain the same. Good luck and keep developing, step by step.
If you are interested in prompts - hacking away at them, engineering them, keeping up to date with the latest in the space - join us on our Discord. We are just getting started, so your support is extra special!
Prompts for Devs