A Developer’s Starting Point for Integrating with LLMs

Here’s what I know so far, from a lay-developer’s perspective (no AI or ML experience).

So you’re a coder, and you’ve been asked to integrate with ChatGPT. What do you need to know?

ChatGPT is an example of a Large Language Model (LLM). An LLM is a kind of machine learning model (call it AI if you want to) that is really good at responding to language with more language. You give it some input (a prompt) and it gives you some output (a response).

LLMs are nondeterministic. They’re not well-behaved by software standards. But they can do stuff that code can’t, like respond usefully to vague instructions. We have some learning to do in order to integrate them into otherwise-deterministic software.

What should I send it?

You’ll call the LLM via API. Give it a prompt and a few parameters, and get a response. Usually you build up a prompt consisting of instructions, relevant information, and user input.

prompt = "You are a helpful AI assistant who can answer questions based on provided documents. Based on the following information, answer the question delimited by backticks.\n\n" + relevantDocuments.join("\n") + "\n\n" + backtick + userQuestion + backtick;

Instructions: Getting the LLM to respond the way you want is prompt engineering. It’s a thing.

Relevant information: Maybe you send the LLM the same set of data each time, or maybe you include different information depending on who is logged in. For instance, in Honeycomb’s Query Assistant, we always send it some examples of the kind of output we want. If a customer has customized their example queries in Honeycomb, we send those.

There’s a limit to how much you can send. The LLM can only handle so many tokens at once. (I hear that’s limited by memory in the hardware, which is why NVIDIA stock is super high right now: they have graphics cards that’ll expand this limit.) These tokens are like words to the LLM. But they’re different from our words; to us, they’re a mysterious quantity determined by a tokenizer. You can’t really predict how many tokens a chunk of text will be; you can only run the tokenizer on it and count.

The LLM’s token limit includes input and output, so you can’t fill it up completely. How many tokens do you need for output? Unknown. You’ll have to try it a bunch of times, run a count of tokens on each output… and then guess.
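Here’s a minimal sketch of that budgeting. It is not a real tokenizer: the 4-characters-per-token estimate is a rough rule of thumb I’m assuming for English text, the limit of 4096 and the 500 tokens reserved for output are made-up numbers, and `rough_token_count` and `pick_documents` are hypothetical helpers. Pass your model’s actual tokenizer as `count` for real numbers.

```python
import math

CONTEXT_LIMIT = 4096       # assumed limit; check your model's documentation
RESERVED_FOR_OUTPUT = 500  # a guess, to be refined by measuring real outputs


def rough_token_count(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return math.ceil(len(text) / 4)


def pick_documents(instructions, documents, question,
                   limit=CONTEXT_LIMIT, reserved=RESERVED_FOR_OUTPUT,
                   count=rough_token_count):
    """Keep documents, in order, until the input budget is used up."""
    budget = limit - reserved - count(instructions) - count(question)
    chosen = []
    for doc in documents:
        cost = count(doc)
        if cost > budget:
            break
        chosen.append(doc)
        budget -= cost
    return chosen
```

Whatever is left out here is exactly the material you’d want a smarter selection step for.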

When you have more relevant information than is gonna fit into the prompt, that’s where vector databases come in (more on that later).

User input: finally, delimit whatever you got from the user (to make prompt injection harder) and include it. The LLM will figure out a likely response.
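Putting the three pieces together might look something like this. It’s a sketch, not a recipe: `build_prompt` is a name I made up, and stripping backticks from the user’s text is just one simple way to keep them from closing the delimiter early. It lowers the risk of prompt injection; it doesn’t eliminate it.

```python
INSTRUCTIONS = (
    "You are a helpful AI assistant who can answer questions based on "
    "provided documents. Based on the following information, answer the "
    "question delimited by backticks."
)


def build_prompt(relevant_documents, user_question):
    """Assemble instructions, documents, and the delimited user question."""
    # Remove backticks so the user's text can't close the delimiter early.
    safe_question = user_question.replace("`", "")
    return "\n\n".join([
        INSTRUCTIONS,
        "\n".join(relevant_documents),
        "`" + safe_question + "`",
    ])
```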

What do I get back?

Literally, a likely response. The LLM’s job is to determine, out of all the tokens it knows, “what is likely to come next?” Then pick something likely, then repeat for the next token. The result is well-spelled, grammatically correct, with sophisticated language structure. Whether it’s correct… that’s your problem.

You probably also get some information that you can use to determine how much this request cost you. This is per-token, including output, which you couldn’t predict.
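As a rough illustration of the arithmetic, assuming the response reports separate input and output token counts: the field names `prompt_tokens` and `completion_tokens` and both prices below are placeholders, not anyone’s real pricing.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counts reported back with a response (field names assumed)."""
    prompt_tokens: int
    completion_tokens: int


# Placeholder prices in dollars per 1,000 tokens; not real pricing.
PRICE_PER_1K_INPUT = 0.001
PRICE_PER_1K_OUTPUT = 0.002


def request_cost(usage: Usage) -> float:
    """Cost of one request, charging input and output tokens separately."""
    return (usage.prompt_tokens / 1000 * PRICE_PER_1K_INPUT
            + usage.completion_tokens / 1000 * PRICE_PER_1K_OUTPUT)
```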

How do I test it?

This is the big question. LLMs are nondeterministic. This is a feature, not a bug. You can’t unit test them and get consistent results.

You can make them more deterministic with an input parameter called temperature. If you set temperature to 0, you’ll (theoretically) get back the most likely token at each point, instead of randomly one of the likelier ones. Higher temperature, less predictable. But even with temperature set to 0, there’s no guarantee you’ll get the same result. LLMs get updated; they change in ways you don’t control. Any automated test that checks the output is a flaky test.
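To get a feel for what temperature does, here’s a toy sampler. It is my own illustration of the usual idea (divide the scores by the temperature, then turn them into probabilities), not how any particular model is implemented. The tokens, scores, and the `pick_token` name are all invented.

```python
import math
import random


def pick_token(scores, temperature, rng=random):
    """Pick a token from {token: score}; a higher score means more likely."""
    if temperature == 0:
        # Greedy choice: always the single most likely token.
        return max(scores, key=scores.get)
    # Divide scores by the temperature, then softmax into weights.
    top = max(scores.values())
    weights = {tok: math.exp((s - top) / temperature)
               for tok, s in scores.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens])[0]
```

At temperature 0 this always returns the top-scoring token; as the temperature goes up, the weights flatten out and the less likely tokens get picked more often.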

What you can do is: play around until it seems to be working. Then push it out with the best observability you can manage. Use that to check on what happens in production. Even if it works at first, its effectiveness could change at any time!

For playing around, I recommend building yourself some good scaffolding. People usually start out in a Python notebook. That’s fine for a first taste, but it won’t record the results of past experiments for you. I recommend making your own harness for the code around your LLM call, so you can tweak your prompt; run with multiple user inputs; and evaluate the results for what you’re looking to improve, all in one step. Enable tracing while you’re doing this, and you’ll have records of each of those tests to share with other people.
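A harness like that doesn’t have to be fancy. Here’s a minimal sketch that assumes you pass in your own `call_llm` function (prompt in, text out) and your own `evaluate` function. Both are stand-ins, as is the JSON-lines log, which is just one easy way to keep a record of every run.

```python
import json
import time


def run_experiment(call_llm, prompt_templates, user_inputs, evaluate, log_path):
    """Try every template with every input; append one JSON line per run."""
    records = []
    with open(log_path, "a", encoding="utf-8") as log:
        for name, template in prompt_templates.items():
            for user_input in user_inputs:
                prompt = template.format(user_input=user_input)
                response = call_llm(prompt)
                record = {
                    "timestamp": time.time(),
                    "template": name,
                    "user_input": user_input,
                    "prompt": prompt,
                    "response": response,
                    "score": evaluate(user_input, response),
                }
                log.write(json.dumps(record) + "\n")
                records.append(record)
    return records
```

Each template is an ordinary Python format string containing `{user_input}`, so tweaking a prompt means editing one dictionary entry and running everything again.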

LLMs work best in combination

Like us, LLMs do their best work in conversation. Whatever they return, a person is the arbiter of whether it’s useful or correct. If you’re returning the results to a user, ask them whether the response was useful, and record that in your traces. With LLMs, testing in production is not optional. (Looking at the results is technically optional, if you don’t care.)

Your code is part of the conversation, too. You can check the output of the LLM, fail if it’s hopeless, make corrections if it’s close. For instance, in Honeycomb’s Query Assistant, we noticed (in production) that ChatGPT liked to suggest a query that overspecified the time range, so we added code to fix that before running the query.
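To make “fail if it’s hopeless, make corrections if it’s close” concrete, here’s a sketch of that kind of check. It is not Honeycomb’s actual code: the JSON shape, the field names (`calculations`, `start_time`, `end_time`, `time_range`), and the default value are all made up for illustration.

```python
import json

DEFAULT_TIME_RANGE_SECONDS = 7200  # hypothetical default


def clean_query(llm_output: str) -> dict:
    """Parse the LLM's suggested query; fail if hopeless, fix if close."""
    try:
        query = json.loads(llm_output)
    except json.JSONDecodeError as err:
        raise ValueError("LLM did not return valid JSON") from err
    if not isinstance(query, dict) or "calculations" not in query:
        raise ValueError("LLM output is not a usable query")
    # Correction: drop the absolute timestamps the model over-specified
    # and fall back to a relative time range instead.
    query.pop("start_time", None)
    query.pop("end_time", None)
    query.setdefault("time_range", DEFAULT_TIME_RANGE_SECONDS)
    return query
```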

LLMs plus math

Often people hook up LLMs as part of a sequence of operations. LangChain does this. It puts an LLM in series with some other tools. For instance, LLMs can’t do math; they just spout plausible answers. They can write code, because they’ve read so much of it in their training. By themselves, they can’t use it. But configure a LangChain agent with both an LLM and a Python interpreter, and it can answer word problems. First ask the LLM for a plan to solve the problem, given a Python interpreter; then when the LLM returns code, run it; then provide the answer to the LLM so it can structure the final response.
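The shape of that loop, stripped way down, looks something like this. To be clear about what I’ve swapped in: this does not use LangChain’s API, and instead of a full Python interpreter it only evaluates plain arithmetic, because running whatever code an LLM hands you is risky. `call_llm` is a stand-in for your own function that takes a prompt and returns text.

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}


def evaluate_arithmetic(expression: str):
    """Evaluate +, -, *, / on numbers only, instead of running arbitrary code."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)


def answer_word_problem(call_llm, problem: str) -> dict:
    """Plan with the LLM, compute with code, then let the LLM phrase the answer."""
    expression = call_llm(
        "Write one arithmetic expression that solves this problem. "
        "Reply with the expression only.\n" + problem)
    result = evaluate_arithmetic(expression.strip())
    final = call_llm(f"Problem: {problem}\nComputed result: {result}\n"
                     "State the answer in one sentence.")
    # Return every step so the whole conversation can be recorded.
    return {"problem": problem, "expression": expression,
            "result": result, "answer": final}
```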

When you do something like this in production, be sure to trace the entire conversation, including all inputs and outputs.

Finding the right input

Then there’s the problem of “How do I ask an LLM to answer questions about my documentation when it only accepts a few pages of input?” This is where vector databases come in.

Some chunks of your documentation are more relevant to the user’s question than others. Vector databases help you find the most likely chunks. Send those to the LLM, and hope that’s enough to answer the question.

To put your documentation into the vector database, you use an embedding. This is another mysterious ML model. Given a text input, it outputs a bunch of numbers (a vector, a position in many-dimensional space) that quantifies what that text is about. Send each paragraph of your documentation to the embeddings API, then store the resulting vector, with a reference to the original text.

Then, when the user’s question comes in, ask the embeddings API “what is this about?” Get another vector. Ask the vector database for documents near the user’s question, in its many-dimensional space. Choose the closest however-many, and you have the relevant information to include in your prompt.
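Here’s the whole flow in miniature. Everything in it is a stand-in: `toy_embedding` just counts words (a real embedding model captures meaning much better), and `TinyVectorStore` is a plain list rather than a vector database. The mechanics are the same idea, though: embed each chunk, embed the question, and return the chunks whose vectors are closest (measured here by cosine similarity).

```python
import math
from collections import Counter


def toy_embedding(text: str) -> Counter:
    """Stand-in for an embeddings API: a sparse vector of word counts."""
    return Counter(word.strip(".,?!").lower() for word in text.split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Similarity of two sparse vectors; 0.0 if either one is empty."""
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self, embed=toy_embedding):
        self.embed = embed
        self.entries = []  # (vector, original text) pairs

    def add(self, text: str):
        self.entries.append((self.embed(text), text))

    def nearest(self, question: str, k: int = 2):
        """Return the k stored chunks closest to the question."""
        q = self.embed(question)
        ranked = sorted(self.entries,
                        key=lambda entry: cosine_similarity(q, entry[0]),
                        reverse=True)
        return [text for _, text in ranked[:k]]
```

Call `add` once per paragraph of documentation, then `nearest(user_question)` when a question arrives, and put what comes back into the prompt.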

This is hard.

Writing deterministic code to call a nondeterministic API is… not possible, to the standards we have always held our code to in the past. We want our code to work, we want it to be fully tested, we want it reliable. We want it to be done, to keep doing the thing in production without us looking at it again.

Code is never really “done” if it interacts with a changing world, but it used to be close enough. That’s not the case with LLMs. Models upgrade, probabilities shift, and new questions get asked.

Our code can only be a process, and we are part of that process. As a developer, I need to step into nondeterminism. Iterate even more. Observe what’s happening, and work on it. LLMs are incredibly powerful. They’re the most interesting new development since open source software, maybe since the internet. I will learn to work differently.

Credit to people who explain these things to me: Eric Evans, Phillip Carter, David MacIver, Saul Robinson.