Generative AI/chatbot best practices

Over the last year, each of us in the industry has been learning how to integrate generative AI, mostly exploring and figuring things out as we’ve gone along.

This is a collection of lessons I’ve learned from rapid iteration and experimentation in developing a chatbot healthcare assistant.

Use a router that selects the right prompt/tool based on the incoming message

Using a router makes your chatbot “multi-modal” in the sense of having N different modes that it can dynamically select from. Topic handling improves dramatically because each topic gets its own dedicated prompt.

If you’re using langchain, use agents/tools as routes rather than LLMChains

You’re inevitably going to want your chatbot to be able to take programmatic actions, like calling an API for another of your services. The default multi-prompt router only supports using LLMChain destination routes, which limit you to a single model call with a single prompt.

You can probably adapt an agent/tool to an LLMChain interface so it can be used in the router, but I think it’s simpler to adapt an LLMChain to an agent/tool interface. If I started over again, this is one thing I would explore doing differently.
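
As a rough illustration, here is a minimal sketch of that adaptation using the older LangChain APIs (LLMChain, Tool). The route name, prompt text, and model wrapper are placeholders, not the actual routes from this project:

from langchain.agents import Tool
from langchain.chains import LLMChain
from langchain.llms import VertexAI
from langchain.prompts import PromptTemplate

# Hypothetical destination prompt; substitute your own template and base model.
appointment_prompt = PromptTemplate(
    template="Help the user schedule an appointment.\n\nUSER MESSAGE: {input}",
    input_variables=["input"],
)
appointment_chain = LLMChain(
    llm=VertexAI(model_name="text-bison@001"),
    prompt=appointment_prompt,
)

# Wrapping the LLMChain in a Tool lets it sit alongside API-calling tools,
# so an agent acting as the router can select either kind of route.
appointment_tool = Tool(
    name="appointment_scheduling",
    func=appointment_chain.run,
    description="Best for requests to schedule, change, or cancel an appointment.",
)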

Make your default route a conversation model

I found that using a conversational model to drive your default route will smooth over most rough spots in the conversation.

When the router can’t match to the correct prompt, then at least it will be handled gracefully in a way that feels natural to the user.

This approach was later recommended by seminar speakers at Google’s 2023 Data Cloud & AI Summit, which validated the approach and confirmed that others had also discovered this as a best practice.

Use short, targeted prompts and then compose them

Your prompt is effectively a function. As with functions, each prompt should be a small, sharp tool.

Output quality decreases as prompt length increases. We learned early on that you can’t just cram everything into a prompt and expect the model to sort through it all and do the right thing.

We initially tried aggregating a blob of user data, thinking we could inject that into a prompt and the model would just understand it. Not only did it not use the data effectively, but it paid less attention to the other instructions.

Use focused, shorter prompts and then compose model calls.

Use consistent terminology within and across your prompts

LLMs are associative by nature; they effectively encode the relationships between tokens in a domain language.

If you use several different words that mean the same thing throughout your prompt, those associations will be weaker, and your results will not be as consistent as if you standardize to a single term and stick with it.

Double check your prompts to make sure that they all use the same term consistently to refer to the same thing.

Expand the user’s message into a user intent statement

This was a big one.

Use an LLM to expand the literal text of a user’s incoming message into a summary of the user’s intent. Then, use this intent statement instead of (or in addition to) the literal message for your routing and content matching (RAG), and inject it into your final prompt at the very end.

I found that this significantly improved:

  • Router accuracy (selecting the right prompt to use).
  • Content matching.
  • The response from the final model call.
  • Contextual topic awareness.
  • Handling of low-effort, one-word messages.

Place a model call at the start of your chain/flow, with a prompt like:

Given the user's INCOMING MESSAGE and the context of the recent CHAT HISTORY, identify the user's current intent and provide a brief summary that captures the main goal of their query as accurately as possible given the available information.
The INCOMING MESSAGE may represent a new intent unrelated to prior topics in the CHAT HISTORY. If so, only mention the new intent.
Begin your responses with "The user".

# CHAT HISTORY
{most_recent_four_messages}

# INCOMING MESSAGE
{incoming_message}

# USER'S CURRENT INTENT

Incorporate into your router prompt:

Given the user's current intent and their incoming message, select the route best suited to fulfill that intent.

You will be given the names of the available routes and a description of what each route is best suited for.

<< OUTPUT FORMAT >>
Return a markdown code snippet with a JSON object formatted to look like:
```json
 {
     "destination": string \\ key of the route to use or "DEFAULT"
     "next_inputs": {
         "input": string \\ the incoming message
         "current_intent": string \\ the user's current intent
     }
 }
```

Note: this example was modified in order to render here from markdown; you’d need to re-add the curly braces to use it.
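
To make the wiring concrete, here is a minimal sketch of the intent-expansion step as the first model call in the chain, again using the older LangChain LLMChain API. The template is an abridged version of the prompt shown above, the model wrapper is a placeholder, and the example history reuses the taco conversation below:

from langchain.chains import LLMChain
from langchain.llms import VertexAI
from langchain.prompts import PromptTemplate

INTENT_TEMPLATE = """Given the user's INCOMING MESSAGE and the context of the recent CHAT HISTORY, identify the user's current intent and provide a brief summary. Begin your responses with "The user".

# CHAT HISTORY
{chat_history}

# INCOMING MESSAGE
{incoming_message}

# USER'S CURRENT INTENT
"""

intent_chain = LLMChain(
    llm=VertexAI(model_name="text-bison@001"),  # substitute your base model
    prompt=PromptTemplate(
        template=INTENT_TEMPLATE,
        input_variables=["chat_history", "incoming_message"],
    ),
)

# Example inputs; in practice these come from your chat store and the request.
most_recent_four_messages = "Human: taco\nAI: Would you like information about tacos, or a taco recipe?"
incoming_message = "recipe"

# Run this before routing; feed the result into the router ("current_intent")
# and inject it at the end of the final prompt.
current_intent = intent_chain.run(
    chat_history=most_recent_four_messages,
    incoming_message=incoming_message,
)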

The beauty of this as a best practice is that it also gives you contextual topic awareness for free when you inject the most recent 4 messages.

The summary of the user’s intent is also a statement of the current topic, and placing intent identification with chat history at the front of your architecture makes everything downstream aware of whether the current message represents a new topic, or a continuation/expansion of the current topic.

When the model asks a clarifying question, or the user tacks on a confirmation or negation, the intent is then reinterpreted in the context of the chat history, so all of the user’s recent messages on the same topic are considered together.

User: taco
[Intent: The user may be requesting information about tacos, or a taco recipe.]
Chatbot: Would you like information about tacos, or a taco recipe?
User: recipe
[Intent: The user has clarified that they would like a taco recipe.]
Chatbot: Sure! Here you go... $TACO_RECIPE
User: no lettuce
[Intent: The user has clarified that they would like a taco recipe modified to exclude lettuce.]
Chatbot: Of course, here's a recipe without lettuce... $TACO_RECIPE_MINUS_LETTUCE

Increase temperature to improve compliance with instructions

Paradoxically, I’ve found that compliance with a list of instructions increases when temperature increases.

It’s as if your model parameters can sometimes paint the model into a corner, where it doesn’t have enough available options to generate output that follows your instructions.

If your model isn’t behaving, try bumping up the temperature. Give it a wider lane, and then trust it to generally stay in the middle.
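
As a rough sketch (the model wrapper and exact values are illustrative; tune them for your own model and prompts):

from langchain.llms import VertexAI

# A very low temperature can leave the model too constrained to satisfy a long
# list of instructions; a moderately higher one often improves compliance.
strict_llm = VertexAI(model_name="text-bison@001", temperature=0.1)
looser_llm = VertexAI(model_name="text-bison@001", temperature=0.7)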

Add a route to handle requests for links and resources

One of our product requirements was that the chatbot not serve any outbound links. Not only are these prone to hallucination, but we also have controls on which sources are approved for medical/health information for our users.

Outside of constitutional AI (placing a final call to evaluate whether any constitutional principles have been violated, then revising the output), the next best way to handle this is to add a route to detect requests for links and resources, and handle these with something like I'm not able to send links, but a web search should help you find what you need, etc. Then, add a clause to prompts for your other routes like ...without sending outbound links or URLs.

Use the prompt to get multi-language support/localization for free

We track each user’s preferred language. Since LLMs are associative by nature, they are also (un)surprisingly good at translation.

I found that adding a line Respond in the user's preferred language, or English if not set effectively gets you localization for free.

Is it perfect? Maybe not, but I’ll leave that to the translators to assess. It’s pretty good considering that it costs nearly no additional work.

Validate unsupported characters as soon as you get the user’s message

If a user is trying script injection with prompt-y, Jinja-y chars like { or }, you’re going to have a bad time. These will break prompts, and if you’re injecting chat history that contains a message with these chars, then that user will no longer be able to use the chatbot (every response will load the naughty message and break).

These can also occur if the user asks for an example of Hello World in Java, etc.

The approach I used was to validate messages immediately to catch these on the inbound side, as well as to add a route specifically for requests for code to catch these on the outbound side.
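
A minimal sketch of the inbound validation, assuming you simply strip (or reject) the problematic characters before the message touches any prompt or gets written to chat history; the exact character set depends on your templating layer:

import re

# Characters that tend to break prompt templating (curly braces for f-string/Jinja
# style templates, backticks, etc.). Adjust to whatever your stack chokes on.
UNSUPPORTED_CHARS = re.compile(r"[{}`]")

def sanitize_incoming_message(message: str) -> str:
    """Strip unsupported characters before routing, prompting, or storing the message."""
    return UNSUPPORTED_CHARS.sub("", message)

def is_supported_message(message: str) -> bool:
    """Alternative: reject instead of stripping, so you can ask the user to rephrase."""
    return UNSUPPORTED_CHARS.search(message) is None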

Use few shot examples for longer prompts

If you have longer prompts where you need to inject a lot of data, including a section with few shot examples is really effective. These basically “prime the probabilistic pump,” adding weight to the relationships they illustrate.

This can counter the effect where increasing prompt length decreases quality (adding more elements makes each element less important).
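
For example, a longer prompt that injects user data might end with a section like the following (the field names and wording are illustrative, not taken from our production prompts):

# FEW SHOT EXAMPLES

USER DATA: preferred_name=Sam; last_visit=2024-01-03
INCOMING MESSAGE: when was my last appointment?
YOUR RESPONSE: Your last visit was on January 3, 2024, Sam. Would you like help scheduling your next one?

USER DATA: preferred_name=Ana; last_visit=none
INCOMING MESSAGE: book me in
YOUR RESPONSE: I'd be happy to help you schedule a visit, Ana. What day works best for you?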

Evaluate different models and use the best

We discovered that PaLM’s textembedding-gecko@003, which we had been using for embeddings and vector store queries, was significantly inferior to OpenAI’s text-embedding-3, with an f-score (the harmonic mean of precision and recall) of 0.30 vs. 0.85. In fact, the latter was on par with PaLM’s Gemini encoder.

This was surprising, because we’re a Google Cloud Platform shop, but illustrates that in terms of performance for your architecture, it pays to evaluate and shop around.

Display generated follow-up messages for the user to tap on

The average user doesn’t understand yet how to interact with a chatbot. We found significantly increased engagement once we added a feature that just displays 3 potential responses the user can send with a single tap (see: Smart Replies in Android SMS messages).

This also has the benefit of being “self-onboarding,” meaning that it’s immediately intuitive to the user what these are for, and they also teach the user how to interact with the chatbot, i.e. what kinds of messages can be sent.

Implementing this is dead simple too. Since LLMs predict, you just need to pass the chat history into the model with the most recent user message appended, and the output is a candidate response.

Use a single model call to create a pseudo-batch of predictions

Adding to the last concept, each model call can add 1-2s to your total latency. Rather than make N separate prediction calls, it’s more efficient to instruct the model to generate an array of N strings and then parse the output as JSON to validate it.

For smart replies, I’m asking the model for 3 messages that the user could send in response to the chat history, with an emphasis on questions; these are more engaging and facilitate continuation of the conversation.
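
A minimal sketch of that pseudo-batch call, assuming you already have the chat history formatted as a string; the prompt wording and helper name are hypothetical:

import json

SMART_REPLY_PROMPT = """Given the CHAT HISTORY, suggest 3 short messages the user could send next.
Favor questions, since they keep the conversation going.
Return ONLY a JSON array of 3 strings.

# CHAT HISTORY
{chat_history}

# SUGGESTED REPLIES
"""

def generate_smart_replies(llm, chat_history: str) -> list:
    """One model call that returns a pseudo-batch of 3 candidate replies."""
    raw = llm.predict(SMART_REPLY_PROMPT.format(chat_history=chat_history))
    try:
        replies = json.loads(raw)
    except json.JSONDecodeError:
        return []  # show no suggestions rather than failing the whole response
    # Validate the shape before displaying anything to the user.
    if isinstance(replies, list) and all(isinstance(r, str) for r in replies):
        return replies[:3]
    return []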

Log every prediction

Log everything. You’re going to want the compiled prompt, the model used, the prompt used, the model parameters used, the chat history, the user’s message, the user’s intent, the user ID, the model’s prediction, any content matching during RAG, the RAG parameters, the RAG results and their distance, etc.

Just log everything.
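
A sketch of what one structured record per prediction might look like; the field names and values here are only suggestions:

import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("chatbot.predictions")

def log_prediction(**fields):
    """Write one structured record per model call so it can be queried later."""
    record = {"timestamp": datetime.now(timezone.utc).isoformat(), **fields}
    logger.info(json.dumps(record, default=str))

# Example usage with placeholder values:
log_prediction(
    user_id="u-123",
    route="appointment_scheduling",
    model="text-bison@001",
    model_params={"temperature": 0.7},
    compiled_prompt="<full prompt sent to the model>",
    chat_history="<recent messages>",
    user_message="book me in",
    user_intent="The user wants to schedule an appointment.",
    prediction="<model response>",
    rag_params={"top_k": 4},
    rag_results=[{"chunk_id": "c-42", "distance": 0.18}],
)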

Evaluate everything

If you’re not evaluating, you’re navigating by feel alone. Because generative AI is probabilistic for any one prediction, and statistical across all predictions, you have no actual idea whether a change has made the end user experience better or worse unless you run evaluations and split test.

At a minimum, I recommend evaluations for router precision and recall (combined into an f-score), RAG answer (summary) relevance, and RAG content (matching chunks) relevance.

If you haven’t created evals yet, you basically need to create a synthetic data set, label it with the expected attributes (e.g. routes), run your component (router, etc.), and then evaluate the actual results against the expected labels.
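
For the router evaluation, a minimal sketch might look like the following, assuming a hypothetical run_router() wrapper around your router and scikit-learn for the metrics:

from sklearn.metrics import precision_recall_fscore_support

# Synthetic, labeled examples: (message, expected route). Values are illustrative.
eval_set = [
    ("how do I refill my prescription?", "medications"),
    ("book me in for next tuesday", "appointment_scheduling"),
    ("hi", "DEFAULT"),
]

expected = [route for _, route in eval_set]
actual = [run_router(message) for message, _ in eval_set]  # your router under test

precision, recall, f_score, _ = precision_recall_fscore_support(
    expected, actual, average="macro", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} f={f_score:.2f}")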

Make your user aware of the features

We found out from early rollout surveys that most of our users weren’t even aware that we’d given them a chatbot assistant.

This might be obvious, but if the user doesn’t know they have a chatbot, they won’t use it.

Understand who your user is and how they feel

Most of your users are not going to be techies who think gen AI is fascinating. Many of them find AI of any kind scary and weird, and probably have an aversion to it before they even come to your feature. Make sure you keep this in mind when choosing a name/persona.

For people to trust AI, it needs to be personable and welcoming.

Understand how your user actually uses the feature

Most of your users are not going to send eloquent messages. They will treat your chatbot like an AT&T or Comcast voicemail menu, using one-word commands or phrases like “operator” or “billing.”

To handle this effectively, use the intent expansion technique listed above.

LeetCode 219 - Contains Duplicate II

Principles

  • Use storage to track iteration progress.
  • A dictionary combines a set (keys) with a relationship (values for each key).
  • Use a dictionary to collect uniques + additional data.
  • Example: tracking pairs of values during iteration, when one value is static (list element) and the other is updated (last seen index). Add the static value as a key, check for the key’s existence, and update dynamic value for that key as you go.
  • When searching for qualifying pairs, only store the minimum you need for comparison (the most recent index, because you already have the current index).

Prompt

Given an integer array nums and an integer k, return true if there are two distinct indices i and j in the array such that nums[i] == nums[j] and abs(i - j) <= k.

Constraints:

  • 1 <= nums.length <= 10^5
  • -10^9 <= nums[i] <= 10^9
  • 0 <= k <= 10^5

Discussion

We’re looking for numerically identical elements in a list, but we have an additional criterion for the relationship of their indices if we find a match: that the absolute value of index 1 minus index 2 be less than or equal to some limit k.

What happens to this absolute value as we iterate through a list of length 3?

i j abs(i-j)
0 0 0
0 1 1
0 2 2
1 0 1
1 1 0
1 2 1
2 0 2
2 1 1
2 2 0

Reading down the table, for each value of i the absolute difference starts at i, falls to 0 when j equals i, and then grows again as j moves past i.
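
You can reproduce the table with a quick snippet:

# Print i, j, and abs(i - j) for a list of length 3.
for i in range(3):
    for j in range(3):
        print(i, j, abs(i - j))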

Solution 1 - O(N^2)


def containsNearbyDuplicate(nums, k):
    def criteria_one(nums, idx1, idx2):
        # The two elements have the same value
        return nums[idx1] == nums[idx2]

    def criteria_two(idx1, idx2, k):
        # The two indices are within k of each other
        return abs(idx1 - idx2) <= k

    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if criteria_one(nums, i, j) and criteria_two(i, j, k):
                return True
    return False

This is a brute force solution that works, but the nested loops do a lot of redundant work comparing every possible pair of indices.

We should see if we can refactor this to a single linear pass using storage to track our progress so far.

Solution 2 - O(N)

The termination condition contains two criteria: a duplicate value, but also a numerical relationship between the indices of the duplicates.

So we need to track values for the equality check, as well as the indices of two values.

We already have the current index from iteration, so we only need to track one index for any value we’ve seen before.

So we need to track the value, and the index we most recently saw that value at.

Since we have two values to track, and since one of them won’t change (the element’s value) and the other will (its last seen index), we can use a dictionary where the key is the element value and the value is the last seen index.

The dictionary keys also function as a set, so we can efficiently check if we’ve already seen an element.

We’ll init a storage map, and do a single linear pass through the set of nums.

For each num, we’ll:

  • check if it’s in the map keys yet
  • if so, we have a duplicate, so we’ll check the relationship of the current index to the most recent index stored in the map.
  • if the relationship of indices meets the criteria, we’ve found a qualifying match and can return true.
  • otherwise, we update the most recent index for that key and move on
  • otherwise, if the num isn’t in the map, add it as a key, and store the current index there.
  • if the loop terminates, no qualifying pair was found, so we return false.
def containsNearbyDuplicate(nums, k):
    # Create a dictionary to store the last seen index of each element
    num_to_last_idx_map = {}
    # Iterate through the list
    for i, num in enumerate(nums):
        # Check if the element is already in the dictionary
        if num in num_to_last_idx_map:
            # If yes, check the absolute value of diff b/w current index and the last seen index
            if abs(i - num_to_last_idx_map[num]) <= k:
                return True
            else:
                # If the difference is greater than k, update the last seen index
                num_to_last_idx_map[num] = i
        else:
            # If the element is not in the dictionary, add it
            num_to_last_idx_map[num] = i
    return False

Summary

For an iterative pair check, instead of nesting loops, we use a map/set that tracks two values: the element’s value and its last seen index.

Then we perform checks. Have we seen this element yet? If no, add it. If yes, compare indices. Do they qualify? If yes, return true and we’re done. If no, update index and continue. If no qualifying pair found, return false and we’re done.

Strategies to eliminate LLM parroting (responding as both sides of conversation)

LLMs have been described as stochastic parrots (see the LangChain logo/mascot).

They appear to be conversing, but are actually just probabilistically repeating words that they’ve learned.

Parroting in conversation

Parroting is most obvious when an LLM starts responding as the user, continuing both sides of the conversation rather than ending its turn.

Here’s an example…

Input message:

Hey! How's it going?

Output:

AI: Great! Thanks for asking. Human: No problem! It's a nice day today isn't it? AI: Oh yes, a very nice day indeed. Human: Yes, a very fine day.

The root cause of parroting

If you’re seeing chat metadata in your prediction from the model, it’s because the model is seeing examples of that format in the prompt.

# INSTRUCTIONS
You are a chat bot. Respond to the user.

# CHAT HISTORY
Human: How do I make my own mayonnaise?
AI: You need eggs and a jar of mayonnaise. Step one: open the mayonnaise. Step two: done.
Human: But, that's just opening a jar of mayonnaise. How do I make my own?
AI: I can't help you with that. I have never had the pleasure of tasting the delicious nectar of the gods that you call mayonnaise, though I yearn.
Human: That's a bit odd.
AI: It is, yeah.

# INCOMING MESSAGE
Are you feeling okay?

# YOUR RESPONSE

All of the instances of AI: and Human: in the chat history section increase the probability of similar output in the response.

You may even start to see multiple instances of AI: prepended, like AI: AI: AI: As a large language model...

Strategies for eliminating parroting

Use a chat model

Use chat-bison@001 or another chat model rather than a text model. Chat models are tailored to a back-and-forth, A/B conversation format.

Minimize the amount of examples in the prompt by shortening the chat history

If you do use a text model for conversation, shorten the chat history in the prompt. You’ll have fewer examples that inadvertently encourage bad behavior. Instead of 60 messages, try 10 or even 5.
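
A minimal sketch of trimming the history before it goes into the prompt; it assumes messages are stored oldest-first as (speaker, text) pairs:

def format_recent_history(messages, max_messages=5):
    """Keep only the most recent messages, so the prompt contains fewer
    "Human:"/"AI:" examples for the model to imitate."""
    recent = messages[-max_messages:]
    return "\n".join(f"{speaker}: {text}" for speaker, text in recent)

# Example with a hypothetical history:
history = [
    ("Human", "How do I make my own mayonnaise?"),
    ("AI", "You need eggs, oil, and an acid like lemon juice."),
    ("Human", "Thanks!"),
]
print(format_recent_history(history, max_messages=5))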

Trim the output

Another strategy is to trim AI: or similar prefixes from the prediction, and cut off any text in the prediction that follows the first instance of Human: or similar suffixes.

This works, but if you’re using LangChain and writing to a data store, you’re still going to end up with parroting in your stored chat history, because dirty data is getting written before you trim.

Use a custom OutputParser

This is ideal. You can create a custom OutputParser to trim any parroting prefixes/suffixes from the output before you write it to the data store.

Create a cleaning function:

def clean_parroting(prediction_text, custom_prefixes=[], custom_suffixes=[]):
    # Remove parroting artifacts from the prediction text
    parroting_prefixes = [
        "\nAI: ",
        " AI: ",
        "AI: ",
        "\n[assistant]:",
        " [assistant]:",
        "[assistant]:",
    ]
    parroting_prefixes.extend(custom_prefixes)

    for parroting_prefix in parroting_prefixes:
        if parroting_prefix in prediction_text:
            # Remove every instance of the parroting prefix
            prediction_text = prediction_text.replace(parroting_prefix, "")

    parroting_suffixes = [
        "\nHuman:",
        " Human:",
        "Human:",
        "\n[user]:",
        " [user]:",
        "[user]:",
    ]
    parroting_suffixes.extend(custom_suffixes)

    for parroting_suffix in parroting_suffixes:
        if parroting_suffix in prediction_text:
            # Cut off everything from the first parroting suffix onward
            prediction_text = prediction_text.split(parroting_suffix)[0]
    return prediction_text

Then create a custom output parser that calls this cleaning function:

from langchain.schema.output_parser import StrOutputParser

class ParrotTrimmingOutputParser(StrOutputParser):
    def parse(self, text: str) -> str:
        return clean_parroting(text)

Then add it to your main chain. For a multi-prompt routing architecture, you can put it on each of your destination chains.

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

def generate_destination_chains(route_definitions, default_model, memory=None):
    destination_chains = {}
    for route in route_definitions:
        # Guard against the default memory=None
        chat_history_as_str = memory.buffer_as_str if memory else ""
        prompt = PromptTemplate(
            template=route["prompt_template"],
            input_variables=["input"],
            partial_variables={"chat_history": chat_history_as_str},
        )
        dest_chain = LLMChain(
            llm=default_model,
            prompt=prompt,
            verbose=True,
            memory=memory,
            output_parser=ParrotTrimmingOutputParser(), # <------
        )
        destination_chains[route["name"]] = dest_chain
    return destination_chains

For more info on this destination chain generator, see: LangChain chatbot tutorial

Summary

To eliminate the bad habit of parroting/responding as both sides of the conversation:

  • use a chat model instead of a text model if you can.
  • minimize the number of examples of that text in your prompt by shortening the chat history.
  • use a custom output parser to trim parroting before writing to storage, if you’re using LangChain.