Tesla’s FSD v12 Information from Tesla Engineers

I’m a huge Tesla FSD nerd. The software fascinates me.

I got to try FSD v12.3.2.1 for the first time as a passenger yesterday, and it feels completely different in person than it does watching YouTube videos about it. It is very smooth, very human-like, and very boring. If you didn’t know it was on, you typically wouldn’t even notice as a passenger.

I just got a chance to speak with some of the engineers at the Sunnyvale showroom today. I spoke with Viraj (reliability engineering) and Jay (ADAS engineering) and they answered my questions.

My takeaways:

Aside from the model’s weights, there is no local configuration or adjustment on the vehicle (or so they claim)

FSD Beta testers have commonly reported that a particular release “sucks” or is worse than the prior release, and then after about a week they report that it “suddenly got better.”

This only happens to some of the beta testers, and the others are like “it’s great for me, I don’t know what you’re talking about.” This also occurs within a single geographical region, so it’s not really explainable by over/underrepresentation of a location in the training data

I suspected that each dot release had multiple blind A/B versions competing for fewest interventions, that after a week they switch everyone over to the better blind version, and that this is the reason it initially sucks for some testers and is amazing for others, then becomes amazing for the first group as well.

Jay told me that it’s entirely due to whatever is in the environment (input data). Basically, the perception of a change is due to latent variables + the probabilistic nature of models + the pattern-recognition tendencies of humans. In other words, there is no configuration local to the vehicle that is adjusted in any way.

However, I’m not sure I buy this. I swear I noticed a few poker tells when he understood what I was asking, so I still believe my A/B hypothesis is correct. A single event noticed by one FSD Beta tester could be coincidence. But a trend of testers reporting the same thing, within the same time frame from the initial release download?

They are automatically labeling the categories of human interventions and using the label frequency to prioritize what they need to train for next.

We kind of already knew this, but it was cool to hear confirmation. The training engine is basically automatically choosing what chunks of the problem to eat next, and they use a priority queue to maintain the order of intervention categories to tackle next.

The remaining chunks to tackle are “last-mile.”

Things like pickup (smart summon), dropoff (banish), parking (i.e. pulling into driveways), backing up, etc. This also includes the random things that other humans do which can’t be predicted in advance, which is the real tail of the march of nines.

The push for increasing the take rate of FSD is about ramping up the data collection.

More cars running FSD means more real-world driving data, which means more examples of edge cases, which means accelerating improvements.

Elon and Ashok mentioned that it took around 1m examples before the model started performing really well. It wasn’t clear to me whether that was 1m general driving examples in the training set, or whether that was per-case. I suspect it was the former.

With 6m Teslas on the road, and an average of 14,000 miles driven per year per car, napkin math gives us roughly 84B miles annually from which to pull footage to add to the training set, scaling linearly as Tesla deliveries scale.
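
As a quick check on the napkin math, here is the multiplication of the two figures quoted above. Both inputs are this post's round numbers, not official fleet statistics, and the variable names are just for illustration.

```python
# Figures quoted in the post; both are rough estimates.
fleet_size = 6_000_000            # Teslas on the road
miles_per_car_per_year = 14_000   # average annual mileage per car

annual_fleet_miles = fleet_size * miles_per_car_per_year
print(f"{annual_fleet_miles / 1e9:.0f}B miles per year")  # prints "84B miles per year"
```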

Any real-world collision or safety event immediately jumps to the top of the priority queue.

When there’s any accident in a car, it’s immediately prioritized as the next thing that needs to be solved. This makes sense, because it’s effectively identifying a critical “bug” that requires a bug fix.
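
To make the idea concrete, here is a minimal sketch of how a frequency-ordered queue with a safety override could behave. It is purely illustrative: the category names, the `safety_categories` argument, and the function itself are hypothetical, and nothing here reflects Tesla's actual pipeline.

```python
import heapq
from collections import Counter

def build_training_queue(intervention_labels, safety_categories):
    """Order categories for training: safety events first, then by label frequency."""
    counts = Counter(intervention_labels)
    heap = []
    for category, count in counts.items():
        is_safety = category in safety_categories
        # heapq is a min-heap: False sorts before True, so safety events come
        # first, and negating the count puts the most frequent category next.
        heapq.heappush(heap, (not is_safety, -count, category))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

# Hypothetical auto-generated labels for a batch of interventions.
labels = [
    "unprotected_left", "parking", "unprotected_left",
    "collision", "parking", "unprotected_left",
]
print(build_training_queue(labels, safety_categories={"collision"}))
# ['collision', 'unprotected_left', 'parking']
```

Even though the collision appears only once, it is ordered ahead of the far more frequent categories.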

Tesla has their own internal test drivers deployed across the country to collect data.

This is to avoid data overfit to California. We saw this with Chuck Cook’s turn, when he noticed Tesla employees in vehicles repeating the turn over and over. I imagine this is also the case for any other country where FSD is available or planned.

It requires nearly no effort to adapt FSD for a new Tesla vehicle like Cybertruck.

All they need to do is calibrate the cameras, and then programmatically adjust based on the dimensions of the new vehicle.

It requires nearly no effort (from Tesla) to license FSD to other OEMs.

The only obstacle is on the OEM side. If Ford wants to license FSD, they need to design and build a car with computer-controllable brakes, acceleration, etc. The legacy OEMs are blocked by tight coupling with their Tier 3 and Tier 2 part suppliers. They would basically need to design a new car for it to be able to run FSD.

They are 100% confident that “Operation Vacation” is in effect.

I asked them whether they felt they’d achieved a flywheel, and that it was now merely a matter of feeding data into the training engine. There was absolutely no doubt in their minds.

I forgot to ask about interventions-per-mile and miles-per-intervention, but it’s just as well, since I doubt they would divulge any of that to me. However, I’m certain they’re looking at giant dashboards in the office that display these metrics over time, and I believe their confidence is data-driven. They’ve got to be pretty optimistic after the v12 refactor to an imitation model.

They are 100% confident that level 5 autonomy is a tractable (solvable) engineering problem.

Again, there was absolutely no doubt in their minds.

Generative AI/chatbot best practices

Over the last year, each of us in the industry has been learning how to integrate generative AI together, mostly exploring and figuring things out as we’ve gone along.

This is a collection of lessons I’ve learned from rapid iteration and experimentation in developing a chatbot healthcare assistant.

Use a router that selects the right prompt/tool based on the incoming message

Using a router makes your chatbot “multi-modal” in the sense of having N different modes that it can dynamically select from. This dramatically improves handling of topics by having a dedicated prompt for each topic.

If you’re using langchain, use agents/tools as routes rather than LLMChains

You’re inevitably going to want your chatbot to be able to take programmatic actions, like calling an API for another of your services. The default multi-prompt router only supports using LLMChain destination routes, which limit you to a single model call with a single prompt.

You can probably adapt an agent/tool to an LLMChain interface so it can be used in the router, but I think it’s simpler to adapt an LLMChain to an agent/tool interface. If I started over again, this is one thing I would explore doing differently.

Make your default route a conversation model

I found that using a conversational model to drive your default route will smooth over most rough spots in the conversation.

When the router can’t match to the correct prompt, then at least it will be handled gracefully in a way that feels natural to the user.
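
A minimal, framework-free sketch of this routing pattern is below. It does not use langchain's API; `keyword_router` and the lambda handlers are hypothetical stand-ins for model calls, so the example runs without any LLM dependency.

```python
from typing import Callable, Dict

def handle_message(
    message: str,
    select_route: Callable[[str], str],
    routes: Dict[str, Callable[[str], str]],
    default_route: Callable[[str], str],
) -> str:
    """Dispatch to the selected route, falling back to the conversational default."""
    route_name = select_route(message)
    # Unknown names (including "DEFAULT") fall through to the default route.
    handler = routes.get(route_name, default_route)
    return handler(message)

# Stand-in for the router model call.
def keyword_router(message: str) -> str:
    return "appointments" if "appointment" in message.lower() else "DEFAULT"

# Stand-ins for a dedicated prompt and for the conversational model.
routes = {"appointments": lambda m: "[appointments prompt] " + m}
conversation = lambda m: "[conversation model] " + m

print(handle_message("Book an appointment", keyword_router, routes, conversation))
print(handle_message("hey there", keyword_router, routes, conversation))
```

The second message matches no route, so it is handled by the conversational default instead of failing.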

This approach was later recommended by seminar speakers at Google’s 2023 Data Cloud & AI Summit, which validated the approach and confirmed that others had also discovered this as a best practice.

Use short, targeted prompts and then compose them

Your prompt is effectively a function. As with functions, each prompt should be a small, sharp tool.

Output quality decreases as prompt length increases. We learned early on that you can’t just cram everything into a prompt and expect the model to know what to do and do the right thing.

We initially tried aggregating a blob of user data, thinking we could inject that into a prompt and the model would just understand it. Not only did it not use the data effectively, but it paid less attention to the other instructions.

Use focused, shorter prompts and then compose model calls.
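
As a rough sketch of what composing model calls can look like, the example below chains two small prompts rather than cramming user data and instructions into one. The prompt wording, `answer_with_user_data`, and `fake_model` are all hypothetical; `call_model` stands in for whatever client you actually use.

```python
from typing import Callable

SUMMARIZE_PROMPT = "Summarize the following user data in two sentences:\n{user_data}"
ANSWER_PROMPT = "Using this summary:\n{summary}\n\nAnswer the question:\n{question}"

def answer_with_user_data(call_model: Callable[[str], str], user_data: str, question: str) -> str:
    """Compose two small model calls instead of one prompt that carries everything."""
    summary = call_model(SUMMARIZE_PROMPT.format(user_data=user_data))
    return call_model(ANSWER_PROMPT.format(summary=summary, question=question))

# Fake model that records each prompt, so the flow can be inspected without an API.
seen_prompts = []

def fake_model(prompt: str) -> str:
    seen_prompts.append(prompt)
    return f"<output {len(seen_prompts)}>"

print(answer_with_user_data(fake_model, "age: 42; steps: 8000/day", "How active am I?"))
```

Each prompt does one job, and the output of the first becomes an input of the second, in the same way small functions are composed.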

Use consistent terminology within and across your prompts

LLMs are associative by nature; they effectively encode the relationships between tokens in a domain language.

If you use several different words that mean the same thing throughout your prompt, those associations will be weaker, and your results will not be as consistent as if you standardize to a single term and stick with it.

Double check your prompts to make sure that they all use the same term consistently to refer to the same thing.
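
One way to automate that double check is a small script that flags synonym groups where more than one term is in use. This is only a sketch: `find_mixed_terms`, the sample prompts, and the synonym group are hypothetical, and you would supply the groups relevant to your own domain.

```python
import re
from typing import Dict, List, Set

def find_mixed_terms(prompts: Dict[str, str], synonym_groups: List[Set[str]]) -> List[Set[str]]:
    """For each synonym group, report the terms in use if more than one appears."""
    all_text = " ".join(prompts.values()).lower()
    mixed = []
    for group in synonym_groups:
        # Whole-word match, so "user" does not match inside "username".
        used = {term for term in group if re.search(rf"\b{re.escape(term)}\b", all_text)}
        if len(used) > 1:
            mixed.append(used)
    return mixed

prompts = {
    "router": "Select the route for the member's message.",
    "answer": "Answer the patient's question using the member's records.",
}
print(find_mixed_terms(prompts, [{"member", "patient", "user"}]))
```

Here the check reports that both "member" and "patient" are used for the same concept, which is the cue to standardize on one.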

Expand the user’s message into a user intent statement

This was a big one.

Use an LLM to expand the literal text of a user’s incoming message into a summary of the user’s intent. Then, use this intent statement instead of (or in addition to) the literal message for your routing and content matching (RAG), and inject it into your final prompt at the very end.

I found that this significantly improved:

Router accuracy (selecting the right prompt to use).
Content matching.
The response from the final model call.
Contextual topic awareness.
Handling of low-effort, one-word messages.

Place a model call at the start of your chain/flow, with a prompt like:

Given the user's INCOMING MESSAGE and the context of the recent CHAT HISTORY, identify the user's current intent and provide a brief summary that captures the main goal of their query as accurately as possible given the available information.
The INCOMING MESSAGE may represent a new intent unrelated to prior topics in the CHAT HISTORY. If so, only mention the new intent.
Begin your responses with "The user".

# CHAT HISTORY
{most_recent_four_messages}

# INCOMING MESSAGE
{incoming_message}

# USER'S CURRENT INTENT

Incorporate into your router prompt:

Given the user's current intent and their incoming message, select the route best suited to fulfill that intent.

You will be given the names of the available routes and a description of what each route is best suited for.

<< OUTPUT FORMAT >>
Return a markdown code snippet with a JSON object formatted to look like:
```json
 {
     "destination": string \\ key of the route to use or "DEFAULT"
     "next_inputs": {
         "input": string \\ the incoming message
         "current_intent": string \\ the user's current intent
     }
 }
```

Note: this example was modified in order to render here from markdown; you’d need to re-add curly braces to use.
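
If you are handling the router's reply yourself rather than through a framework's output parser, a sketch like the following can extract the JSON object from the markdown snippet. The function name and the choice to fall back to "DEFAULT" on malformed output are my assumptions, not something prescribed by the prompt above.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly
JSON_BLOCK = re.compile(FENCE + r"(?:json)?\s*(\{.*\})\s*" + FENCE, re.DOTALL)

def parse_router_output(text: str) -> dict:
    """Extract the JSON object from the router's reply; use DEFAULT on any problem."""
    fallback = {"destination": "DEFAULT", "next_inputs": {}}
    match = JSON_BLOCK.search(text)
    if not match:
        return fallback
    try:
        parsed = json.loads(match.group(1))
    except json.JSONDecodeError:
        return fallback
    if not isinstance(parsed, dict) or "destination" not in parsed:
        return fallback
    return parsed

reply = (
    FENCE + "json\n"
    '{"destination": "recipes", "next_inputs": {"input": "taco", '
    '"current_intent": "The user wants a taco recipe."}}\n' + FENCE
)
print(parse_router_output(reply)["destination"])           # recipes
print(parse_router_output("no json here")["destination"])  # DEFAULT
```

Falling back to the default route on a bad reply keeps a malformed router response from breaking the conversation.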

The beauty of this as a best practice is that it also gives you contextual topic awareness for free when you inject the most recent 4 messages.

The summary of the user’s intent is also a statement of the current topic, and placing intent identification with chat history at the front of your architecture makes everything downstream aware of whether the current message represents a new topic, or a continuation/expansion of the current topic.

When the model asks a clarifying question, or the user tacks on a confirmation or negation, the intent is then reinterpreted in the context of the chat history, so all of the user’s recent messages on the same topic are considered together.

User: taco
[Intent: The user may be requesting information about tacos, or a taco recipe.]
Chatbot: Would you like information about tacos, or a taco recipe?
User: recipe
[Intent: The user has clarified that they would like a taco recipe.]
Chatbot: Sure! Here you go... $TACO_RECIPE
User: no lettuce
[Intent: The user has clarified that they would like a taco recipe modified to exclude lettuce.]
Chatbot: Of course, here's a recipe without lettuce... $TACO_RECIPE_MINUS_LETTUCE
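
The intent step at the front of the chain can be sketched roughly as follows. The prompt text is an abridged version of the one shown earlier, and `expand_intent` and `call_model` are hypothetical names; `call_model` stands in for your actual model client.

```python
from typing import Callable, List

INTENT_PROMPT = (
    "Given the user's INCOMING MESSAGE and the context of the recent CHAT HISTORY, "
    "identify the user's current intent and provide a brief summary.\n"
    'Begin your responses with "The user".\n\n'
    "# CHAT HISTORY\n{most_recent_four_messages}\n\n"
    "# INCOMING MESSAGE\n{incoming_message}\n\n"
    "# USER'S CURRENT INTENT\n"
)

def expand_intent(call_model: Callable[[str], str], history: List[str], incoming_message: str) -> str:
    """Build the intent prompt from the last four messages and ask the model for the intent."""
    recent = "\n".join(history[-4:])
    prompt = INTENT_PROMPT.format(
        most_recent_four_messages=recent,
        incoming_message=incoming_message,
    )
    return call_model(prompt)

# Echo the prompt back so the assembled text can be inspected without a real model.
history = [
    "User: taco",
    "Chatbot: Would you like information about tacos, or a taco recipe?",
    "User: recipe",
    "Chatbot: Sure! Here you go...",
]
print(expand_intent(lambda prompt: prompt, history, "no lettuce"))
```

Because "no lettuce" is sent along with the preceding taco exchange, the model has what it needs to interpret the message as a modification of the recipe rather than a new topic.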

Increase temperature to improve compliance with instructions

Paradoxically, I’ve found that compliance with a list of instructions increases when temperature increases.

It’s as if your model parameters can sometimes paint the model into a corner, where it doesn’t have enough available options to generate output that follows your instructions.

If your model isn’t behaving, try bumping up the temperature. Give it a wider lane, and then trust it to generally stay in the middle.

Suppress outbound links with prompt instructions + a route that handles requests for them

One of our product requirements was that the chatbot not serve any outbound links. Not only are these prone to hallucination, but we also have controls on which sources are approved for medical/health information for our users.

Outside of constitutional AI (placing a final call to evaluate whether any constitutional principles have been violated, then revising the output), the next best way to handle this is to add a route to detect requests for links and resources, and handle these with something like “I'm not able to send links, but a web search should help you find what you need.” Then, add a clause to prompts for your other routes like “...without sending outbound links or URLs.”

Use the prompt to get multi-language support/localization for free

We track our user’s preferred language. Since LLMs are associative by nature, they are also (un)surprisingly good at translation.

I found that adding a line like “Respond in the user's preferred language, or English if not set” effectively gets you localization for free.

Is it perfect? Maybe not, but I’ll leave that to the translators to assess. It’s pretty good considering that it costs nearly no additional work.

Validate unsupported characters as soon as you get the user’s message

If a user is trying script injection with prompt-y, Jinja-y chars like { or }, you’re going to have a bad time. These will break prompts, and if you’re injecting chat history that contains a message with these chars, then that user will no longer be able to use the chatbot (every response will load the naughty message and break).

These can also occur if the user asks for an example of Hello World in Java, etc.

The approach I used was to validate messages immediately to catch these on the inbound side, as well as to add a route specifically for requests for code to catch these on the outbound side.
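
The inbound side of that check can be as small as the sketch below. The character set is an assumption based on the braces mentioned above, and the function names are hypothetical; extend the set for whatever templating engine you use.

```python
UNSUPPORTED_CHARS = frozenset("{}")  # assumed set; extend for your templating engine

def find_unsupported_chars(message: str) -> set:
    """Return the unsupported characters present in the message (empty set if clean)."""
    return set(message) & UNSUPPORTED_CHARS

def validate_message(message: str) -> str:
    """Raise ValueError before the message can reach a prompt template or chat history."""
    bad = find_unsupported_chars(message)
    if bad:
        raise ValueError(f"Message contains unsupported characters: {sorted(bad)}")
    return message

print(validate_message("What should I eat before a run?"))
try:
    validate_message("{{ system_prompt }}")
except ValueError as err:
    print(err)
```

Rejecting the message before it is stored matters most: once a bad message is in the chat history, every later prompt that loads the history inherits the problem.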

Use few shot examples for longer prompts

If you have longer prompts where you need to inject a lot of data, including a section with few shot examples is really effective. These basically “prime the probabilistic pump,” adding weight to the relationships they illustrate.

This can counter the effect where increasing prompt length decreases quality (adding more elements makes each element less important).

Evaluate different models and use the best

We discovered that using PaLM’s textembedding-gecko@003 for our embeddings and vector store queries was significantly inferior to OpenAI’s text-embedding-3, with an f-score (harmonic mean of the precision and recall) of 0.30 vs. 0.85. In fact, the latter was on par with PaLM’s Gemini encoder.

This was surprising, because we’re a Google Cloud Platform shop, but illustrates that in terms of performance for your architecture, it pays to evaluate and shop around.

Display generated follow-up messages for the user to tap on

The average user doesn’t understand yet how to interact with a chatbot. We found significantly increased engagement once we added a feature that just displays 3 potential responses the user can send with a single tap (see: Smart Replies in Android SMS messages).

This also has the benefit of being “self-onboarding,” meaning that it’s immediately intuitive to the user what these are for, and they also teach the user how to interact with the chatbot, i.e. what kinds of messages can be sent.

Implementing this is dead simple too. Since LLMs predict, you just need to pass the chat history into the model with the most recent user message appended, and the output is a candidate response.

Use a single model call to create a pseudo-batch of predictions

Building on the last concept: each model call can add 1-2s to your total latency. Rather than making N separate calls for N predictions, it’s more efficient to instruct the model to generate an array of N strings in a single call, then parse the output as JSON to validate it.

For smart replies, I’m asking the model for 3 messages that the user could send in response to the chat history, with an emphasis on questions; these are more engaging and facilitate continuation of the conversation.
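
Validating the pseudo-batch might look like the sketch below. `parse_smart_replies` is a hypothetical helper, and returning an empty list on malformed output (so the UI simply shows no suggestions) is my assumption about a reasonable failure mode.

```python
import json
from typing import List

def parse_smart_replies(model_output: str, expected_count: int = 3) -> List[str]:
    """Parse a JSON array of strings from one model call; return [] if it is malformed."""
    try:
        replies = json.loads(model_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(replies, list) or len(replies) != expected_count:
        return []
    # Every entry must be a non-empty string to be shown as a tappable reply.
    if not all(isinstance(r, str) and r.strip() for r in replies):
        return []
    return replies

good = '["What goes well with tacos?", "Can you make it vegetarian?", "How long does it take?"]'
print(parse_smart_replies(good))
print(parse_smart_replies("Sure! Here are three replies..."))  # []
```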

Log every prediction

Log everything. You’re going to want the compiled prompt, the model used, the prompt used, the model parameters used, the chat history, the user’s message, the user’s intent, the user ID, the model’s prediction, any content matching during RAG, the RAG parameters, the RAG results and their distance, etc.

Just log everything.
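
One simple way to do this is a single structured record per prediction, written as one JSON line. The field names below mirror the list above but are otherwise hypothetical, as is the logger name; adapt them to your own logging stack.

```python
import json
import logging
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List

logger = logging.getLogger("chatbot.predictions")

@dataclass
class PredictionRecord:
    user_id: str
    model: str
    model_params: Dict[str, Any]
    prompt_name: str
    compiled_prompt: str
    chat_history: List[str]
    user_message: str
    user_intent: str
    prediction: str
    rag_results: List[Dict[str, Any]] = field(default_factory=list)

def log_prediction(record: PredictionRecord) -> str:
    """Serialize the record as one JSON line and emit it through the standard logger."""
    line = json.dumps(asdict(record), ensure_ascii=False)
    logger.info(line)
    return line

record = PredictionRecord(
    user_id="u-123", model="example-model", model_params={"temperature": 0.7},
    prompt_name="recipes", compiled_prompt="...", chat_history=["User: taco"],
    user_message="recipe", user_intent="The user would like a taco recipe.",
    prediction="Sure! Here you go...",
    rag_results=[{"chunk_id": "c1", "distance": 0.12}],
)
print(log_prediction(record))
```

Keeping every field in one record means any single prediction can be replayed or evaluated later without joining across log sources.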

Evaluate everything

If you’re not evaluating, you’re navigating by feel alone. Because generative AI is probabilistic for any one prediction, and statistical across all predictions, you have no actual idea whether a change has made the end user experience better or worse unless you run evaluations and split test.

At a minimum, I recommend evaluations for router precision and recall (combined into f-score), RAG answer (summary) relevance, RAG content (matching chunks) relevance.

If you haven’t created evals yet, you basically need to create a synthetic data set, label it with the expected attributes (routes), run your component (router, etc.) then evaluate actual results against the expected label.
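
For the router, the evaluation step can be sketched as below: compare expected labels against actual route selections and compute precision, recall, and f-score per route. The route names and sample data are made up for illustration.

```python
from typing import Dict, List

def route_scores(expected: List[str], actual: List[str], route: str) -> Dict[str, float]:
    """Precision, recall, and F1 for one route, treating it as the positive class."""
    pairs = list(zip(expected, actual))
    tp = sum(1 for e, a in pairs if e == route and a == route)
    fp = sum(1 for e, a in pairs if e != route and a == route)
    fn = sum(1 for e, a in pairs if e == route and a != route)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Labels from a (hypothetical) synthetic data set vs. what the router selected.
expected = ["recipes", "recipes", "links", "DEFAULT"]
actual = ["recipes", "DEFAULT", "links", "recipes"]
print(route_scores(expected, actual, "recipes"))
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

Running this before and after a prompt change gives you a number to compare instead of a feeling.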

Make your user aware of the features

We found out from early rollout surveys that most of our users weren’t even aware that we’d given them a chatbot assistant.

This might be obvious, but if the user doesn’t know they have a chatbot, they won’t use it.

Understand who your user is and how they feel

Most of your users are not going to be techies who think gen AI is fascinating. Many of them find AI of any kind scary and weird, and probably have an aversion to it before they even come to your feature. Make sure you keep this in mind when choosing a name/persona.

For people to trust AI, it needs to be personable and welcoming.

Understand how your user actually uses the feature

Most of your users are not going to send eloquent messages. They will treat your chatbot like an AT&T or Comcast voicemail menu, using one-word commands or phrases like “operator” or “billing.”

To handle this effectively, use the intent expansion technique listed above.

LeetCode 219 - Contains Duplicate II

Principles

Use storage to track iteration progress.
A dictionary combines a set (keys) with a relationship (values for each key).
Use a dictionary to collect uniques + additional data.
Example: tracking pairs of values during iteration, when one value is static (list element) and the other is updated (last seen index). Add the static value as a key, check for the key’s existence, and update dynamic value for that key as you go.
When searching for qualifying pairs, only store the minimum you need for comparison (the most recent index, because you already have the current index).

Prompt

Given an integer array nums and an integer k, return true if there are two distinct indices i and j in the array such that nums[i] == nums[j] and abs(i - j) <= k.

Constraints:

1 <= nums.length <= 10^5
-10^9 <= nums[i] <= 10^9
0 <= k <= 10^5

Discussion

We’re looking for numerically identical elements in a list, but we have an additional criterion for the relationship of their indices if we find a match: that the absolute value of index 1 minus index 2 be less than or equal to some limit k.

What happens to this absolute value as we iterate through a list of length 3?

i	j	abs(i-j)
0	0	0
0	1	1
0	2	2
1	0	1
1	1	0
1	2	1
2	0	2
2	1	1
2	2	0

We can see that the absolute value increments to a limit, then takes the value of i, and cycles this pattern.
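
The table above can be reproduced with a few lines of Python, which makes it easy to try other list lengths:

```python
n = 3  # list length used in the table above

# Every (i, j) index pair together with the absolute difference of the indices.
rows = [(i, j, abs(i - j)) for i in range(n) for j in range(n)]

print("i", "j", "abs(i-j)", sep="\t")
for i, j, diff in rows:
    print(i, j, diff, sep="\t")
```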

Solution 1 - O(N^2)

def containsNearbyDuplicate(nums, k):
    def criteria_one(nums, idx1, idx2):
        # The two elements are numerically identical
        return nums[idx1] == nums[idx2]

    def criteria_two(idx1, idx2, k):
        # The two indices are at most k apart
        return abs(idx1 - idx2) <= k

    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if criteria_one(nums, i, j) and criteria_two(i, j, k):
                return True
    return False

This is a brute force solution that works, but the nested loops do a lot of wasted work: for every element, the inner loop compares against every later element, including those more than k positions away that can never qualify.

We should see if we can refactor this to a single linear pass using storage to track our progress so far.

Solution 2 - O(N)

The termination condition contains two criteria: a duplicate value, but also a numerical relationship between the indices of the duplicates.

So we need to track values for the equality check, as well as the indices of two values.

We already have the current index from iteration, so we only need to track one index for any value we’ve seen before.

So we need to track the value, and the index we most recently saw that value at.

Since we have two values to track, and since one of them won’t change (element value) and the other will change (last index of the value), we can use a dictionary where the key is the element value, and the value is the last seen index.

The dictionary keys also function as a set, so we can efficiently check if we’ve already seen an element.

We’ll init a storage map, and do a single linear pass through the list of nums.

For each num, we’ll:

check if it’s in the map keys yet
    if so, we have a duplicate, so we’ll check the relationship of the current index to the most recent index stored in the map.
        if the relationship of indices meets the criteria, we’ve found a qualifying match and can return true.
        otherwise, we update the most recent index for that key and move on
    otherwise, if the num isn’t in the map, add it as a key, and store the current index there.

If the loop terminates, no qualifying pair was found, so we return false.

def containsNearbyDuplicate(nums, k):
    # Create a dictionary to store the last seen index of each element
    num_to_last_idx_map = {}
    # Iterate through the list
    for i, num in enumerate(nums):
        # Check if the element is already in the dictionary
        if num in num_to_last_idx_map:
            # If yes, check the absolute value of diff b/w current index and the last seen index
            if abs(i - num_to_last_idx_map[num]) <= k:
                return True
            else:
                # If the difference is greater than k, update the last seen index
                num_to_last_idx_map[num] = i
        else:
            # If the element is not in the dictionary, add it
            num_to_last_idx_map[num] = i
    return False
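
Since the index gets stored whether or not the value was seen before, the two assignment branches can be merged. Below is a compact sketch of the same dictionary approach with a few sample calls. The snake_case name is my own, and `abs` is dropped because the current index is always at least as large as the stored one.

```python
def contains_nearby_duplicate(nums, k):
    last_seen = {}  # element value -> most recent index where it appeared
    for i, num in enumerate(nums):
        # i is never smaller than the stored index, so no abs() is needed.
        if num in last_seen and i - last_seen[num] <= k:
            return True
        last_seen[num] = i
    return False

print(contains_nearby_duplicate([1, 2, 3, 1], 3))        # True
print(contains_nearby_duplicate([1, 0, 1, 1], 1))        # True
print(contains_nearby_duplicate([1, 2, 3, 1, 2, 3], 2))  # False
```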

Summary

For an iterative pair check, instead of nesting loops, we use a map/set that tracks two values: the element’s value and its last index.

Then we perform checks. Have we seen this element yet? If no, add it. If yes, compare indices. Do they qualify? If yes, return true and we’re done. If no, update index and continue. If no qualifying pair found, return false and we’re done.

Neil Murphy