Training AI and LLMs to answer the tough questions

Feb. 26, 2024 / By Nathan Brake

I’ve been a long-time user of Mint for its free money management and budgeting tools. I’m sure it comes as no shock that someone working in machine learning is interested in seeing all the charts and statistics that provide a comprehensive picture of my family’s financial health. Intuit recently announced that Mint is being shut down, so I began the search for a new budgeting tool.

After some research, I found a new budgeting tool that I liked and successfully imported and configured all my data. As I was learning about the features the new tool offered, I was excited to see a new beta feature available: an artificial intelligence (AI) integration with OpenAI’s GPT-4! The feature was meant to let me ask any question about my finances and have the answer explained using my own data. After chatting with it for a bit, I found it could answer questions I could already easily answer elsewhere in the app, like, “How much did I spend at coffee shops this month?”

However, it failed to answer the hard questions that would have actually been helpful to me, like, “How much money did I get from all my investment dividends in Q4 2023?” In other words, it was great at telling me things I already had an easy way to figure out, but it wasn’t yet able to do the kind of advanced analysis that would really help me.

This example is from personal finance, but if you’re like me, I suspect you’ve had similar experiences with large language models (LLMs). It takes a bit of work to understand what questions an LLM can answer, as well as how those questions need to be worded to get the response you need. LLMs are generally excellent at reasoning about and explaining well-known information (e.g., “Explain osteoporosis to me like I’m a toddler”), but they can struggle when it comes to synthesizing your data. Why is this? Although this is an active area of research and different methods abound, there are currently two popular high-level approaches to achieving acceptable performance on a customer’s data.

  1. Train on the data

The easiest way to increase performance on your data is to make that data part of the training set. When an LLM is trained, it’s taught how to generate text from a massive pile of text data. (I’m glossing over many details here, but let me know if you’re interested in digging in!) If your data is added to that pile, the LLM will get much better at reproducing it and telling you about it. There are lots of options for how to do this, but generally, your data can be included either in the pre-training step (the beginning of model training) or the fine-tuning stage (the end of model training). Training the model on your data falls pretty well in line with the common methods for speaker adaptation in the world of speech recognition.
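To make the fine-tuning route concrete, here is a minimal sketch using the Hugging Face transformers and datasets libraries with a small open model as a stand-in. The model name, the “my_finance_notes.txt” file, and the hyperparameters are illustrative assumptions, not a recommended recipe.

```python
# Minimal fine-tuning sketch: adapt a small causal language model to your own text.
# The model name, data file, and hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # stand-in; any causal LM checkpoint follows the same pattern
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# "my_finance_notes.txt" is a hypothetical file containing your own data.
dataset = load_dataset("text", data_files={"train": "my_finance_notes.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()  # the model's weights are updated using your data
```

After training, the updated weights carry your data with them, which is exactly why the privacy considerations discussed below come into play.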

This method works well, but there are several drawbacks. First, these methods can quickly balloon in expense, depending on the size of the network. Training a GPT-4-level model from scratch (the pre-training stage) is enormously expensive: OpenAI recently announced it would help large organizations do this, and the sign-up form includes a checkbox to acknowledge that “pricing starts at $2-3 million.” Even if you only want to train smaller “adapter” models (fine-tuning) for individual users, training, updating, and maintaining all of those models carries a real cost. Second, if your data is used to train anything, it has now been embedded and exported into the new model.

Andrej Karpathy (previously director of AI at Tesla, now at OpenAI) has an amazing and accessible video explanation of LLMs, in which he explains that a good way to think of LLM training is “data compression.” The text data that you provide to the model is “compressed” during training into the model’s knowledge. Because of this, additional security precautions are needed to ensure that a model trained on your data is not used in unintended situations, since that could result in data leakage.

  2. Retrieval-augmented generation (RAG)

RAG resolves the privacy and cost concerns associated with training on customer data by injecting the necessary information into the actual prompt sent to the LLM. For example, if the user asks an LLM like ChatGPT, “Explain osteoporosis to me like I’m a toddler,” a RAG approach might first run a Wikipedia search for the term “osteoporosis” and then insert the first paragraph of the Wikipedia page into the prompt, so that the actual question provided to the LLM is something like:

“Something I found from Wikipedia that may be useful is that osteoporosis is a systemic skeletal disorder … [The rest of the Wikipedia paragraph goes in here]. My question is, Explain osteoporosis to me like I’m a toddler.”

With this approach, the LLM has now been given the technical information it needs (what osteoporosis is), and all it must do is explain that information in terms a toddler could understand. A good demo of the technical details of the approach is written here. Instead of Wikipedia, you can substitute a query to your own database. With this method, personal information can be provided to the LLM without handing your personal data to the owner of the LLM for use in their model training. In my initial example of the personal finance app, this is almost certainly the technique being used: my data was stored by the finance company, and they selectively injected information about my finances into the prompt to GPT-4 when I asked a question.
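As a rough illustration of the plumbing, here is a minimal RAG sketch. The retrieve_context function and the hard-coded dividend summary are hypothetical stand-ins for whatever search or database lookup your application actually performs; the OpenAI client call simply shows how the retrieved text gets prepended to the user’s question.

```python
# Minimal RAG sketch: retrieve relevant text, then inject it into the prompt.
# retrieve_context() and its hard-coded return value are hypothetical stand-ins
# for whatever search or database lookup your application actually uses.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def retrieve_context(question: str) -> str:
    # Hypothetical retrieval step: e.g., a Wikipedia search, or a SQL query
    # against your own transaction database. Hard-coded here for illustration.
    return "Q4 2023 dividend payments: Fund A $120.50, Fund B $88.10, Fund C $45.00."

def answer_with_rag(question: str) -> str:
    context = retrieve_context(question)
    # The retrieved context is placed into the prompt ahead of the user's question.
    prompt = (
        f"Something I found that may be useful is: {context}\n"
        f"My question is: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer_with_rag(
    "How much money did I get from all my investment dividends in Q4 2023?"))
```

The quality of the answer hinges on that retrieval step: if the lookup misses the relevant records, the LLM has nothing useful to work with, which is the failure mode I ran into with my budgeting app.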

The advantages of this technique are enhanced data privacy and reduced overall cost, since it’s no longer necessary to train a model on your data. However, performance may be worse than with fine-tuning, since RAG relies on successfully inserting the correct, relevant information into the prompt. A recent paper investigated the performance of RAG and fine-tuning approaches on agricultural data and found benefits to using both.

Our daily work can be positively impacted as LLMs further improve on tasks that require a high level of proprietary customer knowledge. RAG and fine-tuning will continue to make an impact across consumer integrations, and I look forward to the day when a standard application integration can handle even the most difficult question about my data.

Nathan Brake is a machine learning engineer and researcher at 3M Health Information Systems.