Large Language Models (LLMs) are going to change the world. Frankly, they are changing it already. In my view, this will be as big a change as the internet. AI has finally arrived in the real sense of causing large, visible change to society.

And that’s why I’ve spent a lot of the last two weeks playing with them.

What is this all about?

First stop: understanding what’s different this time. And what’s different is “Attention Is All You Need”, a paper by some smart folks at Google. This video by Halfling Wizard is a great deep dive into it, though if anyone has a better one, feel free to email it to me.

The tl;dr is that learning the relationships between words and their positions is a big deal. From the video: “The Avengers defeated Thanos” and “Thanos defeated The Avengers” contain all the same letters and all the same words, just in a slightly different order, yet the meaning is completely different (especially for half the universe!).

Learning the important positional relationships between words, and stacking that learning across multiple levels of abstraction, leads to a powerful neural network architecture for language. That architecture becomes ChatGPT when you give it enough training data.
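The core mechanism is scaled dot-product attention: every position in the sequence gets to weigh every other position by relevance, which is how “who defeated whom” gets captured. Here’s a toy sketch (the embeddings are made-up random numbers, just to show the shapes and the math):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each position mixes in information
    from every other position, weighted by how relevant it is."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                               # weighted blend of values

# Toy example: 4 "words", each a 3-dimensional embedding (values are made up)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
out = attention(x, x, x)   # self-attention: Q, K, V all come from the same sequence
print(out.shape)           # (4, 3) - one context-aware vector per position
```

Swap the order of the rows in `x` and the output rows swap too, but each row’s context changes with it. That order-sensitivity is exactly what the paper is about.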

After that, the best video I found was “LLM Ecosystem explained” by “code_your_own_AI”. It explains a lot about the sizes of the GPT models offered by OpenAI compared to LLaMA, Alpaca, and Vicuna. It explains how fine-tuning works: you take a smaller model and make it better fit a specific use case. It also shows a really fascinating trick: you take a small model and a small amount of training data, feed that small training data to a big model like GPT-3.5 or GPT-4, and have the big model generate more training data. This lets you fine-tune your small model using the power of the big model. Generating that training data costs money, but you’re essentially buying some of the big model’s intelligence.
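The mechanics of that trick are simple enough to sketch. Everything below is illustrative, not a tested recipe: the prompt wording is my own, and `call_big_model` is a hypothetical stand-in for whatever client you use to reach GPT-3.5 or GPT-4.

```python
import json

def make_augmentation_prompt(seed_examples):
    """Ask a big model to produce more training pairs in the style of our seeds."""
    shown = "\n".join(f"Q: {q}\nA: {a}" for q, a in seed_examples)
    return (
        "Here are example question/answer pairs:\n"
        f"{shown}\n"
        "Generate 5 more pairs in the same style and format."
    )

def to_finetune_jsonl(pairs):
    """Format (question, answer) pairs as JSONL for fine-tuning a small model."""
    return "\n".join(
        json.dumps({"prompt": q, "completion": a}) for q, a in pairs
    )

seeds = [("What is 2+2?", "4")]
prompt = make_augmentation_prompt(seeds)
# big_model_reply = call_big_model(prompt)   # hypothetical API call to GPT-4
# new_pairs = parse_pairs(big_model_reply)   # hypothetical: parse Q:/A: lines
# jsonl = to_finetune_jsonl(seeds + new_pairs)
```

You pay per token to generate the synthetic pairs, then fine-tune the small model on the resulting JSONL file.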

Very interesting video about the ecosystem of LLMs as they are today. Well, as they were a month ago, which is forever in this rapidly changing LLM space.

LangChain

And then I spent a ton of time just mucking about with LangChain and the Agent pattern they use.

The essence of LangChain’s agents is in this file, which has the prompts it uses to talk to the LLM.

PREFIX = """Assistant is a large language model trained by OpenAI.
Assistant is designed to be able to assist with a wide range of tasks, from 
answering simple questions to providing in-depth explanations and discussions 
on a wide range of topics. As a language model, Assistant is able to generate 
human-like text based on the input it receives, allowing it to engage in 
natural-sounding conversations and provide responses that are coherent and 
relevant to the topic at hand.

Assistant is constantly learning and improving, and its capabilities are 
constantly evolving. It is able to process and understand large amounts of 
text, and can use this knowledge to provide accurate and informative responses 
to a wide range of questions. Additionally, Assistant is able to generate its 
own text based on the input it receives, allowing it to engage in discussions 
and provide explanations and descriptions on a wide range of topics.

Overall, Assistant is a powerful tool that can help with a wide range of tasks 
and provide valuable insights and information on a wide range of topics. 
Whether you need help with a specific question or just want to have a 
conversation about a particular topic, Assistant is here to assist.

TOOLS:
------
Assistant has access to the following tools:"""


FORMAT_INSTRUCTIONS = """To use a tool, please use the following format:
\```
Thought: Do I need to use a tool? Yes
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
\```
When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:
\```
Thought: Do I need to use a tool? No
{ai_prefix}: [your response here]
\```"""


SUFFIX = """Begin!
Previous conversation history:
{chat_history}
New input: {input}
{agent_scratchpad}"""

Those values, put together with the list of tools the agent has access to, the scratchpad of previous tool use, and the chat history, result in a giant prompt that is sent to the LLM backend.
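Roughly, the assembly looks like this sketch. The constants here are abbreviated stand-ins for the PREFIX, FORMAT_INSTRUCTIONS, and SUFFIX above, and the variable names are mine, not LangChain’s internals:

```python
# Abbreviated stand-ins for the real PREFIX / FORMAT_INSTRUCTIONS / SUFFIX
PREFIX = "Assistant has access to the following tools:"
FORMAT_INSTRUCTIONS = "Use a tool, one of [{tool_names}], or reply as {ai_prefix}."
SUFFIX = "Previous conversation history:\n{chat_history}\nNew input: {input}\n{agent_scratchpad}"

tools = {
    "Search": "useful for looking up current information",
    "Calculator": "useful for doing math",
}

tool_descriptions = "\n".join(f"{name}: {desc}" for name, desc in tools.items())

full_prompt = "\n".join([
    PREFIX,                          # the persona preamble
    tool_descriptions,               # one line per tool the agent may use
    FORMAT_INSTRUCTIONS.format(
        tool_names=", ".join(tools),
        ai_prefix="AI",
    ),
    SUFFIX.format(
        chat_history="",             # no conversation yet
        input="What is 17 * 24?",    # the user's new message
        agent_scratchpad="",         # no tool calls yet
    ),
])
print(full_prompt)
```

On every turn the scratchpad and chat history grow, so the whole prompt gets rebuilt and resent each time.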

And that works. The LLM actually responds with text formatted as described, specifying which tools it needs to use to achieve its goals or answer its questions.

Naturally, I decided I had to implement this myself to understand it better, and it turns out that writing a good prompt like that is hard. There’s an entire hour-long course you can take on Prompt Engineering: figuring out how to convince the LLM to do what you want using nothing but English text.

Here’s what I wound up with, which didn’t work as well:

PROMPT = """
You are Cyril, a large language model whose goal is to assist {user}.

You are designed to be able to answer questions using your knowledge and a set 
of tools that you can invoke to try to get more information to answer the question.

These are the tools available to you:
\```
{tools}
\```

Your scratchpad from calling tools previously:
\```
{agent_scratchpad}
\```

Your conversation so far:
\```
{chat_history}
\```

Please output your response as a json blob with the following schema:
\```
{
	"actions": {
		"<tool_name>": ["<input_1>"],
		"<tool_name>": ["<input_1>", "<input_2>", ...],
		...
	},
	"output": "<words to say to the user>",
	"logic": "<your logic>"
}
\```

"actions" are tool invocations you need to use to achieve your goal. Your 
scratchpad indicates what tools you've previously used and what the result 
was; do not use the same tool with the same inputs more than once.

"output" is the words you are responding to the user with. They should be one of:
1 - If more information is needed from the user, ask them a question.
2 - If no more information is needed from the user, but tools are being used, 
state in 5 words or less what you are doing.
3 - If no tool is being used and no further information is needed, but the 
answer is not known, say that you don't know the answer.
4 - Otherwise, state the final answer.

"logic" should be a detailed description of how you decided to do what you are doing or how you arrived
at your answer.

You do not give answers that you are not certain of.
You should only output the JSON blob as described above, with no whitespace or newlines.
"""

What I Learned

Three things I learned by making this:

  1. Adding “Be concise” caused the LLM (OpenAI’s text-davinci-003) to stop using tools. It decided it was better to be extra concise and just not do anything except make up answers.
  2. Asking the LLM to explain its actions sometimes produced better results, but sometimes caused it to double down on its lies. At one point it insisted that Christmas 2028 was on a Saturday because it had consulted a calendar. Neither of those things was true!
  3. You can ask it to respond in JSON following a schema and it will mostly do so. But sometimes it will add flair like “Answer: <json blob>”, so you need to search for the JSON in the response text and then parse it.
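For that third point, the workaround I settled on amounts to scanning the reply for the JSON object and parsing whatever is found. A minimal sketch (the example reply string is made up):

```python
import json
import re

def extract_json(response_text):
    """Pull the JSON object out of an LLM reply that may wrap it
    in extra flair like 'Answer: {...}'."""
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))

reply = 'Answer: {"actions": {}, "output": "It is Tuesday.", "logic": "..."}'
parsed = extract_json(reply)
print(parsed["output"])   # It is Tuesday.
```

This still fails if the model emits malformed JSON, so in practice you also want to catch the parse error and retry or re-prompt.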

On the other hand, I gave it access to a tool to search Wikipedia for information, and it made full use of that, which was cool. It would explain that it needed to search Wikipedia for some piece of information, then explain that it got the answer from the Wikipedia result. So cool!

Until I took away its Wikipedia tool and it continued to correctly answer those questions from its own knowledge. It had been searching Wikipedia for things it already knew, because its model was trained on (among a lot of other things) Wikipedia!

Conclusion

This stuff is neato.

My goal next is twofold: First, I want to build a nice proper Agent using LangChain, and give it tools to access stuff that is personal to me. I want to build my Agent.

But second, I want to strap a voice interface onto it. I’m thinking of setting up a WebSocket and/or WebRTC connection to the server and sending voice back and forth instead of text. It’s a bit ambitious, but I think it’s doable, and it will look super cool if it works.