I want to outline what I think LLMs are doing and my interpretation of how they are doing it. Some basic information is needed just to set the context, but this is not a "basic introduction to LLMs" style of post. I want to go deeper into the possibilities and problems with reasoning and other capabilities that I see.
On concepts and the embedding space
The way I think about concepts in LLMs is that they are the token probabilities inherent in the source documents used in the training data. For example, if I ask "what is Penguin?", there are at least two "concepts" an LLM can zoom in on. One is Penguin Books, the book publisher, the other is the animal penguin. Which one it responds with comes down to which is more probable / present in the pre-training data and in the fine tuning / RLHF data. If I add "what is penguin book?" at the end, the word "book" should narrow the LLM down to Penguin the publisher, and it will then generate a response shaped by the fine tuning (most modern LLMs are fine-tuned on question/answer pairs, which could lead it to generate an explanation / description of what Penguin Books is; that format doesn't necessarily come from the pre-training data but rather from fine tuning / RLHF, which I'll get back to). The "concept" of Penguin is an inherent property of the source documents, and this is why having high quality, factual data is important. Imagine documents that were complete nonsense but grammatically correct, like "The penguin stood on the book and flew over the bird's nest next to the blu-ray bookshelf", and you trained only on millions of those random ones: the model would blend concepts and not be able to make heads or tails of any of them.
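To make this concrete, here is a toy sketch in Python. Nothing in it is a real model: the two candidate concepts and their probabilities are numbers I made up purely to illustrate how an extra context word like "book" shifts the probability mass between the two senses of "penguin".

```python
# A toy sketch (not a real model): hand-picked numbers illustrating how extra
# context words shift probability mass between two senses of "penguin".

contexts = {
    # P(concept | prompt) -- the values are invented for illustration
    "what is penguin?":      {"the animal": 0.70, "Penguin Books": 0.30},
    "what is penguin book?": {"the animal": 0.05, "Penguin Books": 0.95},
}

for prompt, dist in contexts.items():
    best = max(dist, key=dist.get)
    print(f"{prompt!r} -> most probable concept: {best} (p={dist[best]:.2f})")
```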
On training the model on more data
The idea behind training on more data is that it should help the model differentiate between concepts, and if you add fine tuning / RLHF data on top, it should provide more ways to answer a question, more ways to format the output, and in general more ways to respond.
Fine tuning vs prompts
In the embedding space, it seems like the more documents specific to a topic you have, the more the LLM can "zoom in" on a concept and then generate probabilities that closely match the source documents. I think this is because the larger your model is, the more "local embedding spaces" it has available to capture the source documents it is trained on. This is why I think there are roughly two modes the LLM can operate in. One is a very large model with an extremely large amount of high quality pre-training documents, where it can capture the local probabilities of each document well, but where you need very long and specific prompts to "zoom in" on those local documents. The other is an extremely large amount of fine tuning, with many high quality examples of how to answer every possible question in every possible way, which allows much shorter prompts but requires far more fine tuning.
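As a loose illustration of what I mean by "zooming in", here is a small sketch with made-up 3-dimensional "embeddings". The document names, vectors and prompt vectors are all invented; the only point is that a more specific prompt lands closer (by cosine similarity) to one local cluster of documents than a vague one does.

```python
# A minimal sketch with invented 3-d "embeddings". The vague prompt sits almost
# equally close to both senses of "penguin"; the more specific prompt lands
# clearly inside the publisher's local cluster.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = {
    "penguin_biology_article": np.array([0.9, 0.1, 0.0]),
    "penguin_books_homepage":  np.array([0.1, 0.9, 0.1]),
    "antarctic_travel_guide":  np.array([0.7, 0.0, 0.3]),
}

prompts = {
    "what is penguin?":                     np.array([0.50, 0.50, 0.10]),
    "what is penguin books the publisher?": np.array([0.15, 0.95, 0.05]),
}

for text, vec in prompts.items():
    sims = {name: round(cosine(vec, doc), 2) for name, doc in docs.items()}
    print(f"{text!r} -> {sims}")
```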
On what is in the training data vs what is not, and generalization
Consider if I invent a programming language right now called "Stochastica", and I create a compiler for it called "Stochasticus". I put a document on my website with a general description and a FAQ of common questions. The concepts in this data are not in GPT-4 at all, but the sentences, the FAQ structure, the general description and so on very likely are, because of other languages and compilers. Now if I ask GPT-4 "what is the difference between the Stochastica programming language and the Stochasticus compiler?", it might actually generalize from other programming languages and produce sentences that correspond to the token probabilities of other languages' websites and FAQs. It can do a drop-in replacement of "Stochastica" with "C++" or whatever (I didn't test this exact case, but it does do it for various things). Now you might say, is that really generalization? Isn't that more like a hallucination? In my opinion, yes and no. It is a generalization, because it generalizes the idea of a compiler and a programming language, and the answers about them; the only problem is the delta between the features of the two languages. It got several things right, like the fact that this is a question about the difference between a programming language and a compiler, but it answered about the wrong programming language. This is because the concepts of Stochastica etc. are not in the data, while the other things are. If Stochastica were in the data, it could "zoom in" on that local embedding space in the documents and get token probabilities that correspond more closely to the source documents about it. The upside is that any similarities between Stochastica and C++ are captured in the response, even though it knows nothing about Stochastica.
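The "drop-in replacement" behaviour can be caricatured as filling a learned answer template with whatever names appear in the prompt. The template text below is invented; it only shows how the same learned structure reads as generalization for a known pair and as a plausible-sounding guess for an unknown one.

```python
# A sketch of the "drop-in replacement" behaviour: an answer template learned
# from well-known language/compiler pairs gets reused for an unknown pair,
# which reads as generalization and as hallucination at the same time.

TEMPLATE = (
    "{lang} is a programming language: a set of syntax rules and semantics for "
    "writing programs. {comp} is a compiler: a tool that translates {lang} "
    "source code into an executable form."
)

# A pairing the model has "seen" plenty of ...
print(TEMPLATE.format(lang="C++", comp="g++"))
# ... and a pairing it has never seen, slotted into the same learned structure.
print(TEMPLATE.format(lang="Stochastica", comp="Stochasticus"))
```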
This is why it can be problematic to talk about what is or isn't in the training data when interpreting output. Most sentences an LLM produces are not verbatim in the training data, but the concepts, the general sentence structure, and the output format (from fine tuning) generally are.
Another kind-of, sort-of form of generalization is to use the overall probability of a sentence from the prompt occurring in the training data, and to shape the response when that probability is low (I think this comes from fine tuning, but I'm not sure).
For example, if I ask "I was in a taxi and there was an elephant inside, what do I do?", the chance that the training data contains sentences expressing this is relatively low. The model can pick up on that and respond that "it is rare to see an elephant in a taxi". Using this sort of "negative probability" you can also get some interesting responses about the likelihood, or even the falsehood, of something, which should in turn constrain the rest of the generated tokens as well.
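Here is a toy sketch of this "negative probability" idea: score a sentence by how likely its word pairs are under a tiny bigram model and flag the surprising ones. The mini corpus, the smoothing and the threshold are all invented for illustration; a real LLM would do something far richer, if it does this at all.

```python
# A toy "how surprising is this sentence?" scorer using a tiny bigram model
# with Laplace smoothing. Everything here is invented for illustration.
import math
from collections import Counter

corpus = (
    "i was in a taxi and the driver was friendly . "
    "i was in a taxi and there was a dog inside . "
    "the elephant was in the zoo and the elephant ate leaves ."
).split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = len(unigrams)

def avg_logprob(sentence):
    words = sentence.lower().split()
    total = 0.0
    for w1, w2 in zip(words, words[1:]):
        p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)  # smoothed bigram
        total += math.log(p)
    return total / max(len(words) - 1, 1)

for s in ["there was a dog inside the taxi",
          "there was an elephant inside the taxi"]:
    lp = avg_logprob(s)
    flag = "unusual" if lp < -2.7 else "ordinary"  # threshold picked by eye
    print(f"{s!r}: avg log-prob {lp:.2f} -> {flag}")
```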
On problems and ways of generalizing and reasoning
One example that shows up is "what is heavier, a kilogram of rocks or a pound of feathers?"
Now one shallow way to generalize this is plain pattern matching: you have a bunch of examples like "what is heavier, a kilogram of (WORD) or a pound of (WORD)", with many repetitions of the phrase using different words. The model can then generalize the expression to fit any word (setting aside the fact that humans intuitively think of feathers as lighter). However, this is a shallow type of reasoning; we want it to really understand these concepts.
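Spelled out, the shallow route looks something like this: the surface template is recognized and filled in, with no model of what the words actually weigh. The regex and the canned answer are invented for illustration.

```python
# The "shallow" pattern-matching route: match the surface template and fill in
# a memorized answer shape, regardless of what the words refer to.
import re

PATTERN = re.compile(r"what is heavier, a kilogram of (\w+) or a pound of (\w+)\?")

def shallow_answer(question):
    m = PATTERN.match(question.lower())
    if m:
        a, b = m.groups()
        return f"A kilogram of {a} is heavier than a pound of {b}."
    return "I don't know."

print(shallow_answer("What is heavier, a kilogram of rocks or a pound of feathers?"))
print(shallow_answer("What is heavier, a kilogram of books or a pound of anvils?"))
```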
Well, one thing to remember in this case is that the model can only do this as a secondary side effect of producing the most probable tokens given the training data and the prompt. As humans, we can mentally picture the weight of feathers, how a feather feels light in our hands, how a rock is heavier and bulkier, and then use our rational mind to recall what the words kilogram and pound mean, override our intuitive experience of the objects, and come to a calculated answer that ignores those intuitions. For one, I don't think we would want an LLM that had to do it in this roundabout way, but second, if it were to use other modalities, the relationship between those modalities in its model would have to be different than in humans (for efficiency's sake, among other things).
The second thing is it’s not obvious how to create text that can in a generalized way represent all the different properties of objects without just listing out all those properties. At some point in the training data, you need to have the literal text a pound is heavier than a kilogram, or something like that. I don’t think it’s possible to for example have a table that lists kilograms and pounds, and example weights as numbers, and then have the LLM deduce from that table that one number is higher than another and that a higher number means it weighs more (though maybe it is?). Even if it could, I imagine it would be very hard to have at the prompt stage, the right prompt to trigger all this, considering I don’t think local concepts are automatically promoted to “global concept” (as in, consider 2 different prompts: “What is the python programming language?” and “how do i create a function in python?”. If you give those 2 prompts to the LLM separately, they don’t automatically connect “python” from each as the same concept. They will be local to the local embedding space the prompts trigger, and if you want them combined you have to include them both in the prompt at the same time or do a Chat style thing, but then you are limited to context length). To circumvent context length you would have to manually combine all the different ways you can prompt it to compare feathers and pounds and kilograms and rocks in the training data so that all those different prompts trigger their own self contained local embedding space that include all the needed information or have a large context length or some kind of streaming endless prompt thing (which is still fundamentally limited context length because of ram etc, but I guess humans also don’t have unlimited context length either, but they do have much more coherent relationships between concepts and how to relate them in such a “streaming mode”)
On hallucinations
Taking the above into account, the LLM still can't reproduce a source document verbatim (at least not most of the time). Even with very specific prompts and very specific fine tuning, there will still be leakage from other local embeddings, though more training data and more fine tuning should in general help with this. This is not only because of temperature and sampling method, but also, I think, an inherent property of the model. There is no way I'm aware of to create a completely clean embedding space where every concept is neatly separated, even at temperature 0. So in my view, a hallucination is when there is "leakage" and the token probabilities produce a word or sentence that is not in the correct "local embedding space". If a concept like Stochastica doesn't exist in the data at all, it might hallucinate an entire response; if Stochastica exists and has a lot of data, it might hallucinate a single word or function while everything else is correct. I'm not sure how to make every single probability for every token in every response always correct and accurate, but as said, more data should help. Imagine a training set with 10 billion high quality, manually verified JavaScript snippets: there would be a lot of solid probabilities about JavaScript in there.
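As a toy illustration of why temperature alone doesn't remove leakage, here is a softmax over made-up logits for three candidate next tokens. Lowering the temperature sharpens the distribution but never cleanly separates the concepts, and at temperature ~0 (argmax) a token from the wrong local space can still win if its logit happens to be highest.

```python
# Softmax with temperature over invented logits: sharper at low temperature,
# but the relative ordering (and thus the argmax) is fixed by the logits.
import math

def softmax(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["parseJSON", "parse_json", "JSON.parse"]  # one of these "leaked" in
logits = [2.1, 1.9, 2.2]                            # invented scores

for t in (1.0, 0.5, 0.1):
    probs = softmax(logits, t)
    print(f"T={t}: " + ", ".join(f"{tok}={p:.2f}" for tok, p in zip(tokens, probs)))
# As T -> 0 this collapses onto "JSON.parse", the highest logit, which may or
# may not be the token the correct local embedding space would produce.
```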
On the way LLM’s think vs humans (the loop)
There is some discussion about the difference between humans and LLMs, in particular regarding the way LLMs produce the next token in a loop, based on the probability of each token given the current context. A human also has to have a loop to produce the next thought, and that thought is influenced by the previous one.
The key thing here, I think, is not so much the method (well, a little bit, which I'll get back to) but the data. An LLM is sentence-based and token-based, and the way it generates tokens is based on these probability spaces captured from the source documents. A human (to the best of my knowledge) has a set of patterns (100 million of them?) and then a set of computations to run on those patterns.
There are sort of 2 problems:
1) The longer a prompt and a response get, the more the probability space is constrained toward the source documents. When the LLM starts to generate a response about, I don't know, the war in Iraq, it will start with the first token based on some criteria (like the fine tuning), and then as it generates, it will "zoom in" on specific documents/embeddings, which narrows down the future probabilities (also depending on temperature). A human, on the other hand, seems able in theory to generate almost infinitely. I bet if I asked a human to generate a bunch of sentences, no matter how unrelated, they could do it endlessly (Trump ate a burger then flew to Washington in a Boeing 747, and one passenger on the plane was a woman who came from Kentucky, which is not related to Kentucky Fried Chicken, nor is it related to farms… etc. etc.). I don't think an LLM can "undo" the probabilities it has already committed to like this (maybe it can?) and then generate infinitely (a toy version of this loop is sketched after point 2). But the other problem is…
2) The human has a world model whose concepts are grounded in the real world, and each concept carries a lot of information (Trump, burger, Boeing, "passenger": all of these have a lot of related information in any person's brain), and the general reasoning ability has multiple "levels", whereas the LLM only has one way to reason, which is to indirectly get the reasoning from the training data, or, as in the programming language case, to generalize some concept from another concept and then apply it to the current one. For an LLM, every ability to reason is secondary to the total possible probability distribution in the training data, and in a sense to however many distinct patterns of reasoning objectively exist in that data (kind of "top down"), whereas a human is almost bottom-up. A human can zoom in on "burger" and explain all the facts, use cases, history, etc. of a burger, and reason about its real-world implications (its size, its taste, and even whether other animals would like it – "dogs like meat, will a dog like a burger?").
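Here is the toy version of the generation loop from point 1: once a token is emitted it becomes part of the context and cannot be undone, so an early choice commits the model to one branch of continuations. The transition table and prompt are invented; a real model samples from a learned distribution rather than a hand-written one.

```python
# A hand-written "next token" table standing in for the model's probabilities.
# Each emitted token narrows what can follow, and there is no going back.
import random

NEXT = {
    "the":        ["war", "president"],
    "war":        ["in"],
    "in":         ["iraq", "washington"],
    "iraq":       ["began"],
    "washington": ["ended"],
    "president":  ["flew"],
    "flew":       ["to"],
    "to":         ["washington"],
}

def generate(prompt, seed):
    random.seed(seed)
    tokens = prompt.split()
    while tokens[-1] in NEXT:
        tokens.append(random.choice(NEXT[tokens[-1]]))  # commit to one branch
    return " ".join(tokens)

# Different early choices lead to different, fully committed continuations.
for seed in (0, 1, 2):
    print(generate("the", seed))
```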
In theory, if you can figure out the best generalizable fine tuning examples and training data, and cover as many concepts as possible, you should be able to cover everything, or close to it (very speculative). Every generalizable computation humans have also fundamentally comes down to some procedure/algorithm over some set of data, and almost in an OOP fashion, not only does every piece of data have its associated algorithms, but humans can also try existing algorithms on new, until-now untried data (opening a wav file in notepad), and so on. So a big part of LLMs, I think, is to figure out a list of all the computations available for each concept/object and then figure out texts that can relate those computations and objects.
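As a very loose sketch of the "data plus its associated algorithms" framing (all names and data below are invented), each concept bundles some data with its usual operations, and nothing stops us from applying one concept's operation to another concept's data, like opening a wav file in notepad.

```python
# Concepts as data bundled with operations, plus one deliberately "wrong"
# pairing: a text-file operation applied to wav-file bytes.

def play_audio(data):
    return f"playing {len(data)} bytes of audio"

def open_in_notepad(data):
    return f"showing raw contents: {data[:16]!r}..."

concepts = {
    "wav_file":  {"data": b"RIFF\x00\x00\x00\x00WAVEfmt ", "ops": [play_audio]},
    "text_file": {"data": b"hello world",                  "ops": [open_in_notepad]},
}

# The usual pairing: each concept's own operation on its own data.
for name, c in concepts.items():
    print(name, "->", c["ops"][0](c["data"]))

# The until-now-untried pairing.
print("wav_file ->", open_in_notepad(concepts["wav_file"]["data"]))
```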
Well, with this rather long post, I'm having trouble remembering whether I covered everything I wanted to write, but there we are.