In my previous two blog posts, I explained, as I would explain to my mum, what AI is and how machines can understand our words. In this blog, I explain heuristically (non-technically, with common sense) how machines can not only understand what we are saying, but also respond intelligently and create answers that were not there before (Gen AI).
One more little detour
For those of you who have played with ChatGPT or something similar (Bard?), one of the things that puzzles people is the concept of ‘tokens’. Some of you may ask, since I claim that machines can understand human words well enough, what is this token thing? Are we gambling when using these ‘tools’?
Gambling? Yes, maybe… Some examples of funky ChatGPT (and Bard) results:
- ChatGPT making up cases to support its argument (1)
- ChatGPT making up cases of sexual harassment and naming the supposed perpetrator (2)
- Bard making a mistake regarding the James Webb Telescope (3)
There are ways to mitigate these issues, but this is beyond the scope of this blogpost. Suffice to say that such models do give information that may be less than reliable. But then again, they were not designed to ‘tell the truth, the whole truth, and nothing but the truth’.
ChatGPT/Bard are not designed to tell the truth?! I thought they were AI!
The answer to the question lies in the question itself. These are AI systems, and as I mentioned in my 1st blog of the series, such models learn from (are trained on) the data they are fed. Secondly, it may help to understand a bit how these systems, called LLMs (Large Language Models), work.
How does ChatGPT/Bard… work?
Let me start with a word game. How many of you play Wordle (4)? Basically, every day a 5-letter word is chosen, and you have to guess it without any clue, in 6 tries. All you will ever know is whether a letter you have suggested exists in the answer but is in the wrong slot (yellow), is in the correct spot (green), or does not exist at all (black). The other condition is that any combination of letters you try has to be an existing word.
The thing is, most people, once they know the position of one letter, will try to guess the letters next to it based on what they know about the English language, for example (5):
- ‘E’ is the most common letter in English and your best bet if you know nothing about the word; it is followed by ‘T’ and ‘A’.
- If there is a ‘Q’, chances are there is also a ‘U’, and chances are the ‘U’ follows the ‘Q’.
- If there is a ‘T’ anywhere except in the 5th position, then the next letter is most likely an ‘H’, then an ‘O’, then an ‘I’.
- If there is an ‘H’ anywhere except in the 5th position, then the next letter is most likely an ‘E’, then an ‘A’, then an ‘I’.
Combinations of 2 letters such as ‘QU’, ‘TH’, ‘TO’, ‘TI’ are called bigrams. The idea is that once you know a letter, you use this information to find the most likely following letter. This is known as conditional probability: on the condition that one letter is a ‘T’, the most likely following letter is an ‘H’, not an ‘E’ (the most common letter in English overall). The key is that your choice of letter changes based on the information you have.
These are shortcuts, findings based on analysis of words, that can help you guess the letters in Wordle.
| Bigram Popularity | English | French | Spanish |
| --- | --- | --- | --- |
| 1 | TH | ES | DE |
| 2 | HE | LE | ES |
| 3 | IN | DE | EN |
| 4 | ER | EN | EL |
| 5 | AN | ON | LA |
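If you are curious what counting bigrams and computing a conditional probability looks like in practice, here is a tiny sketch in Python. The word list is just a made-up stand-in for a real frequency table like the one above:

```python
from collections import Counter

# A tiny corpus of English words (a made-up stand-in for real frequency data).
words = ["the", "that", "this", "quit", "quiz", "other", "think", "tooth"]

# Count every pair of adjacent letters (a bigram) across the corpus.
bigrams = Counter()
for w in words:
    for a, b in zip(w, w[1:]):
        bigrams[(a, b)] += 1

# Conditional probability: given a 't', how likely is each next letter?
t_followed = {b: n for (a, b), n in bigrams.items() if a == "t"}
total = sum(t_followed.values())
for letter, n in sorted(t_followed.items(), key=lambda kv: -kv[1]):
    print(f"P({letter!r} after 't') = {n/total:.2f}")
```

Even on this toy corpus, ‘H’ comes out as by far the most likely letter after a ‘T’, which is exactly the shortcut a Wordle player uses.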
Letters are fine, but Gen AI generates whole documents, not random letters
It’s just an extension of the idea. In the example above I used bigrams (2 letters); when playing Wordle, some people may choose trigrams (3 letters). It’s basically the same thing, just a little bit more complex.
The next step, then, is that instead of guessing the next letter (using a bigram), you guess the next word. But why stop there? You can go beyond a bigram and condition on multiple previous words. It is, in principle, that straightforward. However, to improve performance, there are a few more tricks.
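The jump from letters to words can be sketched the same way: count which word tends to follow which, then predict the most frequent one. The training sentence below is made up, of course; real models train on vastly more text:

```python
from collections import Counter, defaultdict

# A made-up "training text"; real LLMs train on far larger corpora.
text = "the cat sat on the mat and the cat ate the fish".split()

# Count word bigrams: for each word, what tends to come next?
following = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    following[a][b] += 1

def most_likely_next(word):
    """Return the word most often seen after `word` in the training text."""
    return following[word].most_common(1)[0][0]

print(most_likely_next("the"))  # 'cat': it follows 'the' twice, more than 'mat' or 'fish'
```

This is a sketch of the principle only; an LLM does not store a giant lookup table like this, which is precisely the problem the next paragraph describes.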
The problem is the size of the data: given the number of words, the possible combinations increase exponentially as you add more words. One of the more brilliant things about LLMs is that they generate a probability of a combination of words occurring. They do that by using an underlying model that recognises the patterns.
AI, NN, and the human brain
As mentioned in Part 1 of this blog, AI is about making a machine think like a human. The way this has been done in Neural Networks is to make a representation (model) of the human brain, with nodes and connections. And as is thought to happen in the human brain, each node does a fairly simple job (one of the simplest is a binary yes/no decision against a threshold, in which case the node is called a perceptron), and the connections between nodes are given weights based on how important they are.
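A perceptron really is that simple: multiply each input by its weight, add them up, and compare against a threshold. Here is a minimal sketch (the feature names, weights and threshold are all made up for illustration):

```python
# A single perceptron: weighted sum of inputs, compared against a threshold.
def perceptron(inputs, weights, threshold):
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# A hypothetical "chair detector": inputs could be (has_backrest, leg_length_ok),
# with the backrest weighted as more important.
weights = [2.0, 1.0]
print(perceptron([1, 1], weights, threshold=2.5))  # 1 -> "chair"
print(perceptron([0, 1], weights, threshold=2.5))  # 0 -> "not a chair"
```

Training a network is essentially the process of finding good values for those weights automatically, rather than hand-picking them as I did here.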
Note that as Neural Nets have progressed, they have taken on a life of their own, and the idea of mimicking the structure of the human brain is no longer central; the architecture of neural nets, while still using nodes and connections, can differ.
Going back to the chair and table example
When you show the machine a picture, it breaks it down into small parts of the picture (features), maybe the length of a leg or the shape of the back, and assigns weights based on how important these features are. After being trained on many examples, the model is ready to distinguish between a table and a chair.
The illustration above shows a very simple type of Neural Network: one input layer where you start, one hidden layer of nodes with connections in one direction to do the magic, and then the output layer. For the table/chair classification from images, for example, it has been found that neurons arranged in a grid formation work well, specifically a Convolutional Neural Network (CNN). Basically, a set of filters is applied to detect specific patterns in the picture (convolutions); these are then summarised and combined (more layers) to extract the more salient features without burning enormous resources, and finally pushed to the output layer. In the case of our chair/table classification, there would be 2 nodes in the output layer, the output being the probability that the image fed in is a chair or a table. (9)
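A convolution sounds fancier than it is: you slide a small filter over the image and, at each position, sum the element-wise products; a high output means that patch looks like the pattern the filter encodes. A toy sketch, with a made-up 3×4 "image" whose right half is bright and a hand-written vertical-edge filter:

```python
# Slide a kernel over an image; each output value is the sum of
# element-wise products between the kernel and one image patch.
def convolve(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

# Tiny "image": dark left half (0s), bright right half (1s).
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
# A vertical-edge detector: responds where brightness jumps left-to-right.
kernel = [[-1, 1],
          [-1, 1]]
print(convolve(image, kernel))  # [[0, 2, 0], [0, 2, 0]] -> peaks right on the edge
```

In a real CNN the filters are not hand-written like this; they are learned during training, and there are many of them stacked in layers.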
There are many ways to structure a neural net, many parameters to play with. You won’t be surprised that one of the important innovations came from the fact that, for processing text, it is important to know what else is in the sentence, and not to process each word independently. So there was a need to be able to refer back to past words. Long Short-Term Memory (LSTM) (10) allowed this to happen by letting the user control how long some nodes would retain information, which could then be used to provide context.
However, LSTM is not that fast, as it processes information sequentially, like many of us do when we read word by word (11). In 2017, a team from Google came up with the brilliantly entitled paper “Attention is all you need” (12). This gave rise to the rise of the Decepticons (13), sorry, to Transformers (14). Basically, when processing a chunk of text, the machine calculates weights using an attention network, working out which words need to be given a higher weight. While Transformers can be run sequentially, they can also be run in parallel (no recursion), hence the usefulness of GPUs in LLMs.
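The core of attention can be sketched in a few lines: score one word’s embedding against every word in the chunk, then turn the scores into weights that sum to 1. Be warned that this is heavily simplified; real Transformers learn separate query/key/value projections and run many attention heads in parallel, and the 2-number "embeddings" below are entirely made up:

```python
import math

def softmax(xs):
    """Turn arbitrary scores into positive weights that sum to 1."""
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Made-up 2-number "embeddings" for three words (purely illustrative).
words = ["the", "bank", "river"]
vectors = [[0.1, 0.2], [0.9, 0.3], [0.8, 0.4]]

# Attention for "bank": score it against every word, softmax into weights.
query = vectors[1]
scores = [dot(query, v) for v in vectors]
attn = softmax(scores)
for w, a in zip(words, attn):
    print(f"attention('bank' -> {w!r}) = {a:.2f}")
```

Note that every word’s weights can be computed independently of the others, which is exactly why this parallelises so well on GPUs.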
To answer a friend’s question, GPUs are not necessary in LLMs, but they really speed things up. (15)
Is LLM therefore just a better chatbot?
You must be thinking that LSTM is something that has been used in Chatbots before, and LLMs, as I have explained here, basically just answer your queries…
Actually, no. One huge difference between chatbots and LLMs is how they learn. LLMs use reinforcement learning (I sneakily introduced this in Part 1 of this series; there is even RLHF, Reinforcement Learning from Human Feedback). Also, the volume and diversity of the data they have traditionally been trained on is vastly different. LLMs can ‘talk’ about many more topics/intents than a traditional chatbot, which is usually more focused.
However, the comparison with a chatbot is an interesting one. The interest in LLMs really took off with GPT-3.5. As the name suggests, it is not the 1st offering in OpenAI’s GPT family. So what made GPT-3.5 garner so much interest (GPT-1 was released in 2018, GPT-2 in 2019, GPT-3 in 2020, and GPT-3.5 in 2022 (16))? One reason is that the models suddenly improved; the second is that a friendly chat interface was included, allowing virtually anybody with an internet connection to play with it and become an instant advocate.
A few more points
GenAI, here LLMs, basically smartly and quickly processes word/token embeddings to understand you and produce a response. The key to understanding these models, as I mentioned earlier, is to know that they are not designed to give you the truth; they answer the question “what would a likely answer be?”. Actually, not only that: GenAI gives you the likely answer of an average person (thank you, Doc, for pointing this out clearly). Think about it: if it is trained on the whole internet and ranks the most likely answer, then the most likely answer may not be that of people who really know what they are talking about. Hence my thought that LLMs can help so-so coders, but expert coders may not be helped that much; they probably know better.
Questions to ponder:
- Do you believe that logic is something that is common in humankind? Is common sense really that common?
- How about Maths, do you believe that people are generally good or bad at Maths?
- Why am I asking this? Simple. Now tell me: do you think LLMs are good at logic? At Maths?
Is most likely always the best?
Now, there’s one more thing: you can influence what GenAI responds to you. I mentioned that these models basically rank all possible words and pick one; maybe your first instinct is to always pick the highest-probability word. That would give you consistent answers over time. However, always using the highest-probability response often leads to circular and less than satisfactory answers. Hence, most people choose to allow some randomness (ChatGPT calls this temperature (17)).
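Here is a toy sketch of what temperature does to that pick. The candidate words and their probabilities are made up, and this is a simplification (real models apply temperature to raw scores before the softmax), but the effect is the same: low temperature means near-greedy picks, high temperature flattens the distribution:

```python
import math
import random

# Hypothetical next-word probabilities from a language model (made up).
candidates = {"cat": 0.6, "dog": 0.3, "platypus": 0.1}

def sample_with_temperature(probs, temperature):
    # Low temperature sharpens the distribution (almost always the top word);
    # high temperature flattens it (more randomness, rarer words show up).
    scaled = {w: math.exp(math.log(p) / temperature) for w, p in probs.items()}
    total = sum(scaled.values())
    words = list(scaled)
    return random.choices(words, weights=[scaled[w] / total for w in words])[0]

print(sample_with_temperature(candidates, temperature=0.1))  # almost always 'cat'
print(sample_with_temperature(candidates, temperature=2.0))  # 'platypus' is now plausible
```

Run the high-temperature line a few times and you will see the answers change between runs, which is exactly the behaviour you observe when you ask ChatGPT the same question twice.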
Conclusion:
GenAI is a great tool (what you can do with GenAI, whether you are from an SME, an individual looking to make your own life easier, or a large organisation, may be a topic for a next blog). What it does is come up with a possible answer based on the data it has been trained on. (Actually, another blog post could be about why GenAI is not the answer to everything, but that’s probably obvious.)
1. https://www.channelnewsasia.com/business/new-york-lawyers-sanctioned-using-fake-chatgpt-cases-legal-brief-3581611
2. https://www.businesstoday.in/technology/news/story/openai-chatgpt-falsely-accuses-us-law-professor-of-sexual-harassment-376630-2023-04-08
3. https://www.bbc.com/news/business-64576225
4. https://www.nytimes.com/games/wordle/index.html
5. http://practicalcryptography.com/cryptanalysis/letter-frequencies-various-languages/english-letter-frequencies/
6. http://practicalcryptography.com/cryptanalysis/letter-frequencies-various-languages/french-letter-frequencies/
7. http://practicalcryptography.com/cryptanalysis/letter-frequencies-various-languages/spanish-letter-frequencies/
8. You can also adjust how you penalise mistakes, known as the loss function; so that’d be a 4th way.
9. https://www.simplilearn.com/tutorials/deep-learning-tutorial/convolutional-neural-network
10. http://www.bioinf.jku.at/publications/older/2604.pdf
11. LSTM evolved from Recurrent Neural Networks (RNN), where the idea was that you can look back at information you processed earlier (hence “recurrent”); however, if the information was far back, there were problems referring to it.
12. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
13. https://tfwiki.net/wiki/Rise_of_the_Decepticons
14. https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)
15. Memorable demo of CPU vs GPU: https://www.youtube.com/watch?v=-P28LKWTzrI
16. https://en.wikipedia.org/wiki/GPT-4
17. https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api-a-few-tips-and-tricks-on-controlling-the-creativity-deterministic-output-of-prompt-responses/172683