Here is the next installment of the series about transformers and large language models, breaking down the attention mechanism (specifically multi-headed self-attention).
With all these videos, there’s always a push and pull between adding more detail to preemptively answer anticipated questions, and keeping things more direct and uncluttered. I’ve generally been erring on the side of more detail, at least by the standards of YouTube explainers, but many things were still cut out before animating.
One was a motivating example of why having many attention blocks in a series might be helpful. Consider the phrase "The glass ball fell on the steel table, and it shattered" (I think this example was in my head from the 99% Invisible episode The ELIZA Effect). How much should the words “ball” and “table” attend to the word “shattered”, say in an attention head trying to relate verbs to their subjects? If this were the earliest attention block, able to see only the meanings and positions of the three words in isolation, it’s not so clear. The positions and meanings of “ball” and “table” alone don’t give either much of an advantage.
But suppose that before reaching this attention head, these word embeddings had gone through an attention block which let "ball" absorb some of the meaning of "glass", and "table" some of the meaning of "steel". After that, a later attention head has a much better chance of recognizing that the connection to "ball" should be stronger, assuming the model has also learned, more generally, a connection between glass and shattering.
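To make that concrete, here’s a toy numerical sketch. Everything in it is made up for illustration: tiny 4-dimensional embeddings with hand-labeled axes and a hand-built query/key pair, rather than anything learned by a real model. The point is just the mechanism: before "ball" has absorbed anything from "glass", its query gives "shattered" no more attention weight than "table" does; afterwards, it does.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Toy, hand-built 4-dimensional embeddings (real models learn these in
# thousands of dimensions). The axes are purely illustrative:
#   dim 0: "is a physical object"
#   dim 1: "glassy / fragile"
#   dim 2: "steely / hard"
#   dim 3: "related to breaking"
emb = {
    "ball":      np.array([1.0, 0.0, 0.0, 0.0]),
    "table":     np.array([1.0, 0.0, 0.0, 0.0]),
    "glass":     np.array([0.0, 1.0, 0.0, 0.0]),
    "steel":     np.array([0.0, 0.0, 1.0, 0.0]),
    "shattered": np.array([0.0, 0.0, 0.0, 1.0]),
}

# A hypothetical head whose queries ask "am I fragile?" and whose keys
# answer "I am about breaking", both mapped into a 2-dimensional head space.
d_head = 2
W_Q = np.zeros((4, d_head))
W_Q[1, 0] = 1.0   # queries read the "fragile" axis
W_K = np.zeros((4, d_head))
W_K[3, 0] = 1.0   # keys read the "breaking" axis

def attention_to_shattered(noun_vec):
    """How much weight this noun's query puts on 'shattered' (vs. the other tokens)."""
    q = noun_vec @ W_Q
    keys = np.stack([emb["ball"] @ W_K, emb["table"] @ W_K, emb["shattered"] @ W_K])
    weights = softmax(keys @ q / np.sqrt(d_head))
    return weights[2]

# Before any earlier block has acted, "ball" and "table" look identical:
print(attention_to_shattered(emb["ball"]))    # ~0.33
print(attention_to_shattered(emb["table"]))   # ~0.33

# After an earlier block lets "ball" absorb some of "glass",
# and "table" absorb some of "steel":
ball_after  = emb["ball"]  + emb["glass"]
table_after = emb["table"] + emb["steel"]
print(attention_to_shattered(ball_after))     # ~0.50, noticeably higher
print(attention_to_shattered(table_after))    # still ~0.33
```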
Another bit cut from the final version involved reflecting on how, as data flows through all these layers, an embedding that started out associated with one word may not necessarily retain any meaning tied to that word. In our adjective-noun example, after the noun embeddings have ingested the meanings of the adjectives, you could imagine how the embeddings for those adjectives are now effectively free and available for extra processing.
Exactly how the transformer uses these embeddings is, again, not at all clear, and dissecting it is an active area of research. A number of people describe the 12,288-dimensional embeddings (or whatever d_model is for a given transformer) as analogous to working memory, with the transformer reading from and writing to various subspaces of that space.
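One way to picture that "working memory" analogy, very much as a toy illustration rather than a description of how any particular model is known to work: random low-dimensional subspaces of a high-dimensional space are nearly orthogonal, so one block can add ("write") a small vector into its own subspace, and a later block can project the stream back onto that subspace ("read") with little interference from what other blocks have written. The dimensions, subspaces, and messages below are all made up, and d_model is shrunk from 12,288 to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512   # stand-in for the real 12,288

def random_subspace(dim):
    # Orthonormal basis for a random dim-dimensional subspace of the residual stream.
    basis, _ = np.linalg.qr(rng.normal(size=(d_model, dim)))
    return basis   # shape (d_model, dim), columns are orthonormal

A = random_subspace(4)   # subspace one block writes into
B = random_subspace(4)   # subspace a different block writes into

residual = rng.normal(size=d_model) * 0.1   # the token's current embedding

# "Write": each block adds its 4-dimensional message, embedded into its own subspace.
msg_a = np.array([1.0, -2.0, 0.5, 3.0])
msg_b = np.array([0.3,  0.7, -1.0, 2.0])
residual = residual + A @ msg_a + B @ msg_b

# "Read": a later block projects the stream back onto subspace A and
# approximately recovers the first message, largely ignoring the second,
# because the two random subspaces are nearly orthogonal.
recovered = A.T @ residual
print(np.round(recovered, 2))   # close to msg_a
```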