To encode the position of the current token in the sequence, the authors take the token's position (a scalar i in [0, 2047]) and pass it through 12288 sinusoidal functions, each with a different frequency.
The exact reason why this works is not entirely clear to me. The authors explain it as yielding many relative-position encodings, which is useful for the model. Other possible mental models for analyzing this choice: consider the way signals are often represented as sums of periodic components (see Fourier transforms, or the SIREN network architecture), or the possibility that language naturally presents cycles of various lengths (for example, poetry).
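As a concrete illustration, here is a minimal sketch of one such sinusoidal encoding. The source only says "12288 sinusoidal functions, each with a different frequency", so the specific frequency schedule below (the base-10000 sine/cosine pairs from the original Transformer paper) is an assumption, and the function name is just for illustration; the sizes (d_model = 12288, positions 0-2047) match the figures quoted above.

```python
import numpy as np

def positional_encoding(position: int, d_model: int = 12288) -> np.ndarray:
    """Sinusoidal encoding of a single token position.

    Assumes the base-10000 frequency schedule from "Attention Is All You Need":
    even dimensions get a sine, odd dimensions get a cosine, with one shared
    frequency per sin/cos pair.
    """
    pair = np.arange(d_model // 2)                  # one frequency per sin/cos pair
    freqs = 1.0 / (10000 ** (2 * pair / d_model))   # geometrically spaced frequencies
    angles = position * freqs
    enc = np.empty(d_model)
    enc[0::2] = np.sin(angles)                      # even dims: sine
    enc[1::2] = np.cos(angles)                      # odd dims: cosine
    return enc

# Example: encode position 7 of a 2048-token sequence
vec = positional_encoding(7)
print(vec.shape)  # (12288,)
```

One reason this supports the "relative positions" explanation: for a fixed offset k, the encoding of position i + k can be written as a linear transformation of the encoding of position i, so relative offsets are easy for the model to pick up.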
dugas.ch | The GPT-3 Architecture, on a Napkin