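Grouped-Query Attention (GQA) is a technique used in large language models to make attention cheaper at inference time: several query heads share a single key/value head, which shrinks the key/value cache and the memory traffic needed to read it, while keeping output quality close to that of standard multi-head attention. The savings matter more as model size and context length grow.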
Analogy: imagine a teacher answering questions:
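- Standard Attention: The teacher prepares a separate set of reference notes for every question, no matter how much those notes overlap.
- Grouped-Query Attention: The questions are split into fixed groups, and the teacher prepares one set of notes per group that every question in the group shares. Each question still gets its own answer, but far fewer notes have to be written and stored. This saves time and energy. In the model, the questions are query heads and the notes are key/value heads; the groups are fixed by the architecture, not chosen by how similar the questions are.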
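
To make the sharing concrete, here is a minimal sketch in plain Python of grouped-query attention for a single query position. It is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, there is no masking or batching, and it assumes consecutive query heads are assigned to the same key/value head, which is one common convention.

```python
import math


def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Return the index of the key/value head that query head `q_head` uses.

    Assumption: consecutive query heads form a group (hypothetical helper).
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be divisible by num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def grouped_query_attention(queries, keys, values):
    """Attention for one query position.

    queries: [num_q_heads][head_dim]
    keys, values: [num_kv_heads][seq_len][head_dim]
    returns: [num_q_heads][head_dim]
    """
    num_q_heads = len(queries)
    num_kv_heads = len(keys)
    head_dim = len(queries[0])
    scale = 1.0 / math.sqrt(head_dim)

    outputs = []
    for h, q in enumerate(queries):
        # Every query head in the same group reads the same keys and values.
        g = kv_head_for_query_head(h, num_q_heads, num_kv_heads)
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys[g]]
        weights = softmax(scores)
        out = [
            sum(w * v[d] for w, v in zip(weights, values[g]))
            for d in range(head_dim)
        ]
        outputs.append(out)
    return outputs


# 4 query heads share 2 key/value heads: heads 0-1 use KV head 0, heads 2-3 use KV head 1.
mapping = [kv_head_for_query_head(h, 4, 2) for h in range(4)]
```

With `num_kv_heads` equal to `num_q_heads` this reduces to standard multi-head attention, and with `num_kv_heads = 1` it becomes multi-query attention. The key/value cache shrinks by a factor of `num_q_heads / num_kv_heads`, since only one set of keys and values is stored per group.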