LLM の核心をやさしく解説

A gentle guide to the core of LLMs

イメージでわかる Attention

Visual Attention for Beginners

ChatGPT や LLaMA のような大規模言語モデル（LLM）の中心にある「Attention（注意機構）」を、遠くの単語どうしの関係を見る仕組みとして、数式の前にイメージで理解するページです。

This page builds an intuition for Attention — the core mechanism of LLMs like ChatGPT and LLaMA — as a way of relating distant words, before touching any maths.

📚 このページを読む前に：「畳み込み」や「フィルタ」が何かを知っていると理解が深まります。まだの方はイメージでわかる深層学習を先に読むことをおすすめします。行列の計算が気になる方はイメージでわかる線形代数も参考にどうぞ。
📑 数式の記号が不安な方は数式の読み方ガイドもあります。

📚 Before you read: This page builds on convolution and filters from Visual Deep Learning. If matrix arithmetic is unfamiliar, Visual Linear Algebra covers the groundwork.
📑 Unsure about notation? See How to Read Math Notation.

① なぜ Attention が必要なの？

① Why do we need Attention?

畳み込みは「小さい窓の中だけを見る」演算です。画像のように「隣のピクセルとの関係」だけ見ればよい場面では十分ですが、文章では事情が違います。

Convolution looks only within a small local window. For images — where only neighboring pixels matter — that's fine. But language is different.

例文で考えてみよう：

Consider this example:

「太郎はケーキを食べた。そのあと彼は眠った。」

"Taro ate the cake. After that, he fell asleep."

「彼」が誰かを理解するには、ずっと前の「太郎」を参照しなければなりません。小さい窓では届かない「遠距離の関係」です。

To understand who "he" is, you must reach back to "Taro" earlier in the text — a long-range relationship that a small window cannot capture.

文章・コード・音楽など「系列データ」には、こういった遠距離の依存関係がたくさんあります。これを解決するために生まれたのが Attention（注意機構）です。

Sequential data — language, code, music — is full of such long-range dependencies. Attention was designed to solve exactly this problem.

② Attention のざっくりイメージ

② The big picture of Attention

Attention をひとことで言うと、こうです：

In one sentence, Attention is:

「今この単語を処理するとき、文中のすべての単語を参照して、関係が深そうな単語ほど強く注目する仕組み」

"When processing the current word, look at all words in the sequence and pay more attention to the ones that seem most relevant."

畳み込みが「窓の中だけ」を見るのに対し、Attention は距離に関係なく文全体を一度に見ます。「彼」という単語を処理するとき、文の先頭にある「太郎」にも強く注目できます。

Where convolution looks only "inside the window," Attention sees the entire sequence at once, regardless of distance. When processing "he," it can strongly attend to "Taro" at the very beginning of the sentence.
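
To make the contrast concrete, here is a minimal sketch in plain Python. The token split, the window radius of 2, and the helper name `local_window` are illustrative assumptions, not part of any real model.

```python
# Tokens of the example sentence (a hand-made split, for illustration only).
tokens = ["Taro", "ate", "the", "cake", ".",
          "After", "that", ",", "he", "fell", "asleep", "."]


def local_window(position, radius, length):
    """Indices that a convolution-style window centred on `position` can see."""
    start = max(0, position - radius)
    end = min(length, position + radius + 1)
    return list(range(start, end))


he = tokens.index("he")

# A small window around "he" only reaches its immediate neighbours.
window_view = [tokens[i] for i in local_window(he, radius=2, length=len(tokens))]

# Attention may reference every position, no matter how far away it is.
attention_view = list(tokens)

print(window_view)               # ['that', ',', 'he', 'fell', 'asleep']
print("Taro" in window_view)     # False
print("Taro" in attention_view)  # True
```

The window never reaches "Taro", while the full-sequence view contains it. Deciding how strongly to look at each visible word is the job of the weights described next.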

③ 「重みづけ」ってどういう意味？

③ What does "weighting" mean?

Attention は全単語を「同じ強さで」見るわけではありません。関係が深そうな単語を強く見て、無関係な単語は弱く見ます。この「注目の強さ」を数字にしたものが注目の重み（Attention Weight）です。

Attention doesn't look at all words with equal strength. It looks strongly at relevant words and weakly at irrelevant ones. The number that represents this "strength of focus" is called an Attention Weight.

たとえるなら：

Analogy:

図書館で調べものをするとき、手元の本（注目したい単語）に関係する棚（他の単語）を見に行きます。「この棚はすごく関係ある！」「この棚はあまり関係ない」と重要度を判断しながら情報を集める感覚です。

Imagine researching in a library: from the book you're reading (the current word), you visit shelves (other words) around the room. You judge "this shelf is very relevant!" vs "this one not so much" — and collect information weighted by relevance.

④ softmax ってなに？

④ What is softmax?

各単語ペアの「関係の強さ」に点数をつけます。でも、そのままでは点数の大きさがバラバラで使いにくいです。

A "relevance score" is computed for every word pair. But raw scores are awkward to work with as-is — they're just arbitrary numbers of different sizes.

そこで softmax の出番です。softmax は、バラバラの点数を 「どの単語にどれだけ注目するか」の振り分け に変える仕組みです。関係が深い単語には大きく、あまり関係ない単語には少しだけ注目が配られます。

That's where softmax comes in. softmax turns those uneven scores into a distribution of attention — assigning a lot of attention to relevant words and very little to unrelated ones.

おこづかいのたとえ：

The allowance analogy:

「注目ポイント」が 10 枚あるとして、「太郎」「ケーキ」「食べた」に配るとします。点数が高かった「太郎」にはたくさん、低かった「食べた」にはほんの少しだけ配ります。

Imagine 10 "attention tokens" to hand out among "Taro", "cake", and "ate". Most go to "Taro" (highest score), very few to "ate" (lowest score).

softmax は、ちょうどこういう「注目の配り方」を自動で決める仕組みです。

softmax automates exactly this kind of distribution.

点数 → 注目の振り分け

Scores → attention distribution

点数（バラバラな数字）： 太郎: 3.8　ケーキ: 1.4　食べた: 0.9

Raw scores (uneven numbers): Taro: 3.8, cake: 1.4, ate: 0.9

↓ softmax（大きいほど多く配る）

↓ softmax (bigger score → more attention)

注目の配り方： 太郎: 87%　ケーキ: 8%　食べた: 5%

Attention distribution: Taro: 87%, cake: 8%, ate: 5%

「彼」を処理するとき、注目の 87% を「太郎」に向ける、というイメージです。全部合わせると 100% ぶん配りきった状態になります。

When processing "he," 87% of attention goes to "Taro." Together all shares add up to 100% — all the attention has been distributed.

なぜ「全部で 100%（＝合計 1）」になるように揃えるかというと、あとで「注目の強さに比例して情報を混ぜる」計算をするためです。全部配りきった状態にしておくと、その計算がきれいにまとまります。

The reason the shares are kept at 100% (sum = 1) is that the next step blends information in proportion to these shares. Having them sum to 1 makes that blending step clean and well-defined.
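
As a rough sketch of what softmax computes, the plain-Python function below exponentiates each score and divides by the total. The three scores are hypothetical values chosen for illustration so that the shares come out near the 87%, 8%, and 5% used in the text.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into non-negative shares that sum to 1."""
    # Subtracting the largest score does not change the result,
    # but keeps exp() from overflowing on large inputs.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for "Taro", "cake", "ate" (illustrative values).
scores = [3.8, 1.4, 0.9]
weights = softmax(scores)

percentages = [round(w * 100) for w in weights]
print(percentages)   # [87, 8, 5]
print(sum(weights))  # 1.0, up to floating-point rounding
```

Because softmax uses exponentials, a gap between two scores turns into a ratio between their shares, so a moderately higher score already takes most of the attention.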

⑤ Query・Key・Value — 3つの役割

⑤ Query, Key, Value — three roles

Attention を計算するとき、各単語は3つの役割を同時に担います。

When computing Attention, each word plays three roles simultaneously.

記号	名前	役割のイメージ	図書館の比喩
Q	Query（質問）	「自分はどんな情報を欲しいか？」	図書館で調べたいテーマ
K	Key（目印）	「自分はどんな情報を持っているか？（の要約）」	各棚の背表紙ラベル
V	Value（内容）	「実際に渡す情報の中身」	棚の中の本の内容

Symbol	Name	Role	Library analogy
Q	Query	"What information am I looking for?"	The research topic you bring to the library
K	Key	"What information do I hold?" (a summary)	The label on each shelf's spine
V	Value	"The actual content I pass along"	The actual books on the shelf

Q と K を比べてスコアをつける

Compare Q and K to get a score

「自分が欲しい情報（Q）」と「相手が持っている情報の要約（K）」を内積（ドット積）で比較します。似ているほどスコアが高くなります。

The Query (what you're looking for) is compared with each Key (what others hold) via dot product. The more similar they are, the higher the score.

softmax でスコアを注目の配分に変える

softmax turns scores into an attention distribution

前のステップのスコアを softmax に通して「どこにどれだけ注目するか」の配分に変えます。関係が深い単語ほど大きな値になります。

The scores from step 1 go through softmax, which converts them into an attention distribution — more relevant words get a larger share.

注目割合で V を加重平均する

Weight-average the Values by attention weights

「関係が深い単語ほど強く参照する」ように、注目割合に応じて各単語の Value を混ぜ合わせます。これが Attention の出力です。

Values are blended proportionally to attention weights — more relevant words contribute more to the output. This mixture is what Attention outputs.
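
The three steps above can be sketched for a single word in plain Python. The 2-dimensional Query, Key, and Value vectors below are made-up numbers for illustration; in a real model they are learned, and the function name `attend` is hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Run the three steps for one query word."""
    # Step 1: compare the Query with every Key to get one score per word.
    scores = [dot(query, k) for k in keys]
    # Step 2: turn the scores into an attention distribution.
    weights = softmax(scores)
    # Step 3: blend the Values in proportion to the weights.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Made-up vectors: the query stands for "he"; the rows stand for
# "Taro", "cake", and "ate" in that order.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

weights, output = attend(query, keys, values)
print([round(w, 2) for w in weights])  # [0.84, 0.11, 0.04]
print([round(x, 2) for x in output])   # [0.84, 0.11]
```

The Key for "Taro" points in the same direction as the Query, so it receives the largest weight and its Value dominates the output.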

⑥ いまの流れを式にすると

⑥ Putting the steps into a formula

ここまでの3ステップを順番に式にしてみます。いきなり全部わかる必要はありません。「さっきの3ステップがどこに対応しているか」だけ追えれば十分です。

Let's write those three steps as formulas one at a time. You don't need to understand everything at once — just trace where each of the three steps appears.

ステップ 1 — 関係の強さを調べる

Step 1 — measure relevance

$$\text{score} = QK^\top$$

「いま欲しい情報（Q）」と「各単語の目印（K）」を掛け合わせて、関係の強さスコアを作ります。

Multiply "what I'm looking for (Q)" by "each word's label (K)" to get a relevance score.

ステップ 2 — 注目の配分を決める

Step 2 — decide the attention distribution

$$\text{weights} = \text{softmax}(QK^\top)$$

スコアを softmax に通して、どこにどれだけ注目するかの配分に変えます。

Pass the scores through softmax to get how much to attend to each word.

ステップ 3 — 情報を集める

Step 3 — collect the information

$$\text{output} = \text{softmax}(QK^\top)\,V$$

配分にしたがって各単語の中身（V）を集めます。関係が深い単語からはたくさん、浅い単語からは少しだけ受け取ります。

Collect each word's content (V) according to the distribution — more from relevant words, less from unrelated ones.

3 ステップをまとめると

All three steps combined

$$\text{Attention}(Q,K,V) = \text{softmax}(QK^\top)\,V$$

この式は新しいことを何も加えていません。ステップ 1〜3 をそのまま 1 行にまとめただけです。

This adds nothing new — it's just steps 1–3 written on one line.
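
Written out for a single word, the one-line formula can be read as follows. The symbols $q_i$, $k_j$, $v_j$ (the rows of $Q$, $K$, $V$ belonging to words $i$ and $j$) and $w_{ij}$ are notation introduced here for illustration only:

$$w_{ij} = \frac{\exp(q_i \cdot k_j)}{\sum_{l} \exp(q_i \cdot k_l)}, \qquad \text{output}_i = \sum_{j} w_{ij}\, v_j$$

Here $w_{ij}$ is the share of attention that word $i$ gives to word $j$. For each $i$ the shares sum to 1, so $\text{output}_i$ is a weighted average of the Values.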

補足：$\sqrt{d}$ で割るのはなに？

Note: what is the $\sqrt{d}$ division?

実際の Transformer では式が少し違い、スコアを $\sqrt{d}$ で割ってから softmax に入れます：

In practice, the Transformer uses a slightly different formula: it divides the scores by $\sqrt{d}$ before applying softmax:

$$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

$Q$ や $K$ のベクトルの次元数が大きいほど内積の値が大きくなりやすく、softmax が極端な配分を出してしまうことがあります。$\sqrt{d}$（$d$ は $Q$・$K$ のベクトルの次元数、つまり成分の個数）で割ることでスコアを適度な大きさに抑えます。最初の理解では「スコアが暴れすぎないようにする調整」くらいで十分です。

The more dimensions the $Q$ and $K$ vectors have, the larger their dot products tend to be, which pushes softmax toward extreme distributions. Dividing by $\sqrt{d}$ (where $d$ is the number of dimensions, that is, the number of components in each $Q$ or $K$ vector) keeps scores in a reasonable range. For now, think of it as "a tweak to stop scores from getting too large."
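
A small numerical sketch of this effect, using hand-picked 64-dimensional vectors (the values are assumptions chosen so the arithmetic is easy to follow, not data from a real model):

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


d = 64  # number of dimensions of each Query/Key vector
query = [1.0] * d
keys = [[0.25] * d, [0.125] * d, [0.0] * d]

raw_scores = [dot(query, k) for k in keys]             # [16.0, 8.0, 0.0]
scaled_scores = [s / math.sqrt(d) for s in raw_scores]  # [2.0, 1.0, 0.0]

unscaled = softmax(raw_scores)
scaled = softmax(scaled_scores)

print([round(w, 3) for w in unscaled])  # [1.0, 0.0, 0.0]
print([round(w, 3) for w in scaled])    # [0.665, 0.245, 0.09]
```

Without the division, almost all attention collapses onto one word. After dividing by $\sqrt{64} = 8$, the ranking is unchanged but the other words still receive a visible share.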

⑦ Multi-head Attention ってなに？

⑦ What is Multi-head Attention?

ここまでの Attention は「1 種類の見方で文全体を参照する」仕組みでした。でも実際の文には、同時に気にすべき関係が複数あります。

The Attention we've seen so far uses one "view" of the sequence. But real language has multiple kinds of relationships worth tracking at the same time.

たとえるなら：

Analogy:

1 人の先生だけで文を読むのではなく、

Instead of one teacher reading the sentence, imagine:

文法のつながりを見る先生
a teacher who looks for grammatical connections
意味が近い単語を見る先生
a teacher who spots semantically similar words
代名詞が誰を指すかを見る先生
a teacher who tracks what pronouns refer to

が同時に文を読んでいる感じです。それぞれの先生が違う視点で関係を見つけ、最後に結果を合わせます。

All three read the sentence at the same time, each from a different angle, and then pool their findings.

Multi-head Attention は、Attention の計算を h 個の「head（頭）」で並列に行い、それぞれが異なる関係を学習できるようにした仕組みです。結果を最後に結合することで、1 種類の Attention よりも豊かに文の情報を捉えられます。

Multi-head Attention runs $h$ Attention computations ("heads") in parallel, each free to learn a different type of relationship. The results are concatenated at the end, giving a richer picture of the sequence than a single Attention head could provide.
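
The sketch below shows only the "split, attend in parallel, concatenate" structure. It is a simplification under stated assumptions: Q, K, and V are all set to the input itself, each head simply receives a slice of every vector, and the learned per-head projections that a real model uses to give heads different views are omitted. The function names are hypothetical.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention, one output row per query row."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs


def split_heads(vectors, num_heads):
    """Cut every vector into num_heads equal slices, grouped by head."""
    size = len(vectors[0]) // num_heads
    return [[vec[h * size:(h + 1) * size] for vec in vectors]
            for h in range(num_heads)]


def multi_head_attention(x, num_heads):
    # Assumption: Q = K = V = x, with no learned projections.
    head_inputs = split_heads(x, num_heads)
    head_outputs = [attention(part, part, part) for part in head_inputs]
    # Concatenate the heads' results token by token.
    return [[value for head in head_outputs for value in head[t]]
            for t in range(len(x))]


# Three tokens, each a made-up 4-dimensional vector.
x = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]

out = multi_head_attention(x, num_heads=2)
print(len(out), len(out[0]))  # 3 4
```

Each head computes its own attention weights from its own slice, so two heads can focus on different words for the same token. The output keeps the input's shape.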

⑧ Transformer は何をしているの？

⑧ What does Transformer do?

Transformer は、Multi-head Attention と小さなフィードフォワードネットワーク（FFN）のブロックを何層も積み重ねたアーキテクチャです。CNN が畳み込み層を積み重ねるように、Transformer は Attention ブロックを積み重ねます。

The Transformer is an architecture that stacks many layers of blocks, each made of Multi-head Attention followed by a small feed-forward network (FFN). Just as a CNN stacks convolution layers, a Transformer stacks Attention blocks.

flowchart TB
  IN["入力トークン列\n（各単語→ベクトル）"]
  subgraph BLK ["× N 層繰り返し"]
    direction TB
    MHA["Multi-head Attention\n（全単語ペアを参照）"]
    FFN["Feed-Forward Network\n（位置ごとの変換）"]
    MHA --> FFN
  end
  OUT["出力\n（次の単語の確率など）"]
  IN --> BLK --> OUT
  style IN fill:#E8F5E9,stroke:#2E7D32
  style MHA fill:#EDE7F6,stroke:#7B1FA2
  style FFN fill:#E3F2FD,stroke:#1565C0
  style OUT fill:#FFF3E0,stroke:#E65100

Transformer = [Multi-head Attention → FFN] を N 層積み重ねたもの

Transformer = [Multi-head Attention → FFN] repeated N times
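
The stacking pattern can be sketched in a few lines of plain Python. Everything here is a toy under stated assumptions: single-head self-attention stands in for Multi-head Attention, Q, K, and V are all the input itself, and `feed_forward` is a fixed made-up function rather than a learned network. Real Transformers also contain further components that this simplified picture leaves out.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Every token attends to every token (Q = K = V = x, an assumption)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, x)) for i in range(d)])
    return out


def feed_forward(vector):
    """Toy position-wise FFN: transforms one token without looking at others."""
    return [max(0.0, 2.0 * v - 0.5) for v in vector]


def transformer_block(x):
    attended = self_attention(x)                      # mix across tokens
    return [feed_forward(token) for token in attended]  # transform each token


def transformer(x, num_layers):
    """[Attention -> FFN] repeated num_layers times."""
    for _ in range(num_layers):
        x = transformer_block(x)
    return x


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up token vectors
result = transformer(tokens, num_layers=3)
print(len(result), len(result[0]))  # 3 2
```

The shape going in and coming out of each block is the same, which is what allows the block to be repeated N times.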

⑨ ROCm との接続

⑨ Connection to ROCm

Attention の核心計算である $QK^\top$ と、softmax の結果に $V$ を掛ける計算は、どちらも大きな行列積（GEMM）です。中学生向けに言えば、大きな表どうしの掛け算です。ROCm では、この種の計算は主に rocBLAS などの行列計算ライブラリが支えます。

The core computations, $QK^\top$ and the product of the softmax weights with $V$, are both large matrix multiplications (GEMM): in simple terms, multiplying big tables of numbers. In ROCm, this kind of work is mainly supported by matrix-math libraries such as rocBLAS.
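
To show what "multiplying big tables" means, here is a deliberately naive matrix product in plain Python applied to $QK^\top$. The matrices are made-up 3×2 examples, and the helpers `matmul` and `transpose` are written for this illustration; the sketch does not use or imitate the rocBLAS API, which performs the same arithmetic on the GPU.

```python
def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Naive matrix product: (rows x inner) times (inner x cols)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


# Three words, each with a 2-dimensional Query and Key (made-up numbers).
Q = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
K = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]

scores = matmul(Q, transpose(K))  # one score for every pair of words
for row in scores:
    print(row)
# [1.0, 3.0, 5.0]
# [2.0, 4.0, 6.0]
# [3.0, 7.0, 11.0]
```

With n words the score table has n × n entries, which is why both the amount of arithmetic and the memory needed grow quickly as sequences get longer.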

また、文が長くなると「単語どうしの組み合わせ」を全部比べる必要があり、メモリをたくさん使います。Flash Attention は、その計算を小さなかたまりに分けて、途中で使うメモリを減らす考え方です。gfx900 のような旧世代 GPU では GPU 内部の小さくて速い作業スペースが限られるため、恩恵が小さめになる場合があります。

As sequences get longer, Attention must compare many pairs of words, which uses a lot of memory. Flash Attention is the idea of breaking that work into smaller chunks so it needs less temporary memory. On older GPUs like gfx900, the small fast workspace inside the GPU is more limited, so the benefit can be smaller.
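
The chunking idea can be illustrated for a single query in plain Python. This is a simplified sketch, not the actual Flash Attention GPU kernel: it only shows that Keys and Values can be processed a few at a time, with a running maximum and running sum, and still give the same result as computing all scores at once. The function names and the data are made up for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_full(query, keys, values):
    """Reference version: holds every score in memory at once."""
    scores = [dot(query, k) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[i] for e, v in zip(exps, values)) / total
            for i in range(dim)]


def attention_chunked(query, keys, values, chunk_size):
    """Same result, but only one chunk of scores exists at any time."""
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator so far
    acc = [0.0] * len(values[0])  # unnormalised weighted sum of Values so far
    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = [dot(query, k) for k in k_chunk]
        new_max = max(running_max, max(scores))
        # Rescale earlier totals so everything is relative to the new maximum.
        rescale = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(exps)
        acc = [a * rescale + sum(e * v[i] for e, v in zip(exps, v_chunk))
               for i, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


query = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

full = attention_full(query, keys, values)
chunked = attention_chunked(query, keys, values, chunk_size=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(full, chunked)))  # True
```

The chunked version never stores more than `chunk_size` scores at a time, which is the memory saving the text describes. How small the chunks need to be depends on how much fast on-chip workspace the GPU has.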

ここで出る難しい言葉を一言で

Quick translations for the harder terms here

GEMM: 大きな表どうしの掛け算。Flash Attention: Attention を小分けにしてメモリ節約する工夫。GPU 内部の小さくて速い作業スペース: その場で計算を回すための小さな作業机、くらいの理解で十分です。

GEMM: multiplying large tables of numbers. Flash Attention: a way to split Attention into smaller chunks to save memory. The GPU's small fast workspace: think of it as a tiny desk inside the GPU used for immediate work.

まとめ

Summary

畳み込みは「近くだけ見る」、Attention は「全部見て重みをつける」
Convolution sees only nearby; Attention sees all and assigns weights
softmax がスコアを「どこにどれだけ注目するかの配分」に変える
softmax turns scores into an attention distribution across words
Q（何が欲しいか）と K（各単語の目印）でスコアを計算し、配分にしたがって V（中身）を集める
Q (what I want) and K (each word's label) → score → collect V (content) by distribution
Multi-head Attention は複数の見方で同時に文を参照する仕組み
Multi-head Attention reads the sequence from multiple perspectives simultaneously
Transformer は Attention ブロックを積み重ねたもので、LLM の主要構成要素
Transformer stacks Attention blocks and is the backbone of modern LLMs
核心計算は大きな行列計算で、ROCm では rocBLAS 系の経路が重要
The core computation is large matrix math, and rocBLAS-style paths are important in ROCm

次に読むなら: イメージでわかる深層学習

Continue with: Visual Deep Learning

畳み込み・フィルタ・MIOpen solver の関係を、さらに詳しく解説しています。

Convolution, filters, and MIOpen solvers — explained in detail.
