See Attention in ROCm

"See in ROCm" series

Compute Scaled Dot-Product Attention by hand, then compare it with nn.MultiheadAttention to confirm how it works.

📘 Theory page: Visual Attention (roles of Q / K / V and how scaling works)

How to read this page: 1) first look at the "score" table, 2) then the "attention weight" table, 3) finally the "output" table. Reading in that order makes the page easier to follow.

Note: GPU selection on this page is still written as "cuda". That is the name the PyTorch ROCm build keeps for API compatibility; the attention computation itself runs through HIP/ROCm.

1. What We'll Do

Run the forward pass of Scaled Dot-Product Attention on toy data: 3 tokens, each represented by 4 numbers. For this page, it is enough to read a token as "one word-like piece."

What to take away first: the score table says "how similar," the attention-weight table says "how much to look," and the output table is "the mixed result."

2. Attention by Hand

import torch
import torch.nn.functional as F
import math

assert torch.version.hip is not None, "This page expects PyTorch ROCm build"
DEVICE = torch.device("cuda")

# Dummy input: 3 tokens, dimension 4
x = torch.tensor([
    [1.0, 0.0, 1.0, 0.0],  # token A
    [0.0, 1.0, 0.0, 1.0],  # token B
    [1.0, 1.0, 0.0, 0.0],  # token C
]).to(DEVICE)

# Self-attention: Q = K = V = x (simplified version)
Q, K, V = x, x, x
d_k = Q.size(-1)  # = 4

# Scaled Dot-Product Attention
scores = torch.matmul(Q, K.T) / math.sqrt(d_k)
weights = F.softmax(scores, dim=-1)
output = torch.matmul(weights, V)

print("Scores (QK^T / √d):")
print(scores.cpu())
print("Attention weights (softmax):")
print(weights.cpu())
print("Output:")
print(output.cpu())

Code-to-math link: the core of attention is a three-step pattern: multiply matrices, turn the result into proportions, then multiply once more.

$$X \in \mathbb{R}^{3 \times 4},\quad Q = K = V = X$$

$$S = \frac{QK^T}{\sqrt{d_k}} \in \mathbb{R}^{3 \times 3},\quad W = \mathrm{softmax}(S) \in \mathbb{R}^{3 \times 3},\quad O = WV \in \mathbb{R}^{3 \times 4}$$

scores = torch.matmul(Q, K.T) / math.sqrt(d_k) corresponds to $S = QK^T / \sqrt{d_k}$, weights = F.softmax(...) to $W$, and output = torch.matmul(weights, V) to $O = WV$. Even if you only track shapes, you can see the flow 3×4 → 3×3 → 3×4.

You do not need to understand all of the code at once. To start, following these four lines is enough.

Q, K, V = x, x, x: As a simplification, the same input is used directly as Q, K, and V.

scores = torch.matmul(Q, K.T) / math.sqrt(d_k): This builds a table of how similar each pair of tokens is.

weights = F.softmax(...): This turns the similarity scores into attention proportions.

output = torch.matmul(weights, V): This mixes information according to those attention proportions to produce the final output.
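The same four steps can also be sketched without PyTorch, which may help when you want to check the arithmetic on paper. The version below is a minimal illustration using only the Python standard library; the helper names matmul, transpose, and softmax_rows are made up for this sketch and are not part of any library.

```python
import math

# Same toy input as the page: 3 tokens, 4 numbers each.
X = [
    [1.0, 0.0, 1.0, 0.0],  # token A
    [0.0, 1.0, 0.0, 1.0],  # token B
    [1.0, 1.0, 0.0, 0.0],  # token C
]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows (hypothetical helper)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns (hypothetical helper)."""
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply softmax to each row so that every row sums to 1 (hypothetical helper)."""
    out = []
    for row in m:
        biggest = max(row)
        exps = [math.exp(v - biggest) for v in row]  # subtract the max for stability
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

Q = K = V = X          # simplified self-attention, as on the page
d_k = len(Q[0])        # 4

scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]
weights = softmax_rows(scores)
output = matmul(weights, V)

# Print each table with its shape: 3x3, 3x3, then 3x4.
for name, m in [("scores", scores), ("weights", weights), ("output", output)]:
    print(name, f"{len(m)}x{len(m[0])}")
    for row in m:
        print("  ", [round(v, 4) for v in row])
```

Every row of weights sums to 1, which is a quick sanity check to apply to any attention-weight table.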

3. What Happens When You Run It

Scores (QK^T / √d):
tensor([[1.0000, 0.0000, 0.5000],
        [0.0000, 1.0000, 0.5000],
        [0.5000, 0.5000, 1.0000]])
Attention weights (softmax):
tensor([[0.5065, 0.1863, 0.3072],
        [0.1863, 0.5065, 0.3072],
        [0.2741, 0.2741, 0.4519]])
Output:
tensor([[0.8137, 0.4935, 0.5065, 0.1863],
        [0.4935, 0.8137, 0.1863, 0.5065],
        [0.7259, 0.7259, 0.2741, 0.2741]])

The attention-weight matrix is a table of how much each token looks at every token, itself included. In token A's row, for example, A looks most strongly at itself and next at token C.

At first, you do not need to follow every number. It is enough to read one row as "who this token looks at, and by how much."
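Reading a row as "who this token looks at" can be made concrete with a small sketch. The code below starts from the scaled scores for this page's toy input, applies softmax to each row, and lists the tokens from most to least attended. It uses only the standard library, and the names softmax and rankings are illustrative choices, not an existing API.

```python
import math

tokens = ["A", "B", "C"]

# Scaled scores for the toy input on this page (QK^T / sqrt(d_k)).
scores = [
    [1.0, 0.0, 0.5],  # row for token A
    [0.0, 1.0, 0.5],  # row for token B
    [0.5, 0.5, 1.0],  # row for token C
]

def softmax(row):
    """Turn one row of scores into proportions that sum to 1."""
    exps = [math.exp(v) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

rankings = {}
for name, row in zip(tokens, scores):
    w = softmax(row)
    # Sort all tokens (the token itself included) by the attention they receive.
    order = sorted(zip(tokens, w), key=lambda pair: pair[1], reverse=True)
    rankings[name] = [t for t, _ in order]
    summary = ", ".join(f"{t}: {p:.0%}" for t, p in order)
    print(f"token {name} looks at -> {summary}")
```

In this toy input every token attends most to itself, because Q and K are the same vectors and a vector is most similar to itself here.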

Theory page mapping:
Q (Query): "What am I looking for?" Each token's query vector.
K (Key): "What do I have?" Each token's identifier vector.
V (Value): "What to actually pass on." This is what gets weighted by the attention weights.
Divide by √d_k: Prevents the scores from growing so large that softmax collapses to values near 0 and 1.
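The effect of dividing by √d_k is easier to see with longer vectors than the page's 4-number tokens. The sketch below is an illustration under assumed values: d_k = 64 and the three key vectors are made up for the demonstration, and only the standard library is used.

```python
import math

def softmax(row):
    """Softmax over one row, with the max subtracted for numerical stability."""
    biggest = max(row)
    exps = [math.exp(v - biggest) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_k = 64  # hypothetical vector size, larger than the page's toy d_k = 4
query = [1.0] * d_k
keys = [
    [1.0] * d_k,  # points the same way as the query
    [0.8] * d_k,  # slightly less similar
    [0.5] * d_k,  # even less similar
]

raw = [dot(query, k) for k in keys]          # dot products grow with d_k
scaled = [s / math.sqrt(d_k) for s in raw]   # same scores divided by sqrt(d_k)

w_raw = softmax(raw)
w_scaled = softmax(scaled)

print("raw scores:     ", [round(s, 2) for s in raw])
print("softmax(raw):   ", [round(w, 4) for w in w_raw])
print("scaled scores:  ", [round(s, 2) for s in scaled])
print("softmax(scaled):", [round(w, 4) for w in w_scaled])
```

Without scaling, almost all of the weight lands on the first key; with scaling, the second key still receives a visible share, so the output remains a mix rather than a copy of one token.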

4. Same Thing with nn.MultiheadAttention

import torch.nn as nn

# MultiheadAttention: 1 head, dimension 4
mha = nn.MultiheadAttention(
    embed_dim=4, num_heads=1, batch_first=True
).to(DEVICE)

# Reshape the input to (batch=1, seq=3, dim=4)
x_batch = x.unsqueeze(0)

with torch.no_grad():
    out, attn_weights = mha(x_batch, x_batch, x_batch)

print("MHA attention weights:")
print(attn_weights.squeeze().cpu())
print("MHA output shape:", out.shape)

nn.MultiheadAttention packages the process we just saw ("measure similarity → turn it into attention proportions → mix information") as a proper, reusable module.

Internally, it first applies learned linear projections to build new Q, K, and V, and applies an output projection afterwards. Those projection weights start out random, so the numbers differ from the hand-computed version (and from run to run), but the flow is the same.
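To illustrate what "new Q, K, and V" means, here is a rough sketch that multiplies the toy input by three projection matrices before running the same attention steps. The matrices W_q, W_k, and W_v are hypothetical values picked by hand; a real nn.MultiheadAttention learns its own weights and also adds biases and an output projection, which this sketch leaves out.

```python
import math

X = [
    [1.0, 0.0, 1.0, 0.0],  # token A
    [0.0, 1.0, 0.0, 1.0],  # token B
    [1.0, 1.0, 0.0, 0.0],  # token C
]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows (hypothetical helper)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Scaled dot-product attention; returns (weights, output)."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = []
    for row in scores:
        biggest = max(row)
        exps = [math.exp(s - biggest) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights, matmul(weights, v)

# Hypothetical 4x4 projection matrices, chosen by hand for illustration only.
W_q = [[2.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
W_k = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.5, 0.0], [0.0, 0.0, 0.0, 0.5]]
W_v = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

# Hand version from this page: Q = K = V = X.
plain_weights, plain_out = attention(X, X, X)

# Projected version: build new Q, K, V first, then run the same steps.
proj_weights, proj_out = attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))

print("plain weights, token A:    ", [round(w, 4) for w in plain_weights[0]])
print("projected weights, token A:", [round(w, 4) for w in proj_weights[0]])
```

The shapes and the three-step flow are unchanged; only the numbers move, which is the same kind of difference you see between the hand computation and the module's result.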

5. Under the Hood

Put simply, attention is a combination of "multiply tables," "turn them into proportions with softmax," and "multiply tables again."

QK^T and weights × V are matrix multiplications, so on ROCm they mainly end up as rocBLAS computations. Softmax runs as its own dedicated kernel.

Transformers repeat this flow many times, which is why LLMs need so much matrix computation.
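A back-of-the-envelope count gives a feel for how quickly that matrix work grows. The function below is a rough estimate, not a measurement: it counts one multiply plus one add as 2 FLOPs, covers only the two matrix multiplications in a single attention head, and ignores softmax, scaling, and any projections. The name attention_matmul_flops and the larger sequence lengths are assumptions made for this illustration.

```python
def attention_matmul_flops(seq_len, head_dim):
    """Rough FLOP count for QK^T and weights x V in one attention head."""
    qk = 2 * seq_len * seq_len * head_dim  # Q (n x d) times K^T (d x n)
    wv = 2 * seq_len * seq_len * head_dim  # weights (n x n) times V (n x d)
    return qk + wv

# The toy example on this page: 3 tokens, 4 numbers each.
toy = attention_matmul_flops(seq_len=3, head_dim=4)
print("toy example:", toy, "FLOPs")

# Hypothetical larger settings, chosen only to show the growth.
for n in (128, 1024, 8192):
    flops = attention_matmul_flops(seq_len=n, head_dim=64)
    print(f"seq_len={n:5d}: {flops:,} FLOPs per head")
```

Because the sequence length appears squared, doubling the number of tokens roughly quadruples this part of the work.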

6. What to Observe on ROCm

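# Check that this is a ROCm build
python -c "import torch; print('torch.version.hip =', torch.version.hip)"

# Observe HIP calls while attention runs
# (attention_example.py is assumed to contain the code from this page)
rocprof --hip-trace python attention_example.py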

🏭 Why multiple heads? A single attention head produces only one set of attention weights per token, so it tends to capture one kind of relationship at a time. MultiheadAttention attends from several "viewpoints" simultaneously and concatenates the results (the theory page compares them to "multiple magnifying glasses").
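The idea of several viewpoints can be sketched by splitting each 4-number token into two 2-number halves and running attention on each half separately. This is a simplified illustration: a real multi-head layer applies learned projections before splitting and an output projection after concatenating, and the helper names attention and split_heads are invented for this sketch.

```python
import math

X = [
    [1.0, 0.0, 1.0, 0.0],  # token A
    [0.0, 1.0, 0.0, 1.0],  # token B
    [1.0, 1.0, 0.0, 0.0],  # token C
]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows (hypothetical helper)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Scaled dot-product attention; returns (weights, output)."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = []
    for row in scores:
        biggest = max(row)
        exps = [math.exp(s - biggest) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights, matmul(weights, v)

def split_heads(x, num_heads):
    """Cut every token vector into num_heads equal slices, one matrix per head."""
    head_dim = len(x[0]) // num_heads
    return [[row[h * head_dim:(h + 1) * head_dim] for row in x] for h in range(num_heads)]

head_weights, head_outputs = [], []
for x_h in split_heads(X, num_heads=2):
    w, o = attention(x_h, x_h, x_h)  # each head sees only its own slice
    head_weights.append(w)
    head_outputs.append(o)

# Concatenate the per-head outputs back into 4 numbers per token.
combined = []
for i in range(len(X)):
    row = []
    for o in head_outputs:
        row.extend(o[i])
    combined.append(row)

for h, w in enumerate(head_weights):
    print(f"head {h} weights for token A:", [round(v, 4) for v in w[0]])
print("combined output shape:", f"{len(combined)}x{len(combined[0])}")
```

The two heads produce different weight tables from the same tokens, because each one compares a different slice of the vectors.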

📖 Read next:
Visual Attention: Back to the theory page
See Inference Flow in ROCm: Where Attention is used during inference
See Linear Algebra in ROCm: The matrix multiplication underlying QK^T