See Inference Flow in ROCm

Run forward-only inference on a trained model in PyTorch ROCm and compare it with training.

📘 Theory page: Visual Inference Flow (diagrams of why inference is fast, KV Cache, and more)

How to read this page: 1) First look at model.training = False. 2) Then check pred.requires_grad = False. 3) Finally check weight changed = False. Those three outputs carry the main point of the page.

Note: PyTorch ROCm uses "cuda" as the GPU device name. This is a compatibility API name; execution still runs on the HIP/ROCm side.

1. What We'll Do

This page focuses on three things: 1) eval() to switch to inference mode, 2) no_grad() to disable gradient tracking, and 3) no weight updates. We run a minimal example that makes all three visible.

If you get stuck, keep just these three ideas: eval() means inference mode, no_grad() means no training-related bookkeeping, and weight changed = False means the weights were not updated.

2. Minimal Code

import torch
import torch.nn as nn

# 0) Pre-run check (a prerequisite, not the main point)
print("torch.version.hip =", torch.version.hip)
assert torch.version.hip is not None, "This example expects PyTorch built for ROCm"

# PyTorch built for ROCm still uses "cuda" as the device string
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device =", DEVICE)

# 1) A small model that is easy to explain (weights fixed by hand)
model = nn.Linear(3, 1).to(DEVICE)
with torch.no_grad():
    model.weight[:] = torch.tensor([[0.1, 0.2, 0.3]], device=DEVICE)
    model.bias[:] = torch.tensor([0.5], device=DEVICE)

# 2) Switch to inference mode
model.eval()
print("model.training =", model.training)  # False means inference mode

# 3) Input, plus a snapshot of the weights before inference
x = torch.tensor([[1.0, 2.0, 3.0]], device=DEVICE)
w_before = model.weight.detach().clone()

# 4) Inference: forward only (gradients are not tracked)
with torch.no_grad():
    pred = model(x)

# Snapshot of the weights after inference, for comparison
w_after = model.weight.detach().clone()

print("input:", x.cpu())
print("prediction:", pred.cpu())
print("pred.requires_grad:", pred.requires_grad)
print("weight changed:", not torch.allclose(w_before, w_after))

Code-to-math link: This nn.Linear(3, 1) means "take a 1-by-3 input, multiply it by 3 weights, then add a bias."

$$ x = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix}, \quad w = \begin{bmatrix} 0.1 \\ 0.2 \\ 0.3 \end{bmatrix}, \quad b = 0.5 $$

$$\hat{y} = xw + b = 1 \cdot 0.1 + 2 \cdot 0.2 + 3 \cdot 0.3 + 0.5 = 1.9$$

pred = model(x) computes exactly this $\hat{y}$ and nothing more. In inference we do not call backward() or optimizer.step() afterward.
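To connect the formula to the code, the sketch below recomputes $\hat{y}$ by hand and compares it with model(x). It runs on the CPU for simplicity, and the name manual is introduced here only for illustration. Note that nn.Linear stores its weight with shape (out_features, in_features), so the column vector $w$ in the formula corresponds to model.weight.T.

```python
import torch
import torch.nn as nn

# Same fixed weights as the minimal example; the CPU is enough for this check.
model = nn.Linear(3, 1)
with torch.no_grad():
    model.weight.copy_(torch.tensor([[0.1, 0.2, 0.3]]))
    model.bias.copy_(torch.tensor([0.5]))
model.eval()

x = torch.tensor([[1.0, 2.0, 3.0]])

with torch.no_grad():
    pred = model(x)
    # weight has shape (1, 3), so w in the formula is model.weight.T (3, 1).
    manual = x @ model.weight.T + model.bias

print("pred   =", pred)
print("manual =", manual)
print("match  =", torch.allclose(pred, manual))
```

Both values should agree with the hand calculation 1.9, up to float32 rounding.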

The code may look long, but at first it is enough to follow these four lines.

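model.eval(): This switches the model into inference mode. It changes the behavior of layers such as dropout and batch normalization; nn.Linear has no such behavior, so here it only sets model.training to False.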

with torch.no_grad(): This tells PyTorch not to keep the records it would need for training.

pred = model(x): This is the forward step that turns the input into an answer.

weight changed: This checks that the weights did not change.
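The first two of these lines do different jobs, which is easy to miss. The sketch below, run on the CPU with illustrative variable names, suggests that eval() on its own does not stop gradient tracking. Only the no_grad() block does.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
x = torch.tensor([[1.0, 2.0, 3.0]])

model.eval()  # changes layer behavior only; autograd is still enabled

# eval() alone: the forward pass is still recorded for a possible backward.
pred_eval_only = model(x)

# eval() + no_grad(): nothing is recorded.
with torch.no_grad():
    pred_no_grad = model(x)

print("eval only   -> requires_grad:", pred_eval_only.requires_grad)
print("with no_grad -> requires_grad:", pred_no_grad.requires_grad)
print("same values:", torch.allclose(pred_eval_only, pred_no_grad))
```

The predicted values are identical in both cases. What differs is whether PyTorch keeps the bookkeeping needed for a backward pass.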

3. What Happens When You Run It

torch.version.hip = 6.x.x
device = cuda
model.training = False
input: tensor([[1., 2., 3.]])
prediction: tensor([[1.9000]])
pred.requires_grad: False
weight changed: False

What to look for: model.training=False (inference mode), pred.requires_grad=False (no gradient tracking), and weight changed=False (no weight update). On this page, "inference" means all three hold at once.

4. Differences from Training: What's Skipped

Training (training-flow): forward → loss → backward → optimizer.step()
Inference (here): forward only

Differences confirmed by this code: uses eval() / uses no_grad() / does not update weights
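For contrast, here is a minimal sketch of one training step on the same small model, run on the CPU. The target value, the learning rate, and the choice of SGD with an MSE loss are assumptions made only for this illustration; they do not come from the training page.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
with torch.no_grad():
    model.weight.copy_(torch.tensor([[0.1, 0.2, 0.3]]))
    model.bias.copy_(torch.tensor([0.5]))

x = torch.tensor([[1.0, 2.0, 3.0]])
target = torch.tensor([[1.0]])                            # hypothetical label
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # hypothetical setup

model.train()
w_before = model.weight.detach().clone()

pred = model(x)                              # forward
loss = nn.functional.mse_loss(pred, target)  # loss
optimizer.zero_grad()
loss.backward()                              # backward
optimizer.step()                             # weight update

w_after = model.weight.detach().clone()

print("model.training =", model.training)
print("pred.requires_grad:", pred.requires_grad)
print("weight changed:", not torch.allclose(w_before, w_after))
```

All three flags come out as True here, the opposite of the inference run.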

🏭 What gets faster:

① No training record-keeping: intermediate values do not have to be stored for a later backward pass
② No backward pass: no corrections are computed in reverse through the model
③ No weight update: the optimizer never runs

torch.no_grad() is a context manager that tells PyTorch not to track gradients. Use it, or an equivalent such as torch.inference_mode(), whenever you run inference. When you chat with an LLM through a tool such as Ollama, this kind of forward-only computation is repeated internally.
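The idea of forward-only computation repeated many times can be sketched with a toy loop. This is not how an LLM actually decodes (there are no tokens, no sampling, and no KV Cache). nn.Linear(3, 3) is a stand-in chosen only so that each output can be fed back as the next input, and the step count of 4 is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 3)  # toy stand-in: output size equals input size
model.eval()

w_before = model.weight.detach().clone()
state = torch.tensor([[1.0, 2.0, 3.0]])
outputs = []

with torch.no_grad():
    for step in range(4):
        state = model(state)  # forward only; the output is the next input
        outputs.append(state)

w_after = model.weight.detach().clone()

print("steps run:", len(outputs))
print("any output tracks gradients:", any(o.requires_grad for o in outputs))
print("weight changed:", not torch.equal(w_before, w_after))
```

However many times the loop runs, no output tracks gradients and the weights stay as they were.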

From a ROCm point of view, inference runs mainly the computation needed to produce an answer. Without the backward pass and weight updates that training needs, there is less work to do.

5. What to Observe on ROCm

# Confirm that PyTorch was built for ROCm
python -c "import torch; print('torch.version.hip =', torch.version.hip)"

# Trace the HIP calls made during inference (forward only)
rocprof --hip-trace python inference_flow_example.py
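As a rough complement to the trace, the following sketch times a forward-only loop against a forward-plus-backward loop from inside Python. The layer size, batch size, and iteration count are arbitrary assumptions, and the helper names sync, time_forward_only, and time_forward_backward are hypothetical. It falls back to the CPU when no GPU is visible, and the numbers it prints depend on the machine.

```python
import time
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(256, 256).to(device)   # hypothetical size
x = torch.randn(64, 256, device=device)  # hypothetical batch

def sync():
    # GPU kernels run asynchronously; wait for them before reading the clock.
    if device.type == "cuda":
        torch.cuda.synchronize()

def time_forward_only(n_iters=50):
    model.eval()
    with torch.no_grad():
        model(x)  # warm-up
        sync()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        sync()
    return (time.perf_counter() - start) / n_iters

def time_forward_backward(n_iters=50):
    model.train()
    model(x).sum().backward()  # warm-up
    sync()
    start = time.perf_counter()
    for _ in range(n_iters):
        model.zero_grad()
        model(x).sum().backward()
    sync()
    return (time.perf_counter() - start) / n_iters

t_inf = time_forward_only()
t_train = time_forward_backward()
print(f"forward only:       {t_inf * 1e6:.1f} us/iter")
print(f"forward + backward: {t_train * 1e6:.1f} us/iter")
```

Wall-clock timing like this is coarse. The rocprof trace remains the better tool for seeing which HIP calls actually ran.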

📖 Read next:
Visual Inference Flow: Back to the theory page
See Training Flow in ROCm: Training with backward + optimizer
See Attention in ROCm: Run the core mechanism of inference
