Experiment History | Backend-Dependent Behavior on Legacy HBM GPUs

1) Problem framing & environment isolation 1) 問題設定と環境統制

Did:実施: Defined the core question — "support policy vs. actual executability" — and set up two independent Ollama services on the same host, each pinned to a single backend. Fixed the model (qwen3.5:2b, Q8_0, 2.7 GB), prompt, and token budget across all runs. 「公式サポート状況」と「実際の実行可能性」を分離して問いを設定。同一ホスト上に2つの独立した Ollama サービスを構築し、それぞれを単一バックエンドに固定。モデル・プロンプト・トークン上限を全 run で統一した。
Setup:構成: Two systemd services, each with exclusive backend libraries: バックエンドライブラリを占有させた2つの systemd サービス:

ROCm service :11435 → libggml-hip.so (ollama 0.17.5 / 0.17.6) + HSA_OVERRIDE_GFX_VERSION=9.0.0 Vulkan service :11434 → libggml-vulkan.so (ollama 0.17.4) + no override

Result:結果: A reproducible testbed where backend is the only variable. Confounders (prompt, model, quantization, token budget) are held constant. バックエンドのみを変数とした再現可能な実験基盤。プロンプト・モデル・量子化・トークン数を固定して交絡を排除。

2) Build-time & runtime gate analysis 2) ビルド時・実行時ゲート解析

Did:実施: Inspected the CMake HIP target filter and runtime device recognition to understand exactly where gfx900 stands in the build pipeline. CMake HIP ターゲットフィルタと実行時デバイス認識を精査し、gfx900 がビルドパイプライン上でどう扱われているかを確認した。

# ollama/CMakeLists.txt : 125-128 if(NOT AMDGPU_TARGETS) find_package(hip REQUIRED) list(FILTER AMDGPU_TARGETS INCLUDE REGEX "^gfx(94[012]|101[02]|1030|110[012]|120[01])$") endif() # Permitted targets: gfx940/941/942 (CDNA3), gfx1010/1012 (RDNA1), # gfx1030 (RDNA2), gfx1100/1101/1102 (RDNA3), gfx1200/1201 (RDNA4) # gfx900 is ABSENT → excluded from default HIP builds

However, the runtime recognized gfx900 without error. The rocm-smi log and runner journal both confirmed device enumeration: ただし、実行時には gfx900 がエラーなく認識された。rocm-smi ログとランナージャーナルが device 列挙を確認:

# run_20260307_012643/backend_probe.txt (ROCm service, 2026-03-07T01:26:43+09:00) GPU[0] GFX Version: gfx900 Device 0: AMD Radeon RX Vega, gfx900:xnack- (0x900), VMM: no, Wave Size: 64 load_backend: loaded ROCm backend from /usr/lib/ollama/libggml-hip.so

Result:結果: gfx900 is absent from the default CMake filter (not officially supported), but the shipped libggml-hip.so on this host — likely built with explicit AMDGPU_TARGETS=gfx900 — allowed runtime execution. The HSA_OVERRIDE_GFX_VERSION=9.0.0 environment variable was required to make ROCm recognize the device. gfx900 はデフォルト CMake フィルタに含まれないが（公式サポート外）、このホストの libggml-hip.so（おそらく AMDGPU_TARGETS=gfx900 を明示してビルド）により実行経路が残っていた。ROCm でのデバイス認識には HSA_OVERRIDE_GFX_VERSION=9.0.0 が必要。

Caution: ROCm results in this study were obtained under override-assisted configuration. This does not represent an officially supported deployment. 注意: 本研究の ROCm 結果は override-assisted 構成で得られたもの。公式サポート環境ではない。

3) Semantic trace of `num_gpu` through 9 source points 3) `num_gpu` の意味を9ファイルにわたって追跡

Did:実施: Traced the parameter end-to-end to confirm it means "number of layers to offload to GPU," not "number of GPU devices." パラメータを端から端まで追跡し、「GPU デバイス数」ではなく「GPU にオフロードする層数」であることを確認した。

Source fileソースファイル	Symbol / valueシンボル / 値	Key evidence根拠
ollama/_types.py:104-110	`num_gpu: Optional[int]`	Python field, forwarded as-is	Python フィールド、そのまま転送
ollama/api/types.go:604	`NumGPU int`	Go struct field	Go 構造体フィールド
ollama/api/types.go:1075	`NumGPU: -1` (default)	Default = let server decide	デフォルト: サーバーに委任
ollama/cmd/interactive.go:112	`/set parameter num_gpu <layers>`	CLI help text says "layers"	CLI ヘルプが "layers" と明記
ollama/llm/server.go:992	`requestedLayers`	Mapped from NumGPU	NumGPU からマップ
ollama/llm/server.go:1063-1076	`assignLayers()`	Distributes layers across devices	デバイス間でレイヤを配分
runner/llamarunner/runner.go:906-924	`NumGpuLayers`	Runner-side layer count	ランナー側レイヤ数
ollama/llama/llama.go:264-267	`cparams.n_gpu_layers`	C-bridge to llama.cpp	llama.cpp への C ブリッジ
include/llama.h:289	`// number of layers to store in VRAM`	Official definition: layer count	公式定義: レイヤ数

Result:結果: num_gpu=1 means "offload 1 transformer layer to GPU," not "use 1 GPU." The Vulkan journal confirmed layer-level granularity in practice: num_gpu=1 は「Transformer 層を 1 層 GPU にオフロード」を意味し、「GPU 1 台使用」ではない。Vulkan ジャーナルで実際の粒度を確認:

# run_20260307_013050/ollama_journal_since_start.txt (num_gpu=1) offloaded 1/25 layers to GPU model weights device=Vulkan0 size=53.1 MiB model weights device=CPU size=3.0 GiB kv cache device=Vulkan0 size=8.0 MiB compute graph device=Vulkan0 size=257.1 MiB

4) First matched backend comparison 2026-03-07 4) 初回バックエンド比較実験 2026-03-07

Did:実施: Ran qwen3.5:2b with num_gpu ∈ {0, 1, 2, −1} on both backends in the same session. 同一セッション内で両バックエンドに num_gpu ∈ {0, 1, 2, −1} を投入した。

run_id	Backend	num_gpu	Status状態	elapsed_sec	eval_count
run_20260307_012643	ROCm 0.17.4	0	✅ ok	46.69 s	512
run_20260307_012643	ROCm 0.17.4	1	✅ ok	48.74 s	512
run_20260307_012643	ROCm 0.17.4	2	✅ ok	47.75 s	512
run_20260307_012643	ROCm 0.17.4	−1	✅ ok	44.27 s	512
run_20260307_013050	Vulkan 0.17.4	0	✅ ok	45.00 s	512
run_20260307_013050	Vulkan 0.17.4	1	❌ error	2.67 s	—
run_20260307_013050	Vulkan 0.17.4	2	❌ error	7.64 s	—
run_20260307_013050	Vulkan 0.17.4	−1	❌ error	9.82 s	—

Result:結果: ROCm completed all four conditions (4/4 ok); Vulkan succeeded only at num_gpu=0 (CPU execution) and crashed within seconds when any layer was offloaded to GPU. Vulkan Vulkan also confirmed device capabilities: ROCm は全条件で完走（4/4 ok）。Vulkan は num_gpu=0（CPU 実行）のみ成功し、1 層でも GPU にオフロードするとクラッシュ。Vulkan のデバイス能力も確認:

int dot: 0 — no hardware acceleration for VK_KHR_shader_integer_dot_product int dot: 0 — VK_KHR_shader_integer_dot_product のハードウェアアクセラレーションなし
matrix cores: none — no cooperative matrix support (VK_KHR_cooperative_matrix) matrix cores: none — cooperative matrix（VK_KHR_cooperative_matrix）非サポート

5) Failure localization — SIGSEGV stack trace & Vulkan classification 5) 故障局在化 — SIGSEGV スタックトレースと Vulkan 分類

Did:実施: Extracted the full crash journal and cross-referenced it with Vulkan backend source (ggml-vulkan.cpp) to understand why gfx900 fails in the compute path. クラッシュジャーナルを全抽出し、Vulkan バックエンドソース（ggml-vulkan.cpp）と照合して compute path でなぜ gfx900 が落ちるかを調べた。

# run_20260307_013050/ollama_journal_since_start.txt — 2026-03-07T01:31:39+09:00 # (1 second after "offloaded 1/25 layers to GPU" at 01:31:38) SIGSEGV: segmentation violation PC=0x7efbc5b801b5 m=41 sigcode=1 addr=0x3a8000 signal arrived during cgo execution goroutine 853 [syscall]: runtime.cgocall(0x55e54a0e2710, 0xc0000bdaa0) runtime/cgocall.go:167 github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_graph_compute_async(...) _cgo_gotypes.go:979 github.com/ollama/ollama/ml/backend/ggml.(*Context).ComputeWithNotify(...) ml/backend/ggml/ggml.go:825 github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(...) runner/ollamarunner/runner.go:716

The crash occurs in C function ggml_backend_sched_graph_compute_async, called via cgo from computeBatch. Model loading completed normally immediately before — the failure is in the compute graph execution, not loading. クラッシュは computeBatch から cgo 経由で呼ばれた C 関数 ggml_backend_sched_graph_compute_async 内で発生。モデルロードは直前に正常完了 — 失敗はcompute graph 実行時点。
Source trace: gfx900 (wave64-only) is classified as AMD_GCN in Vulkan backend, with subgroup arithmetic explicitly disabled: ソーストレース: wave64 固定の gfx900 は Vulkan バックエンドで AMD_GCN に分類され、subgroup arithmetic が明示的に無効化される:

# ggml-vulkan/ggml-vulkan.cpp : 261-297 — AMD_GCN detection if (subgroup_size_control_props.maxSubgroupSize == 64 && subgroup_size_control_props.minSubgroupSize == 64) { return vk_device_architecture::AMD_GCN; // gfx900 matches → AMD_GCN } # ggml-vulkan.cpp : 3964 — subgroup arithmetic disabled for AMD_GCN const bool use_subgroups = device->subgroup_arithmetic && device->architecture != vk_device_architecture::AMD_GCN;

Detailed legend: SIGSEGV = the process touched invalid memory and crashed. cgo execution = the Go runner was inside native C/C++ backend code. computeBatch = actual inference compute had started. AMD_GCN = the backend classified this GPU as an older wave64-only AMD path. subgroup arithmetic disabled = one optimized parallel route is excluded for this class (confirmed from source), so the backend uses a different compute path. 詳細凡例: SIGSEGV = 不正なメモリアクセスでプロセスが落ちたこと。cgo execution = Go ランナーがネイティブの C/C++ backend code 内にいたこと。computeBatch = 実際の推論計算が始まっていた段階。AMD_GCN = この GPU が古い wave64 固定の AMD 系経路として分類されたこと。subgroup arithmetic disabled = この系統では最適化された並列経路の1つが除外されたことが確認でき、別の compute path が使われること。

6) Reproduction & quantitative crash-timing analysis 2026-03-08 6) 再現確認とクラッシュ timing 定量分析 2026-03-08

Did:実施: Ran a full second matched set (same conditions) on 2026-03-08, added a 5-epoch Vulkan stability test at num_gpu=0, then aggregated all 17 runs with analyze_runs.py. 2026-03-08 に同一条件での 2 セット目を実施。num_gpu=0 での Vulkan 連続安定性確認（5 エポック）を追加し、analyze_runs.py で全 17 run を集計した。

run_id	Backend	num_gpu	Status状態	elapsed_sec	eval_count
run_20260308_201305	ROCm 0.17.6	0	✅ ok	47.98 s	512
run_20260308_201305	ROCm 0.17.6	1	✅ ok	49.18 s	512
run_20260308_201305	ROCm 0.17.6	2	✅ ok	48.30 s	512
run_20260308_201305	ROCm 0.17.6	−1	✅ ok	45.17 s	512
run_20260308_201641	Vulkan 0.17.4	0	✅ ok	45.53 s	512
run_20260308_201641	Vulkan 0.17.4	1	❌ error	2.70 s	—
run_20260308_201641	Vulkan 0.17.4	2	❌ error	7.64 s	—
run_20260308_201641	Vulkan 0.17.4	−1	❌ error	9.80 s	—

Result — crash timing across all 3 matched Vulkan runs (n=3 each): 結果 — Vulkan クラッシュ timing（全3回分、各 n=3）:

Vulkan num_gpu=1 crash time

2.653 s

std = ±0.059 s | n=3 | [2.586, 2.672, 2.700]

Vulkan num_gpu=2 crash time

7.630 s

std = ±0.014 s | n=3 | [7.614, 7.637, 7.640]

Vulkan num_gpu=−1 crash time

9.819 s

std = ±0.016 s | n=3 | [9.805, 9.816, 9.837]

Vulkan num_gpu=0 stability (run_20260308_201924)

5 / 5 ok

elapsed 42–50 s | all eval_count=512

Interpretation: std values of 14–59 ms across 3 runs spanning two days indicate a deterministic failure — a fixed point in the compute path, not a resource race or timing fluke. ROCm version update (0.17.5 → 0.17.6) produced no behavioral change. 解釈: 2 日間・3 run にわたって std が 14〜59 ms という極小値 → クラッシュは deterministic（compute path 上の固定点）であり、リソース競合や偶発的なタイミング問題ではない。ROCm バージョン更新（0.17.5 → 0.17.6）は挙動に影響しなかった。

What “gfx900” and “DP4A emulation” mean: gfx900 is AMD's architecture name for Vega 10 / Vega 56 generation GPUs. In this study, it means we are testing an older GPU target with limited modern acceleration features. DP4A is a fast integer dot-product instruction often useful for quantized inference. If hardware support is missing or weak, software may imitate the same math using slower substitute operations. That is what DP4A emulation means: keep the math result, but travel through a different code path with different speed and possibly different stability behavior. gfx900 と DP4A エミュレーションの意味: gfx900 は Vega 10 / Vega 56 世代GPUに対する AMD のアーキテクチャ名。この研究では「比較的新しい高速化機能が限られた旧世代GPUを検証している」という意味になる。DP4A は量子化推論で有用な高速整数ドット積命令。もしハードウェア対応が無い、または弱い場合、ソフトウェアは別の演算を組み合わせて同じ計算結果をまねることがある。これが DP4A エミュレーション で、計算結果は保てても、通るコード経路・速度・安定性は変わりうる。

7) Model-dependency probe & crash-phase localization 2026-03-08 7) モデル依存性の探索とクラッシュ段階の絞り込み 2026-03-08

Did:実施: Ran two additional experiments: (A) tested a different model (phi4-mini:latest) on Vulkan with num_gpu=1; (B) reduced NUM_PREDICT from 512 to 128 to probe whether the crash time scales with workload length. 2つの追加実験を実施: (A) 別モデル（phi4-mini:latest）を Vulkan num_gpu=1 で実行; (B) NUM_PREDICT を 512 → 128 に削減し、クラッシュ時間がワークロード長に比例するか確認。

run_id	Model	NUM_PREDICT	num_gpu	Status状態	elapsed_sec	eval_count
run_20260308_212345	phi4-mini:latest	512	0	✅ ok	18.82 s	318
run_20260308_212345	phi4-mini:latest	512	1	✅ ok — no crash	31.55 s	512
run_20260308_212254	qwen3.5:2b Q8_0	128	1	❌ error	2.858 s	—
(reference)	qwen3.5:2b Q8_0	512	1	❌ error	2.653 s (mean)	—

NUM_PREDICT 128 vs 512 crash Δ

+0.2 s

2.858 s vs 2.653 s mean — crash is NOT proportional to token budget

phi4-mini Vulkan num_gpu=1

✅ ok

31.55 s, eval_count=512 — crash is model-specific, not universal

Key finding (A): phi4-mini:latest completed inference on Vulkan with num_gpu=1 without crashing (run_20260308_212345). This directly falsifies the claim "Vulkan GPU offload always fails on gfx900." The failure is specific to qwen3.5:2b (Q8_0) — likely related to its quantization scheme, tensor layout, or memory access pattern in the Vulkan compute path. 主要知見 (A): phi4-mini:latest は Vulkan num_gpu=1 でクラッシュなしに完走（run_20260308_212345）。「gfx900 の Vulkan GPU offload は常にクラッシュする」という主張を直接反証。障害は qwen3.5:2b（Q8_0）固有 — 量子化方式・テンソルレイアウト・Vulkan compute path でのメモリアクセスパターンが原因の可能性。

Key finding (B): Reducing NUM_PREDICT 4× (512 → 128) shifted crash time by only +0.2 s. The crash occurs at a fixed point in the compute initialization sequence — independent of how many tokens the model is asked to generate. 主要知見 (B): NUM_PREDICT を 4 分の 1（512 → 128）に削減してもクラッシュ時間は +0.2 s しか変わらない。クラッシュは compute 初期化シーケンスの固定点で発生し、生成トークン数には非依存。

Easy explanation: Think of this as testing the same toy car on two roads. On one road, the car keeps running. On the other road, it crashes only when it reaches a special fast lane. That means the car itself is not simply broken; the trouble is in one specific route. Also, because another model passed that route, the problem is not “Vega always fails,” but “this backend + this model + this compute path” is the risky combination. やさしい解説: これは「同じおもちゃの車を2本の道で走らせた」ようなもの。片方の道ではずっと走れるのに、もう片方では特別な高速レーンに入った時だけ転ぶ。つまり、車そのものが全部だめなのではなく、特定の道筋に問題がある。しかも別のモデルはその道を通れたので、「Vega はいつも失敗する」ではなく、「このバックエンド + このモデル + この計算経路」の組み合わせが危ない、と分かった。

8) Static code audit — ROCm library stack (MIOpen / rocBLAS / CK / Tensile) 2026-03-13 8) 静的コード監査 — ROCm ライブラリスタック（MIOpen / rocBLAS / CK / Tensile） 2026-03-13

Did:実施: Traced all gfx900 computation paths through MIOpen, rocBLAS, hipBLASLt, Composable Kernel (CK), and Tensile source code to map which solver paths remain active and how fallbacks are structured. Source ground truth: ROCm_AMD_Repo under rocm-libraries/. MIOpen・rocBLAS・hipBLASLt・Composable Kernel（CK）・Tensile のソースコードを通じて gfx900 向けの計算経路をすべてトレースし、どの solver 経路が残存しているか・どこでフォールバックするかを確認した。根拠は ROCm_AMD_Repo 配下の rocm-libraries/。

MLIR iGEMM — explicit gfx900 exclusion (commit 2407d2f, 2021-12-22):
conv_mlir_igemm_{fwd,bwd,wrw}.cpp each contain if(StartsWith(device_name, "gfx900")) return false;.
All three files were patched in a single commit by Zhuoran Yin (AMD), citing a private LLVM repo issue llvm-project-private/issues/389.
(This issue is not publicly accessible; its content cannot be verified from outside. Accordingly, what can be stated here is limited to the observable reference relationship and gating pattern in the public codebase.) MLIR iGEMM — gfx900 を明示除外（コミット 2407d2f, 2021-12-22）:
conv_mlir_igemm_{fwd,bwd,wrw}.cpp の全 3 ファイルにそれぞれ if(StartsWith(device_name, "gfx900")) return false; が存在。
AMD 社員によるコミットで 3 ファイルを一括変更し、llvm-project-private/issues/389 が参照されていることが git log から確認できる。
（この issue は非公開であり、本文は外部から確認できない。公開側から確認できるのは参照関係と gating の痕跡のみ。）

// conv_mlir_igemm_fwd.cpp : 188 (same pattern in bwd.cpp:68, wrw.cpp:69) if(StartsWith(device_name, "gfx900")) { // Refer to https://github.com/ROCm/llvm-project-private/issues/389 return false; }

Solver / pathソルバ / 経路	gfx900 statusgfx900 状態	Source referenceソース根拠
`ConvMlirIgemm{Fwd\|Bwd\|WrW}`	excluded	conv_mlir_igemm_*.cpp:68-188
`ConvMlirIgemm{Fwd\|Bwd\|WrW}Xdlops`	excluded (XDLops guard)	implicitgemm_util.hpp:101-105
`ConvAsmImplicitGemmV4R1Dynamic{Fwd\|Bwd\|WrW}`	gfx900/gfx906 only	conv_asm_implicit_gemm_*_v4r1_dynamic.cpp:142-343
`ConvBinWinograd3x3U / ConvBinWinogradRxS*`	gfx900 permitted	conv_bin_wino*.cpp:61-270
`ConvWinoRxS (v21 path)`	gfx900/906 priority	conv_winoRxS.cpp:210
`CK inner_product<int8x4_t>` — dot4 fallback	sequential accumulate	inner_product.hpp:179-201
Tensile ISA (9,0,0) `v_dot4_i32_i8`	False	AsmCaps.py:128,155-159
rocBLAS `getLazyLoadingArch()`	gfx900 explicit map	tensile_host.cpp:238-240
rocBLAS hipBLASLt → Tensile fallback	multi-stage fallback	tensile_host.cpp:1232,1161
rocBLAS target list (CMake)	ROCm 5.6–7.1 maintained	rocblas/CMakeLists.txt:80-85

Result — three-layer structure confirmed (code_verified):
Maintain (build): gfx900 remains in rocBLAS CMake target lists through at least ROCm 7.1.
Manage (selection): MIOpen uses a candidate-list + IsApplicable filter — no explicit if-else fallback chain; old-generation paths remain in the registry and survive when applicable conditions are met.
Supplement (fallback): rocBLAS carries a multi-stage runtime fallback (hipBLASLt failure → Tensile retry; XF32 failure → FP32); CK carries a dot4-absent sequential accumulation path; Tensile carries per-ISA capability gating. 結果 — 3 層構造をコードで確認（code_verified）:
維持（build）: gfx900 は少なくとも ROCm 7.1 まで rocBLAS CMake ターゲットリストに残存。
管理（selection）: MIOpen は「候補列挙 + IsApplicable フィルタ」方式を採用 — 明示的な if-else フォールバックではなく、旧世代経路はレジストリに残り、適用条件を満たした場合に選択される。
補充（fallback）: rocBLAS は多段実行時フォールバック（hipBLASLt 失敗 → Tensile 再試行、XF32 失敗 → FP32）を実装済み。CK は dot4 非対応時の逐次積和経路を持つ。Tensile は ISA ごとの能力ゲートで制御。

9) Vega64 — FP32 solver selection confirmed on real hardware 2026-03-13 9) Vega64 実機 — FP32 solver 選択の動的確認 2026-03-13

Did:実施: Ran MIOpen convolution driver cases on Vega64 (ROCm 7.2 · MIOpen 3.5.1 · gfx900:xnack-), capturing solver names and kernel names via MIOPEN_ENABLE_LOGGING=1 MIOPEN_LOG_LEVEL=6. Also probed FP16 and BFP16 to observe dtype-driven solver switching. Vega64 実機（ROCm 7.2・MIOpen 3.5.1・gfx900:xnack-）で MIOpen conv driver を実行し、MIOPEN_ENABLE_LOGGING=1 MIOPEN_LOG_LEVEL=6 でソルバ名・カーネル名を収集。dtype による solver 分岐確認のため FP16・BFP16 も追加測定。

Case	dtype	Solver selected選択 solver	Kernel nameカーネル名	Verify検証
3×3, s1, N32 C64 K64	FP32	`ConvBinWinograd3x3U`	miopenSp3AsmConv3x3F	✅ ok
1×1, s1, N32 C64 K64	FP32	`ConvAsm1x1U`	miopenGcnAsmConv1x1U	✅ ok
3×3, s2, N32 C64 K128	FP32	`ConvHipImplicitGemmV4R1Fwd`	gridwise_conv_implicit_gemm_v4r1_nchw_kcyx_nkhw_lds_double_buffer	✅ ok
3×3, s1, N32 C64 K64 g=2	FP32	`ConvBinWinogradRxSf2x3`	miopenSp3AsmConv_v21_1_3_gfx9_fp32_f2x3_stride1	✅ ok
3×3, s1, N16 C64 K64	FP16	`ConvOclDirectFwd`	—	✅ 4.32 ms
3×3, s1, N16 C64 K64	BFP16	`GemmFwdRest`	—	✅ 5.41 ms

Key findings (runtime_verified):

ConvMlirIgemm* was not selected in any FP32 case — consistent with the static code exclusion (Step 8).
Three distinct legacy ASM/HIP paths were observed in natural selection: Binary Winograd, direct ASM 1×1, and HIP implicit GEMM v4r1.
dtype switching (FP32→FP16→BFP16 on the same shape) produced three different solver families — the selection system is sensitive to dtype, not only shape.
All four FP32 cases passed forward verify: fallback_confirmed.

主要知見（runtime_verified）:

FP32 全ケースで ConvMlirIgemm* は選択されなかった — 静的コード解析（Step 8）の除外条件と整合。
自然選択で Binary Winograd・直接 ASM 1×1・HIP implicit GEMM v4r1 の 3 系統が観測された。
同一形状で dtype を FP32→FP16→BFP16 に変更すると solver family が変わり、dtype も選択に影響することが実測で判明。
FP32 全 4 ケースが forward verify 通過: fallback_confirmed。

10) INT8 deep probe — forced solvers, failure taxonomy, and disassembly 2026-03-13 10) INT8 深掘り — 強制 solver 実行・失敗分類・逆アセンブル 2026-03-13

Did:実施: Ran 9+ natural INT8 convolution cases of varying shapes/layouts, then used MIOpen's -S <solver> flag to forcibly invoke each solver family that static analysis suggested might be relevant. Extracted the generated kernel binary from the MIOpen user DB and disassembled it to check for dot-product instructions. 形状・レイアウトを変えた 9 ケース以上の INT8 Conv を自然実行した後、静的解析で候補となった各 solver を MIOpen の -S <solver> フラグで強制起動した。MIOpen ユーザDB からカーネルバイナリを取り出して逆アセンブルし、dot 系命令の有無を確認した。

Natural selection result: All 9+ INT8 cases selected ConvDirectNaiveConvFwd regardless of shape, layout (NCHW/NHWC), stride, or channel count. No non-naive INT8 solver was selected naturally. 自然選択の結果: 形状・レイアウト（NCHW/NHWC）・ストライド・チャンネル数を変えた全 9 ケース以上で ConvDirectNaiveConvFwd が選択された。naive 以外の INT8 solver は自然選択では観測できなかった。

Forced solver強制 solver	Failure mode失敗モード	Exit code / error終了コード / エラー
`ConvAsmImplicitGemmV4R1DynamicFwd_1x1`	GPU memory access faultGPU メモリアクセス fault	Memory access fault by GPU node-1Memory access fault by GPU node-1
`ConvMlirIgemmFwd`	MIIR invalid paramMIIR パラメータ不正	`MIIR_INVALID_PARAM · rc=0x7`
`ConvCkIgemmFwdV6r1DlopsNchw`	Not applicable (15 cases)適用不可（15 ケース）	`rc=0x3`
`ConvHipImplicitGemmFwdXdlops`	Assertion abort (INT8 + FP16 + BFP16)Assertion abort（INT8・FP16・BFP16）	`std::vector::operator[] __n < size() · EXIT=134`
`ConvHipImplicitGemmForwardV4R5Xdlops`	Code object build failed (INT8 + FP16 + BFP16)コードオブジェクトビルド失敗（INT8・FP16・BFP16）	`intrin_mfma_* / gcnasm_mfma_* undeclared · rc=0x7`
`ConvHipImplicitGemmGroupFwdXdlops`	Not applicable適用不可	`rc=0x3`

Disassembly (INT8 naive kernel from MIOpen user DB): 逆アセンブル（MIOpen ユーザ DB からの INT8 naive カーネル）:

# Extract from gfx900_64.ukdb (sqlite3) → bzip2 decompress → ELF64 amdgpu # Symbol: naive_conv_ab_nonpacked_fwd_nchw_int8_t_int32_t_int8_t llvm-objdump -d --triple=amdgcn ... v_dot4_i32_i8 → NOT FOUND v_dot4c_i32_i8 → NOT FOUND sdot4 → NOT FOUND v_mul_i32_i24 → present (element-wise multiply) v_add_u32 → present (accumulate)

Key findings:

The INT8 naive kernel on gfx900 performs element-wise multiply + accumulate (v_mul / v_add), not dot-product instructions — consistent with the CK inner_product.hpp dot4-absent fallback path confirmed in static analysis.
The Xdlops family produces two distinct failure modes depending on the solver: vector-bound assertion abort vs. MFMA intrinsic compile failure — the failure boundary is solver-family-level, not dtype-level.
MLIR iGEMM reaches the compile stage but fails with MIIR_INVALID_PARAM at parameter lowering — the static exclusion (return false) is the correct guard; forced execution falls further down.

主要知見:

gfx900 上の INT8 naive カーネルは dot 系命令を使わず、要素ごとの乗算（v_mul）と加算（v_add）で積和計算を実行 — 静的解析で確認した CK inner_product.hpp の dot4 不在時フォールバックと整合。
Xdlops 系は solver によって 2 種類の失敗モード（ベクタ境界 assertion abort / MFMA intrinsic コンパイル失敗）を示し、失敗境界は dtype 軸ではなく solver ファミリー軸にある。
MLIR iGEMM はコンパイル段階まで進むが、パラメータ降格フェーズで MIIR_INVALID_PARAM で落ちる。静的解析の return false が正しい保護ガードであり、強制実行するとさらに下まで落ちることが直接確認できた。

11) 2026-03-14 follow-up — CIFS bottleneck resolved, local Debug MIOpen build succeeded 2026-03-14 11) 2026-03-14 追試 — CIFS ボトルネック解消、ローカル Debug MIOpen ビルド成功 2026-03-14

Did:実施: Diagnosed the previous 12h+ CMake stall as CIFS I/O wait, recloned the MIOpen source tree to WD-Black NVMe, then rebuilt a local Debug MIOpen with /opt/rocm/llvm/bin/clang++, -DMIOPEN_USE_MLIR=Off, -DMIOPEN_USE_COMPOSABLEKERNEL=Off, and a small half compatibility patch. 前日までの 12 時間超 CMake 停滞を CIFS I/O wait と診断し、MIOpen ソースを WD-Black NVMe に clone し直したうえで、/opt/rocm/llvm/bin/clang++、-DMIOPEN_USE_MLIR=Off、-DMIOPEN_USE_COMPOSABLEKERNEL=Off、および小さな half 互換パッチでローカル Debug MIOpen を再ビルドした。

Before

12h+

CMake configure stuck on CIFS

After move

7.5-9.6s

Configure on NVMe-local source

Build

ninja MIOpen / MIOpenDriver / install

Smoke test

FP32 OK

Local Debug MIOpen forward conv succeeded

# Root cause isolation cmake (previous day) → State: D (uninterruptible sleep on CIFS I/O) # After recloning to WD-Black NVMe cmake -G Ninja ... → configure done in 7.5-9.6 sec ninja MIOpen → OK ninja MIOpenDriver → OK ninja install → OK

Build blockers removed: CIFS source I/O was the main reason the earlier build looked "mysteriously impossible." Additional fixes were still needed: system GCC could not handle --offload-arch=gfx900, CK emitted v_fmac_f32 (gfx906+), and half 2.2.x required a compatibility patch. 外したビルド障害: 以前の「なぜか絶対に通らない」感触の主因は CIFS 上のソース I/O だった。加えて、システム GCC は --offload-arch=gfx900 を扱えず、CK は v_fmac_f32（gfx906+）を吐き、half 2.2.x には互換パッチが必要だった。

Result: A local Debug MIOpen build for gfx900 is feasible when storage placement and toolchain constraints are handled correctly. This is important because it shifts the diagnosis from "Vega cannot even be rebuilt" to "specific optional paths (MLIR / CK) are the problematic layer." 結果: ストレージ配置とツールチェーン制約を正しく処理すれば、gfx900 向けローカル Debug MIOpen は十分ビルド可能だと分かった。これは「Vega はそもそも再ビルド不能」ではなく、「問題は特定の optional path（MLIR / CK）に集中している」へ診断を前進させる。

Follow-up test:追試: Re-ran forced ConvMlirIgemmFwd on the system MIOpen for both INT8 and FP32 to clarify what happens after bypassing the normal applicability guard. 通常の適用ガードをバイパスした後に何が起きるかを明確にするため、システム MIOpen 上で ConvMlirIgemmFwd 強制実行を INT8 / FP32 の両方で再試行した。

# Forced MLIR follow-up (system MIOpen, 2026-03-14) CompileSolution: ConvMlirIgemmFwd GetInvoker: ConvMlirIgemmFwd Perf Db: record not found MIOpen(HIP): Warning ... boost::optional::get() Assertion ... terminated

Interpretation: -S can bypass IsApplicable() and push MLIR iGEMM into deeper stages. Once there, gfx900 still lacks complete downstream support: earlier runs failed at miirLowerTuningParams, and the 2026-03-14 follow-up exposed a second failure boundary at missing Perf DB tuning records. The failure is therefore more specific than a vague top-level "MLIR mystery." 解釈: -S は IsApplicable() をバイパスして MLIR iGEMM をさらに深い段階へ押し込める。しかしその先では gfx900 向け下流支援がまだ欠けている。初期追試では miirLowerTuningParams で失敗し、2026-03-14 の追試では Perf DB の tuning record 不在という別の失敗境界も見えた。つまり失敗像は、漠然とした「MLIR がダメ」よりずっと具体的になった。

12) 2026-03-15 follow-up — GitHub chronology, legacy repos, and PR-context synthesis 2026-03-15 12) 2026-03-15 追補 — GitHub 年表、legacy repo、PR 文脈の統合 2026-03-15

Did:実施: Extended the investigation from runtime/code tracing into GitHub-side history. Cross-read current ROCm clones, retired/deprecated repos, and the public PR/issue context behind MIOpen decisions #1231, #1328, and issue #1204. runtime / code tracing から一段進み、GitHub 側の履歴調査へ拡張した。current ROCm clone、retired / deprecated repo、そして MIOpen の判断点 #1231・#1328・issue #1204 の public PR / issue 文脈を照合した。

# GitHub-side anchors recovered on 2026-03-15 MIOpen #1231 → public issue #1204 → gfx900:sramecc-:xnack- target-name failure → WORKAROUND_ISSUE_1204 MIOpen #1328 → private llvm-project-private#389 → gfx900 MLIR non-xdlops disable + ctest/test-surface split legacy repos → ROCR-Runtime -> rocm-systems / Tensile -> rocm-libraries / ROCm/vllm -> upstream

Result: The public record now supports a more precise reading: #1231 is a defensive userspace workaround for a user-visible failure, while #1328 is a release/tuning-surface retreat whose core rationale remains private. In parallel, retired/deprecated repos show that repository topology and legacy knowledge can persist even while support visibility retreats. 結果: public 記録だけでも、より精密な読み分けが可能になった。#1231 は user-visible failure に対する defensive userspace workaround、#1328 は core rationale が private に残るまま進んだ release / tuning surface 側の後退として読める。並行して、retired / deprecated repo は repo topology の再編と legacy knowledge の残存も観測できることを示した。

Interpretation: At this point, gfx900 is better read not as a simple yes/no support case but as an observation point for ROCm's layered support, staged retreat, and mixed ownership structure. The public side still cannot recover the private contents of llvm-project-private#389, so the deeper root cause remains outside the observable record. 解釈: 現時点では、gfx900 は単純な support yes/no の事例というより、ROCm の layered support、staged retreat、主体分離を観測しやすい地点として読むほうが自然である。ただし public 側からは llvm-project-private#389 の本文を回収できず、より深い root cause は観測範囲の外に残る。

Scope note: This step adds GitHub chronology and PR-context synthesis, but it does not reconstruct the contents of private issues or internal decision-making. The public pages rocm-history.html and reveal-hypothesis.html carry the detailed wording for this phase. 範囲注記: この段階で GitHub 年表と PR 文脈の統合までは進んだが、private issue の本文や社内意思決定を再構成するものではない。詳細な文言整理は rocm-history.html と reveal-hypothesis.html に反映している。

13) 2026-03-18 to 2026-03-19 follow-up — INT8 route disambiguation and local driver provenance 2026-03-18-19 13) 2026-03-18〜2026-03-19 追補 — INT8 経路の切り分けと local driver provenance 2026-03-18-19

Did:実施: Continued the INT8 investigation beyond the 2026-03-13 forced-solver taxonomy. Added direct MIOpen probes with explicit y=int32 output descriptors, standard FindConvolutionForwardAlgorithm + ConvolutionForward checks, standalone rocblas-bench gemm_ex backend probes, and a later build-provenance check for the local Debug MIOpenDriver. 2026-03-13 時点の「強制 solver 実行と失敗分類」で止めず、INT8 調査をさらに継続した。明示的な y=int32 出力 descriptor を与える direct MIOpen probe、standard の FindConvolutionForwardAlgorithm + ConvolutionForward、standalone rocblas-bench gemm_ex backend probe、さらに local Debug MIOpenDriver の build provenance 確認を追加した。

# INT8 route narrowing, 2026-03-18 to 2026-03-19 direct query / immediate → id=89 visible when y=int32 standard Find/Forward → GEMM runs successfully when y=int32 is explicit MIOpenDriver --out_cast_type → not equivalent (...INT8-F_coFP32, casted-tensor path) rocblas-bench gemm_ex → INT8→INT32 backend success on gfx900 local Debug MIOpenDriver → built from miopen-src@f842c61d not from current public standalone clone

Key findings:

The practical INT8 route is narrower than the 2026-03-13 picture suggested, but it is not absent: with explicit x=int8, w=int8, y=int32 descriptors, the same installed MIOpen library exposes solution_id = 89, and both direct immediate execution and the standard higher-level C API succeed.
The legacy-looking MIOpenDriver convint8 --out_cast_type flags do not recreate that route. int32 is rejected as a token, while fp32 becomes a separate cast-aware problem (...INT8-F_coFP32) that still rejects the GEMM family.
Standalone rocblas-bench gemm_ex INT8 probes on gfx900 also succeed, so the blocked point is not the same thing as “no INT8 GEMM backend exists.”
A later provenance check narrowed one apparent source/runtime mismatch: the local Debug MIOpenDriver used for cross-checking was built from a separate checkout, miopen-src@f842c61d, which still retains the cast-aware driver path.

主要知見:

INT8 の practical route は 2026-03-13 時点の像より狭いが、不在ではない。明示的に x=int8, w=int8, y=int32 descriptor を与えると、同じ installed MIOpen library でも solution_id = 89 が見え、direct immediate だけでなく standard の higher-level C API からも成功する。
legacy 風の MIOpenDriver convint8 --out_cast_type は、この route を再現しない。int32 は token として reject され、fp32 は別の cast-aware problem（...INT8-F_coFP32）になって GEMM family をなお拒否する。
standalone の rocblas-bench gemm_ex INT8 probe も gfx900 で成功し、閉じている点は「INT8 GEMM backend が存在しない」ことそのものではないと確認できた。
さらに provenance check により、cross-check に使った local Debug MIOpenDriver は別 checkout の miopen-src@f842c61d からビルドされており、この checkout が cast-aware driver path を保持していたことも分かった。

Interpretation: The investigation moved from “INT8 looks absent on gfx900” to a more precise layered reading: the direct y=int32 route exists, the cast-aware driver path is a different problem, the backend itself can run, and one part of the source/runtime mismatch was explained by local build provenance rather than by a contradiction inside a single public tree. 解釈: 調査は「gfx900 では INT8 が無いように見える」から、より層別の読みへ進んだ。すなわち direct y=int32 route 自体は存在し、cast-aware driver path は別 problem であり、backend 単体も実行でき、さらに source/runtime の食い違いの一部は単一 public tree 内の矛盾ではなく local build provenance の差で説明できる。

Scope note: This step narrows the interpretation of the INT8 boundary, but it still does not fully establish the provenance of the installed /opt/rocm/bin/MIOpenDriver or reconstruct private-side intent. 範囲注記: この段階で INT8 境界の読みはかなり狭まったが、installed /opt/rocm/bin/MIOpenDriver の provenance 全体や private 側の意図まではまだ確定していない。

Experiment History 実験の流れ

1) Problem framing & environment isolation 1) 問題設定と環境統制

2) Build-time & runtime gate analysis 2) ビルド時・実行時ゲート解析

3) Semantic trace of `num_gpu` through 9 source points 3) `num_gpu` の意味を9ファイルにわたって追跡

4) First matched backend comparison 2026-03-07 4) 初回バックエンド比較実験 2026-03-07

5) Failure localization — SIGSEGV stack trace & Vulkan classification 5) 故障局在化 — SIGSEGV スタックトレースと Vulkan 分類

6) Reproduction & quantitative crash-timing analysis 2026-03-08 6) 再現確認とクラッシュ timing 定量分析 2026-03-08

7) Model-dependency probe & crash-phase localization 2026-03-08 7) モデル依存性の探索とクラッシュ段階の絞り込み 2026-03-08

8) Static code audit — ROCm library stack (MIOpen / rocBLAS / CK / Tensile) 2026-03-13 8) 静的コード監査 — ROCm ライブラリスタック（MIOpen / rocBLAS / CK / Tensile） 2026-03-13

9) Vega64 — FP32 solver selection confirmed on real hardware 2026-03-13 9) Vega64 実機 — FP32 solver 選択の動的確認 2026-03-13

10) INT8 deep probe — forced solvers, failure taxonomy, and disassembly 2026-03-13 10) INT8 深掘り — 強制 solver 実行・失敗分類・逆アセンブル 2026-03-13

11) 2026-03-14 follow-up — CIFS bottleneck resolved, local Debug MIOpen build succeeded 2026-03-14 11) 2026-03-14 追試 — CIFS ボトルネック解消、ローカル Debug MIOpen ビルド成功 2026-03-14

12) 2026-03-15 follow-up — GitHub chronology, legacy repos, and PR-context synthesis 2026-03-15 12) 2026-03-15 追補 — GitHub 年表、legacy repo、PR 文脈の統合 2026-03-15

13) 2026-03-18 to 2026-03-19 follow-up — INT8 route disambiguation and local driver provenance 2026-03-18-19 13) 2026-03-18〜2026-03-19 追補 — INT8 経路の切り分けと local driver provenance 2026-03-18-19

本文書が主張しないこと / This document does not claim that...

Experiment History 実験の流れ

1) Problem framing & environment isolation 1) 問題設定と環境統制

2) Build-time & runtime gate analysis 2) ビルド時・実行時ゲート解析

3) Semantic trace of num_gpu through 9 source points 3) num_gpu の意味を9ファイルにわたって追跡

4) First matched backend comparison 2026-03-07 4) 初回バックエンド比較実験 2026-03-07

5) Failure localization — SIGSEGV stack trace & Vulkan classification 5) 故障局在化 — SIGSEGV スタックトレースと Vulkan 分類

6) Reproduction & quantitative crash-timing analysis 2026-03-08 6) 再現確認とクラッシュ timing 定量分析 2026-03-08

7) Model-dependency probe & crash-phase localization 2026-03-08 7) モデル依存性の探索とクラッシュ段階の絞り込み 2026-03-08

8) Static code audit — ROCm library stack (MIOpen / rocBLAS / CK / Tensile) 2026-03-13 8) 静的コード監査 — ROCm ライブラリスタック（MIOpen / rocBLAS / CK / Tensile） 2026-03-13

9) Vega64 — FP32 solver selection confirmed on real hardware 2026-03-13 9) Vega64 実機 — FP32 solver 選択の動的確認 2026-03-13

10) INT8 deep probe — forced solvers, failure taxonomy, and disassembly 2026-03-13 10) INT8 深掘り — 強制 solver 実行・失敗分類・逆アセンブル 2026-03-13

11) 2026-03-14 follow-up — CIFS bottleneck resolved, local Debug MIOpen build succeeded 2026-03-14 11) 2026-03-14 追試 — CIFS ボトルネック解消、ローカル Debug MIOpen ビルド成功 2026-03-14

12) 2026-03-15 follow-up — GitHub chronology, legacy repos, and PR-context synthesis 2026-03-15 12) 2026-03-15 追補 — GitHub 年表、legacy repo、PR 文脈の統合 2026-03-15

13) 2026-03-18 to 2026-03-19 follow-up — INT8 route disambiguation and local driver provenance 2026-03-18-19 13) 2026-03-18〜2026-03-19 追補 — INT8 経路の切り分けと local driver provenance 2026-03-18-19

本文書が主張しないこと / This document does not claim that...

3) Semantic trace of `num_gpu` through 9 source points 3) `num_gpu` の意味を9ファイルにわたって追跡