ggml-hip

What you'll gain here: libggml-hip.so's .hip_fatbin contains custom kernels for gfx1200/gfx1201 but none for gfx900 (MI25). This asymmetry determines the "worldview" for each generation: custom kernels on one side, BLAS fallback on the other.

Figure: libggml-hip.so, .hip_fatbin (587 MiB). gfx1200 / gfx1201: targets confirmed in the fatbin. gfx900 / gfx942: not found in the fatbin.

Page Conclusion

A string scan confirmed gfx1200 / gfx1201 target strings embedded in libggml-hip.so's .hip_fatbin (587 MiB).
The same scan found no gfx900 / gfx942 target strings. At the ggml-hip level, this directly supports the "fallback as main stage" worldview for MI25 (gfx900).
For gfx1201, MMVQ / MMQ / Flash Attention / RoPE custom kernel families exist and were confirmed via disassembly (wavefront_size=32, no v_mfma found in the inspected window).
Which bundle was used in the actual live case cannot be determined without a dispatch-safe observer.

Position in the ROCm Stack

ggml-hip is the HIP-specialized backend implemented on top of ggml's backend abstraction. It is built with HIP from the same source family as the CUDA-side ggml-cuda (ggml-cuda.cu, mmq.cu, mmvq.cu, etc.). vendors/hip.h remaps cublasCreate and related CUDA APIs to their hipBLAS equivalents, which lets both backends share one codebase.

Ollama runner
↓
ml.NewBackend("ggml") → ggml_backend_load_all_from_path()
↓
libggml-hip.so (loaded with priority from /usr/local/lib/ollama/rocm/)
↓
GGML_OP_MUL_MAT dispatch
    ne11 ≤ 8 → mul_mat_vec_q (custom hsaco / MMVQ)
    ne11 ≤ 256, Q4_K → mul_mat_q (custom hsaco / MMQ)
    otherwise → hipBLAS → rocBLAS
↓
Flash Attn → flash_attn_ext_f16 (custom hsaco)
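The MUL_MAT branching in the diagram above can be sketched as a small decision function. This is only an illustration of the thresholds shown on this page: `select_mul_mat_path`, `Path`, and the two constants are hypothetical names, and the real backend checks more conditions (supported quantization types, architecture, tensor layout) than this sketch does.

```python
from enum import Enum


class Path(Enum):
    MMVQ = "mul_mat_vec_q"       # custom hsaco, small ne11 (decode-like)
    MMQ = "mul_mat_q"            # custom hsaco, quantized batches
    BLAS = "hipBLAS -> rocBLAS"  # fallback path


# Thresholds copied from the diagram; treat them as assumptions,
# not as constants read out of the ggml sources.
MMVQ_MAX_NE11 = 8
MMQ_MAX_NE11 = 256


def select_mul_mat_path(ne11: int, src0_type: str) -> Path:
    """Return the path the diagram predicts for a given ne11 and weight type."""
    if ne11 <= MMVQ_MAX_NE11:
        return Path.MMVQ
    if ne11 <= MMQ_MAX_NE11 and src0_type == "Q4_K":
        return Path.MMQ
    return Path.BLAS
```

For example, `select_mul_mat_path(1, "Q4_K")` lands on MMVQ, a 128-row Q4_K batch lands on MMQ, and anything larger than 256 rows goes to the BLAS path. The sketch predicts a branch only; it says nothing about which bundle was actually dispatched in a live run.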

→ Next question: For which GPU generations do the custom kernels used by these dispatch branches exist?

Cross-Generation: .hip_fatbin Targets

Observation point: Which GPU generations have kernels embedded in libggml-hip.so's .hip_fatbin.
Context: Presence or absence here determines, per generation, whether it runs on custom kernels or relies primarily on BLAS fallback.
Source: Target string scan via strings -a /usr/local/lib/ollama/rocm/libggml-hip.so (confirmed on live hardware in Phase D)
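A target string scan of this kind can also be reproduced without `strings`. The sketch below is an assumption about how such a scan might be scripted, not the procedure used for the observation: it counts gfx names that follow the `amdgcn-amd-amdhsa--` prefix seen in bundle entry IDs such as `hipv4-amdgcn-amd-amdhsa--gfx1201`. The helper names `scan_targets` and `scan_file` are hypothetical.

```python
import mmap
import re
from collections import Counter

# Matches the target suffix of bundle entry IDs, e.g.
# "hipv4-amdgcn-amd-amdhsa--gfx1201" -> "gfx1201".
TARGET_RE = re.compile(rb"amdgcn-amd-amdhsa--(gfx[0-9a-f]+)")


def scan_targets(data) -> Counter:
    """Count gfx target names found in a bytes-like object."""
    return Counter(m.group(1).decode("ascii") for m in TARGET_RE.finditer(data))


def scan_file(path: str) -> Counter:
    """Scan a shared library on disk without reading it fully into memory."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return scan_targets(mm)
```

Calling `scan_file("/usr/local/lib/ollama/rocm/libggml-hip.so")` would return a counter keyed by target name; a missing key such as `gfx900` reads as a count of 0. Counts are indicative only, since a byte that happens to be a hex digit directly after an ID would be attached to the name.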

Generation	GFX	Fatbin targets	Custom kernel	Worldview implication
GCN5 / MI25	`gfx900`	Not found	None	BLAS fallback is the only matrix compute path, via Tensile / rocBLAS only. This is the evidence that fallback is the main stage.
RDNA3 / RX7900	`gfx1100` family	Not in this scan	n/a	May be in the same bundle as gfx1200/gfx1201 (unconfirmed)
RDNA4 / Navi 44 (e.g. RX 9060 XT)	`gfx1200`	Confirmed	Present (same bundle family as gfx1201)	Same custom kernel paths as gfx1201 available
RDNA4 / RX9070XT	`gfx1201`	Confirmed	Present (MMVQ / MMQ / FA / RoPE)	Custom kernels available. Which path is actually used depends on ne11 and observation conditions
CDNA3 / MI300X	`gfx942`	Not found	None (in this bundle)	The Ollama ROCm bundle may not embed gfx942 custom kernels for MI300X. Whether a separate build exists is unconfirmed.

Important: The absence of MMVQ / MMQ custom kernel paths for gfx900 (MI25) is direct code-level evidence for the "fallback as main stage" worldview. On MI25, the BLAS route is the only matrix compute path.

→ Next question: What specific bundles are embedded for gfx1201?

gfx1201 Bundle Details (Confirmed on Live RX9070XT)

Observation point: Concrete kernel families in the gfx1201 bundles and their roles in dispatch.

The .hip_fatbin section is 587 MiB. It is a stream of concatenated __CLANG_OFFLOAD_BUNDLE__ entries, not a single per-target blob.
Presence of the hipv4-amdgcn-amd-amdhsa--gfx1201 target was confirmed via clang-offload-bundler --list.
All gfx1201-specific bundles use wavefront_size = 32. No v_mfma instructions were detected in the current inspection window.
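Because the section is a concatenation of bundles rather than one blob, a first step in any extraction is locating each `__CLANG_OFFLOAD_BUNDLE__` marker and reading its entry table. The following is a minimal sketch, assuming the uncompressed header layout described in the Clang offload bundler documentation (24-byte magic, a 64-bit entry count, then per entry a 64-bit offset, size, and ID length followed by the ID string, all little-endian). Compressed bundles use a different header and are not handled. `find_bundles` and `read_entries` are hypothetical helper names.

```python
import struct

MAGIC = b"__CLANG_OFFLOAD_BUNDLE__"  # 24-byte marker that starts each bundle


def find_bundles(data: bytes) -> list[int]:
    """Return the byte offset of every bundle magic in the section."""
    offsets = []
    pos = data.find(MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(MAGIC, pos + len(MAGIC))
    return offsets


def read_entries(data: bytes, start: int) -> list[tuple[str, int, int]]:
    """Parse one uncompressed bundle header into (entry_id, offset, size)."""
    pos = start + len(MAGIC)
    (count,) = struct.unpack_from("<Q", data, pos)
    pos += 8
    entries = []
    for _ in range(count):
        # Offsets are relative to the start of this bundle (assumption).
        offset, size, id_len = struct.unpack_from("<QQQ", data, pos)
        pos += 24
        entry_id = data[pos:pos + id_len].decode("ascii")
        pos += id_len
        entries.append((entry_id, offset, size))
    return entries
```

Filtering the parsed entries for IDs ending in `gfx1201` gives the candidate code objects for that target. Mapping them to labels such as bundle_0030 depends on the numbering used in the original extraction, which this sketch does not reproduce.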

Bundle	Size	Key symbols	Role
`bundle_0012`	173 KB	`dequantize_block_q4_K<float>`	Q4_K dequant (v_fma:130, global_load:325)
`bundle_0014`	210 KB	`cpy_f32_q<q4_0>`	Quantized copy / transpose helper
`bundle_0019`	337 KB	`flash_attn_ext_f16<...>`	Flash Attention (many template variants)
`bundle_0030`	1019 KB	`mul_mat_vec_q<Q4_K,1,false>`	Q4_K MMVQ anchor (primary candidate for the decode path); sgpr:32–34, vgpr:26–61
`bundle_0037`	17 KB	`quantize_mmq_q8_1<layout0/1/2>`	MMQ repack helper
`bundle_0039`	332 KB	`rope_multi/rope_neox/rope_norm`	RoPE family (sgpr:19–29, vgpr:19–20)
`bundle_0096`	918 KB	`mul_mat_q<Q4_K,8,false>`	Q4_K MMQ anchor (primary candidate for the prefill / large-batch path); sgpr:36, vgpr:51

What Can and Cannot Be Shown

Can show	Cannot show
gfx1200 / gfx1201 custom kernels exist in the fatbin (confirmed by string scan)	Which bundle was actually dispatched in the live case (no dispatch-safe observer)
gfx900 / gfx942 target strings are absent from the fatbin (string scan)	Whether a separate Ollama build for MI300X includes gfx942 custom kernels
Disassembly of bundle_0030 (MMVQ), bundle_0096 (MMQ), and bundle_0019 (FA) confirmed, with wavefront_size=32	Full coverage of the fatbin: the .hip_fatbin scan was bounded, and unextracted clusters may remain in the later part

Open Questions

The actual participation ratio of each bundle in the live short case cannot be determined without a dispatch-safe observer.
Some Flash Attention entries are helper trampolines, so the simple reading of one symbol as one kernel body does not hold there.
Whether a separate Ollama bundle for gfx942 (MI300X) includes custom kernels.
If gfx900 runs on BLAS fallback alone, what throughput does it reach? (Quantifiable with rocprofv3 on live MI25 hardware.)

Next Observation Points

RX9070XT Path Observation: MMVQ / MMQ / BLAS dispatch branching, checked via phase proxy
rocBLAS: the recipient of BLAS fallback on gfx900 / gfx942
Observer Constraint Map: why direct observation is not possible

Content is based on observation logs. Open questions will be updated as findings are confirmed.