Solver Trace — gfx900 Computation Paths in ROCm

1) MIOpen solver selection — candidate list + IsApplicable filter 1) MIOpen solver 選択メカニズム — 候補列挙 + IsApplicable フィルタ

MIOpen does not use an explicit if-else fallback chain. Instead, it maintains a global solver registry and filters it through IsApplicable() at runtime. MIOpen は明示的な if-else フォールバックチェーンを使わない。代わりに、グローバルな solver レジストリを保持し、実行時に IsApplicable() でフィルタリングする方式。

// miopen/src/conv/solver_finders.cpp : 98-110 if(!parameters.use_winograd_only && !env::disabled(MIOPEN_DEBUG_CONV_IMPLICIT_GEMM)) { FindAllImplicitGemmSolutions(...) // FWD / BwdData FindImplicitGemmWrWAllSolutions(...) // BwdWrW } // miopen/src/include/miopen/find_solution.hpp : 326 / 381 / 451 else if(!solver.IsApplicable(ctx, problem)) { log("Not applicable"); continue; } // miopen/src/problem.cpp : 573 / 625 if(!solver->IsApplicable(ctx, problem_description)) log("Not applicable");

Design implication: Old-generation solver entries remain registered even as new solvers are added. On gfx900, newer solvers fall out via IsApplicable, and legacy solvers with explicit gfx900 whitelisting survive the filter — this is how the "maintain" layer works. 設計上の含意: 新しい solver が追加されても旧世代 solver のエントリはレジストリに残る。gfx900 では新しめの solver が IsApplicable で落ち、gfx900 を明示的に許可している旧世代 solver が生き残る構造 — これが「維持」層の仕組み。

2) MLIR iGEMM — gfx900 explicitly excluded (commit 2407d2f, Dec 2021) 2) MLIR iGEMM — gfx900 を明示除外（コミット 2407d2f、2021年12月）

Three files — conv_mlir_igemm_fwd.cpp:188, conv_mlir_igemm_bwd.cpp:68, conv_mlir_igemm_wrw.cpp:69 — each contain an explicit early-return for gfx900. 3 ファイル — conv_mlir_igemm_fwd.cpp:188、conv_mlir_igemm_bwd.cpp:68、conv_mlir_igemm_wrw.cpp:69 — それぞれに gfx900 向けの early-return が含まれる。

// conv_mlir_igemm_fwd.cpp : 188 (identical pattern in bwd.cpp:68, wrw.cpp:69) if(StartsWith(device_name, "gfx900")) { // Refer to https://github.com/ROCm/llvm-project-private/issues/389 return false; }

Commitコミット	Date日付	Author作者	Messageメッセージ
2407d2f	2021-12-22	Zhuoran Yin (AMD)	`[MLIR] Disable gfx900 from non-xdlops solver (#1328)`

Note on the cited issue: llvm-project-private/issues/389 refers to AMD's internal (private) LLVM repository — not the public MIOpen #389 or public llvm-project #389. The issue reference suggests an internal LLVM-side reason for the original exclusion, but the details are not publicly accessible. A 2026-03-14 runtime follow-up also exposed missing gfx900 tuning records after forced guard bypass, so the public evidence does not support collapsing the whole problem to a single visible cause. 参照 issue について: llvm-project-private/issues/389 は AMD 社内の非公開 LLVM リポジトリ — 公開 MIOpen #389 や公開 llvm-project #389 とは別物。この参照は「元の除外理由のどこかに LLVM 側事情があった」ことを示唆するが、詳細は外部から確認できない。さらに 2026-03-14 の追試では、強制実行後に gfx900 向け tuning record 不在も露出しており、公開情報だけで原因を単一レイヤに還元することはできない。

（この issue は非公開であり、本文は外部から確認できない。公開側から確認できるのは参照関係と gating の痕跡のみ。）

Runtime confirmation: Forced execution via miopen-driver -S ConvMlirIgemmFwd on Vega64 confirms that the static guard is meaningful. Earlier runs hit miirLowerTuningParams: MIIR_INVALID_PARAM → RunForwardGPU() FAILED, rc=0x7. A 2026-03-14 follow-up on both INT8 and FP32 reached CompileSolution → GetInvoker and then failed with Perf Db: record not found → boost::optional::get() assertion. In other words, -S can bypass IsApplicable(), but downstream MLIR/tuning support for gfx900 remains incomplete. 実機確認: Vega64 で miopen-driver -S ConvMlirIgemmFwd を強制実行すると、静的ガードが有意味であることが逆に確認できる。初期の追試では miirLowerTuningParams: MIIR_INVALID_PARAM → RunForwardGPU() FAILED, rc=0x7。さらに 2026-03-14 の追試では INT8/FP32 の両方で CompileSolution → GetInvoker まで進んだ後、Perf Db: record not found → boost::optional::get() assertion で停止した。つまり -S は IsApplicable() をバイパスできるが、その先の MLIR/tuning 支援は gfx900 で未完成のままである。

2026-03-15 addendum — Miir gating mechanism is publicly traceable: The full Miir C API implementation was found in the public ROCm/rocMLIR repository (mlir/tools/rocmlir-lib/rocmlir-lib.cpp). The chain is: miirCreateHandle (parseConvConfig → isApplicable → RockEnabled layout/dtype whitelist) → miirLowerTuningParams (rock::buildKernelPipeline in Applicability mode → MIIR_BUILD_FAILURE on failure). MIOpen's MiirIsConfigApplicable is simply the MIIR_SUCCESS check on miirLowerTuningParams. This means the gating mechanism itself is public, even though the root cause (#389) remains private. 2026-03-15 追補 — Miir gating メカニズムは公開追跡可能: 公開 ROCm/rocMLIR リポジトリ（mlir/tools/rocmlir-lib/rocmlir-lib.cpp）から、Miir C API の実装全体を追跡できることを確認した。チェーンは: miirCreateHandle（parseConvConfig → isApplicable → RockEnabled layout/dtype ホワイトリスト）→ miirLowerTuningParams（rock::buildKernelPipeline の Applicability モード → 失敗時 MIIR_BUILD_FAILURE）。MIOpen 側の MiirIsConfigApplicable は miirLowerTuningParams の MIIR_SUCCESS 判定のみ。つまり gating メカニズム自体は public だが、根拠（#389）は依然 private。

3) ASM implicit GEMM v4r1 dynamic — explicit gfx900 / gfx906 whitelist 3) ASM implicit GEMM v4r1 dynamic — gfx900/gfx906 明示ホワイトリスト

Four solver files carry the same conditional: StartsWith(device_name, "gfx900") || StartsWith(device_name, "gfx906"). 4 つの solver ファイルが同一の条件を持つ: StartsWith(device_name, "gfx900") || StartsWith(device_name, "gfx906")。

// conv_asm_implicit_gemm_v4r1_dynamic.cpp : 293 (FWD general) // conv_asm_implicit_gemm_v4r1_dynamic.cpp : 343 (FWD 1x1) // conv_asm_implicit_gemm_bwd_v4r1_dynamic.cpp : 142 (BWD) // conv_asm_implicit_gemm_wrw_v4r1_dynamic.cpp : 306 (WRW) return StartsWith(device_name, "gfx900") || StartsWith(device_name, "gfx906"); // ... plus shape / padding / stride conditions

Runtime confirmation: ConvHipImplicitGemmV4R1Fwd (the HIP counterpart) was selected by MIOpen on Vega64 for a 3×3 stride-2 FP32 case (verified OK). Note that ConvAsmImplicitGemmV4R1DynamicFwd_1x1 forced on an INT8 shape triggered a GPU memory access fault — shape and dtype conditions must be met for safe execution. 実機確認: Vega64 の FP32 3×3 stride-2 ケースで ConvHipImplicitGemmV4R1Fwd（HIP 版相当）が MIOpen に選択され、verify OK となった。なお、INT8 形状に ConvAsmImplicitGemmV4R1DynamicFwd_1x1 を強制すると GPU メモリアクセス fault が発生した — 安全な実行には形状と dtype の条件が重要。

4) Winograd and legacy ASM paths — survive for gfx900 4) Winograd・旧 ASM 系 — gfx900 向けに生存

Multiple Winograd solvers and legacy fixed-shape ASM kernels explicitly include or prefer gfx900. 複数の Winograd solver と旧式固定形状 ASM カーネルが gfx900 を明示的に含む・優先する。

Solverソルバ	gfx900 handlinggfx900 の扱い	Sourceソース
`ConvBinWinograd3x3U`	permitted許可	conv_bin_wino3x3U.cpp:61
`ConvBinWinogradRxS*`	permitted; v21 path prioritized for gfx900/906許可; v21 経路が gfx900/906 で優先	conv_bin_winoRxS.cpp:265, conv_winoRxS.cpp:210
`ConvMPBidirectionalWinograd`	gfx900/906/908 limitedgfx900/906/908 限定	conv_MP_bidirectional_winograd.cpp:202
`ConvWinoFuryRxS` (multipass WrW)	gfx8/900/906/908/90a branchesgfx8/900/906/908/90a 分岐	conv_multipass_wino3x3WrW.cpp:490,501
`ConvAsm5x10u2v2f1`, `ConvAsm7x7c3h224*`	fixed-shape legacy ASM permitted固定形状 legacy ASM が許可	conv_asm_5x10u2v2f1.cpp:68, conv_asm_7x7*.cpp:76
`ConvAsm1x1U`	permitted (gfx8/9 family)許可（gfx8/9 系）	conv_asm_1x1u.cpp

Runtime confirmation: On Vega64, natural solver selection yielded ConvBinWinograd3x3U (3×3 s1 FP32), ConvAsm1x1U (1×1 s1 FP32), and ConvBinWinogradRxSf2x3 (3×3 grouped FP32) — all verified OK. 実機確認: Vega64 での自然選択では ConvBinWinograd3x3U（3×3 s1 FP32）・ConvAsm1x1U（1×1 s1 FP32）・ConvBinWinogradRxSf2x3（3×3 group FP32）が選択され、いずれも verify OK。

Maintainer timeline (git blame): The Winograd solvers were largely introduced by Artem Tamazov (artem.tamazov@gmail.com) starting from 2017. AMD-affiliated staff have continued patching them: Gleb Larochkin (glarochk@amd.com) patched conv_bin_wino3x3U.cpp in September 2021; Evgenii Averin added an explicit gfx9 instruction-limit comment to conv_MP_bidirectional_winograd.cpp on 2025-03-11. These paths are not simply abandoned code — AMD attention was observed as recently as early 2025. 維持者時系列（git blame）: Winograd solver 群は主に Artem Tamazov（artem.tamazov@gmail.com）が 2017 年以降に投入した。その後も AMD 関連スタッフが補修を続けており、Gleb Larochkin（glarochk@amd.com）が 2021-09 に conv_bin_wino3x3U.cpp を補修。Evgenii Averin は 2025-03-11 に conv_MP_bidirectional_winograd.cpp へ gfx9 命令制限の明示コメントを追加している。これらの経路は「放置されたコード」ではなく、2025 年初頭時点でも AMD 側から作業が入っている。

5) XDLops solver family — systematically excluded for gfx900 5) XDLops solver ファミリー — gfx900 で系統的に除外

All XDLops-family solvers share a common gate via implicitgemm_util.hpp:101-105. The default IsXdlopsSupport() lists only gfx908 / gfx90a / gfx942 / gfx950 — gfx900 is absent. 全 XDLops 系 solver は implicitgemm_util.hpp:101-105 の共通ゲートを持つ。既定の IsXdlopsSupport() は gfx908・gfx90a・gfx942・gfx950 のみ — gfx900 は含まれない。

// miopen/src/include/miopen/solver/implicitgemm_util.hpp : 95-105 inline bool IsXdlopsSupport(const ExecutionContext& ctx) { if(env::enabled(MIOPEN_DEBUG_CONV_IMPLICIT_GEMM_XDLOPS_EMULATE)) return true; // override: force-emulate return StartsWith(ctx.device_name, "gfx908") || StartsWith(ctx.device_name, "gfx90a") || StartsWith(ctx.device_name, "gfx942") || StartsWith(ctx.device_name, "gfx950"); // gfx900 → returns false by default }

Runtime failures observed (forced execution):
• ConvHipImplicitGemmFwdXdlops / GrpFwd → std::vector::operator[] assertion abort (EXIT=134) — same across INT8, FP16, BFP16.
• ConvHipImplicitGemmForwardV4R5Xdlops → intrin_mfma_* / gcnasm_mfma_* undeclared → code object build failed (rc=0x7) — same across INT8, FP16, BFP16.
Failure mode is solver-family-level, not dtype-level. MFMA intrinsics simply do not exist on gfx900. 実機での失敗観測（強制実行）:
• ConvHipImplicitGemmFwdXdlops / GrpFwd → std::vector::operator[] assertion abort（EXIT=134）— INT8・FP16・BFP16 で同型。
• ConvHipImplicitGemmForwardV4R5Xdlops → intrin_mfma_*・gcnasm_mfma_* 未定義 → code object ビルド失敗（rc=0x7）— INT8・FP16・BFP16 で同型。
失敗モードは dtype 軸ではなく solver ファミリー軸で決まる。gfx900 には MFMA intrinsic が存在しない。

6) Composable Kernel (CK) — dot4-absent fallback path 6) Composable Kernel（CK）— dot4 不在時フォールバック経路

composablekernel/include/ck/utility/inner_product.hpp:179-201 contains the INT8 inner product. Three branches exist: hardware dot4 (CK_USE_AMD_V_DOT4_I32_I8), GFX11 dot4, and a sequential fallback. composablekernel/include/ck/utility/inner_product.hpp:179-201 に INT8 内積の実装がある。3 分岐: ハードウェア dot4（CK_USE_AMD_V_DOT4_I32_I8）、GFX11 dot4、逐次フォールバック。

// composablekernel/include/ck/utility/inner_product.hpp : 179-201 template<> struct inner_product<int8x4_t, int8x4_t, int32_t> { #if defined(CK_USE_AMD_V_DOT4_I32_I8) // hardware dot4 path (gfx906+) c += __builtin_amdgcn_sdot4(a, b, 0, false); #elif defined(CK_USE_AMD_V_DOT4_I32_I8_GFX11) // GFX11 dot4 variant #else // fallback: sequential 4-element accumulate static_for<0, 4, 1>{}([&](auto i) { c += cast_to<int32_t>(a[i]) * cast_to<int32_t>(b[i]); }); #endif };

legacy_composable_kernel/composable_kernel/include/utility/config.hpp:51: gfx803 and gfx900 use CK_USE_AMD_V_MAC_F32; CK_USE_AMD_V_DOT4_I32_I8 is only enabled for gfx906+. legacy_composable_kernel/…/config.hpp:51: gfx803・gfx900 は CK_USE_AMD_V_MAC_F32; CK_USE_AMD_V_DOT4_I32_I8 は gfx906 以降のみ有効。

Disassembly confirmation: The INT8 naive kernel (naive_conv_ab_nonpacked_fwd_nchw_int8_t_int32_t_int8_t) extracted from the Vega64 MIOpen user DB and disassembled shows v_mul_i32_i24 + v_add_u32 operations for INT8 arithmetic — no v_dot4_i32_i8, v_dot4c_i32_i8, sdot4, or sudot4 instructions found. Consistent with the static fallback branch. 逆アセンブル確認: Vega64 MIOpen ユーザ DB から抽出した INT8 naive カーネル（naive_conv_ab_nonpacked_fwd_nchw_int8_t_int32_t_int8_t）を逆アセンブルすると、INT8 演算に v_mul_i32_i24 + v_add_u32 が使われており、v_dot4_i32_i8・v_dot4c_i32_i8・sdot4・sudot4 は未検出。静的フォールバック分岐と整合。

7) Tensile — ISA capability table gates dot4 for gfx900 7) Tensile — ISA 能力テーブルが gfx900 の dot4 を gate

shared/tensile/Tensile/AsmCaps.py contains a per-ISA capability dictionary. ISA (9, 0, 0) corresponds to gfx900 and has all dot4 capabilities set to False. shared/tensile/Tensile/AsmCaps.py には ISA ごとの能力辞書がある。ISA (9, 0, 0) が gfx900 に対応し、dot4 系能力は全て False。

# shared/tensile/Tensile/AsmCaps.py : 128 (ISA (9,0,0) = gfx900) (9, 0, 0): { ... "VOP3v_dot4_i32_i8": False, # line 155 "v_dot4_i32_i8": False, # line 158 "v_dot4c_i32_i8": False, # line 159 ... } # shared/tensile/Tensile/Code.py : 628 # "int8 not implemented yet for gfx900" ← inline comment # shared/tensile/Tensile/Common.py : 2065-2067 # v_dot4_i32_i8 / v_dot4c_i32_i8 / VOP3v_dot4_i32_i8 # → tested at assemble time to populate the capability dict

Tensile's catalog structure also carries gfx900-specific artifacts: solution-selection-catalogs.rst references both TensileLibrary_lazy_gfx900.yaml (lazy loading catalog) and ...fallback_gfx900.hsaco (fallback code object) — showing that the supplement layer is architecturally expected. Tensile のカタログ構造にも gfx900 向けアーティファクトがある: solution-selection-catalogs.rst に TensileLibrary_lazy_gfx900.yaml（lazy loading カタログ）と ...fallback_gfx900.hsaco（フォールバック code object）への参照があり、補充層が設計上想定されていることを示す。

8) rocBLAS — gfx900 lazy loading + multi-stage runtime fallback 8) rocBLAS — gfx900 lazy loading + 多段実行時フォールバック

// rocblas/library/src/tensile_host.cpp // :232-240 getLazyLoadingArch — explicit gfx900 map if(deviceString.find("gfx900") != std::string::npos) return Tensile::LazyLoadingInit::gfx900; // :1232 hipBLASLt failure → Tensile fallback log("hipBlasLT failed, falling back to tensile."); // :1161 XF32 solution not found → FP32 fallback log("No Tensile solution found for XF32, fall back to FP32"); // rocblas/CMakeLists.txt : 80-85 # TARGET_LIST_ROCM_5.6 … TARGET_LIST_ROCM_7.1 all include "gfx900"

Summary of the "supplement" layer:
① The build system keeps gfx900 in target lists through at least ROCm 7.1.
② At runtime, gfx900 is explicitly routed to its own lazy-loaded library entry.
③ If hipBLASLt fails, rocBLAS automatically falls back to Tensile.
④ If no XF32 Tensile solution is found, rocBLAS falls back to FP32.
These are independent automatic recovery mechanisms — each one ensures a computation result even when a faster path is unavailable. 「補充」層のまとめ:
① ビルドシステムは ROCm 7.1 以降もターゲットリストに gfx900 を保持。
② 実行時に gfx900 は専用の lazy loading ライブラリエントリへ明示ルーティング。
③ hipBLASLt が失敗すると、rocBLAS は自動的に Tensile にフォールバック。
④ XF32 の Tensile 解が見つからない場合、rocBLAS は FP32 にフォールバック。
それぞれ独立した自動回復メカニズムで、高速経路が使えない場合でも計算結果を保証する。

8a) INT8 alternative candidate — the concrete public-tree route is GEMM-shaped 8a) INT8 alternative candidate — 具体的に見える public-tree 経路は GEMM 系

The current public tree does not expose literal dp4a as the canonical naming. What appears in source are nearby terms such as miopenInt8x4, rocblas_gemm_flags_pack_int8x4, v_dot4_i32_i8, and __builtin_amdgcn_sdot4. current public tree では、literal な dp4a は canonical naming としては出てこない。実際に現れるのは miopenInt8x4、rocblas_gemm_flags_pack_int8x4、v_dot4_i32_i8、__builtin_amdgcn_sdot4 のような近接語である。
The most concrete currently visible INT8 forward candidate is solver/gemm.cpp:511-528 GemmFwd1x1_0_1_int8::IsApplicable(). It requires 1x1 weights, zero padding, stride 1, miopenInt8, group 1, and non-zero workspace. 現時点で最も具体的に見える INT8 forward candidate は solver/gemm.cpp:511-528 の GemmFwd1x1_0_1_int8::IsApplicable() である。条件は 1x1 weight、pad 0、stride 1、miopenInt8、group 1、workspace 非ゼロ。

// solver/gemm.cpp : 511-528 bool GemmFwd1x1_0_1_int8::IsApplicable(...) { if(!GemmFwdBase::IsApplicable(context, problem)) return false; return all_of(wei_spatial, == 1) && all_of(conv.GetConvPads(), == 0) && all_of(conv.GetConvStrides(), == 1) && wDesc.GetType() == miopenInt8 && conv.group_count == 1 && GetWorkspaceSize(context, problem) > 0; }

Backend connection: The downstream GEMM entry point is gemm_v2.cpp:475-571 CallGemm(), which can route INT8/INT8x4 work to CallGemmMIOpenTensile() or rocBLAS. The same source also exposes miopen_tensile_type_int8x4 and rocblas_gemm_flags_pack_int8x4. backend 接続: 下流の GEMM 入口は gemm_v2.cpp:475-571 の CallGemm() で、INT8/INT8x4 workload を CallGemmMIOpenTensile() または rocBLAS へ振り分ける。同じソースには miopen_tensile_type_int8x4 と rocblas_gemm_flags_pack_int8x4 も現れる。

Boundary of this claim: The source-level candidate is code_verified, but the 2026-03-18 Vega64 runtime follow-up did not promote it into a practical route. Natural selection and -s 1 still stayed on ConvDirectNaiveConvFwd; forcing -S GemmFwd1x1_0_1_int8 resolved solution id 89 and then failed as not applicable; and an only-solver search reported GetWorkspaceSizes / SearchForAllSolutions as not applicable. MLIR still rejects gfx900 at ConvMlirIgemmFwd::IsApplicable(); the exposed forward CK path remains FP32/FP16-only; and Tensile's gfx900 capability table keeps all dot4 flags disabled. この主張の境界: source-level の candidate 自体は code_verified だが、2026-03-18 の Vega64 runtime follow-up でも practical route にはならなかった。自然選択と -s 1 は依然として ConvDirectNaiveConvFwd に留まり、-S GemmFwd1x1_0_1_int8 は solution id 89 解決後に not applicable で停止し、only-solver search でも GetWorkspaceSizes / SearchForAllSolutions が not applicable を返した。MLIR は依然として ConvMlirIgemmFwd::IsApplicable() で gfx900 を reject し、露出している forward CK path は FP32/FP16 限定、Tensile の gfx900 capability table でも dot4 flags は無効のままである。

Direct query / immediate follow-up: Small probes against the same installed MIOpen library narrow the boundary further. With x=int8, w=int8, and y=int8, only solution 85 is visible and Workspace(id=89) returns bad parameter. With y=int32, the same library returns two solutions, including id=89 with workspace=200704; and a direct CompileSolution + ForwardImmediate probe for id=89 succeeds, with the first output values matching the expected 64. This suggests that the tested convint8 blockage is at least partly on the driver/output-type path, not on solver/backend absence alone. direct query / immediate の追補: 同じ installed MIOpen library へ当てた小さな probe で、境界はさらに狭まる。x=int8・w=int8・y=int8 では visible solution は 85 のみで、Workspace(id=89) は bad parameter を返す。一方 y=int32 にすると、同じ library が 2 件の solution を返し、その中に workspace=200704 を持つ id=89 が現れる。さらに id=89 に対する direct な CompileSolution + ForwardImmediate probe も成功し、先頭出力は期待どおり 64 で揃った。したがって今回の convint8 tested case の閉塞点は、少なくとも solver/backend 不在そのものではなく、driver/output-type path にも強く寄っていると読める。

Driver cast-flag follow-up: The installed MIOpenDriver convint8 --help still exposes legacy-style --out_cast_type, but this does not recreate the direct y=int32 path. In the legacy driver source, valid cast tokens are limited to fp32/fp16/bf16/fp8/bf8; int32 is rejected at parse time. Retrying the tested case with --out_cast_type fp32 is accepted, but the output tensor stays INT8 with cast metadata, the problem key changes to ...INT8-F_coFP32, and the GEMM family reports GEMM not supported with casted tensors on this GPU architecture. So the legacy-looking cast flag is a different boundary from the direct y=int32 descriptor route. driver cast flag の追補: installed MIOpenDriver convint8 --help には legacy-style の --out_cast_type がまだ見えるが、これは direct y=int32 path の再現にはならない。legacy driver source 上の valid cast token は fp32/fp16/bf16/fp8/bf8 に限られ、int32 は parse 段階で reject される。tested case を --out_cast_type fp32 で再試行すると token 自体は受理されるが、output tensor は INT8 のまま cast metadata が付く形で、problem key は ...INT8-F_coFP32 に変わり、GEMM family は GEMM not supported with casted tensors on this GPU architecture を返す。したがって、この legacy 風 cast flag は direct y=int32 descriptor route とは別境界として読むべきである。

Standard Find/Forward follow-up: The direct y=int32 path is not immediate-only. A small probe using the standard higher-level C API with explicit y=int32 descriptors also succeeds: miopenFindConvolutionForwardAlgorithm() returns GEMM plus Direct for exhaustive=0 and GEMM only for exhaustive=1; in both cases, miopenConvolutionForward() executes successfully, and the outputs again match the expected 64. This pushes the current blocked point even closer to the installed driver-side descriptor assembly. standard Find/Forward の追補: direct y=int32 path は immediate-only ではない。明示的に y=int32 descriptor を組んだ small probe で standard の higher-level C API を試すと、miopenFindConvolutionForwardAlgorithm() は exhaustive=0 で GEMM と Direct、exhaustive=1 で GEMM のみを返し、どちらの条件でも miopenConvolutionForward() は成功し、出力は再び期待どおり 64 で一致した。したがって現在の閉塞点は、installed driver 側の descriptor assembly にさらに寄る。

Local debug-build provenance follow-up: One apparent mismatch turned out to be a provenance issue. The local Debug MIOpenDriver used for cross-checking was not built from the current public standalone clone under ROCm-repos/MIOpen; its build cache points to a separate detached checkout, /home/limonene/ROCm-project/WD-Black/miopen-src at f842c61d. That checkout still retains the cast-aware convint8 driver path with --out_cast_type and an INT8 output tensor plus cast metadata. So the discrepancy between the current public driver source and the local debug binary is at least partly explained by build provenance. This does not by itself establish the provenance of the installed /opt/rocm/bin/MIOpenDriver. local debug build provenance の追補: 一見 source と runtime が食い違って見えた点の一部は provenance の問題だった。cross-check に使っていた local Debug MIOpenDriver は、current public standalone clone の ROCm-repos/MIOpen からではなく、別の detached checkout である /home/limonene/ROCm-project/WD-Black/miopen-src（f842c61d）からビルドされていた。この checkout は、--out_cast_type を持つ cast-aware convint8 driver path と、INT8 output tensor に cast metadata を後付けする挙動をなお保持している。したがって current public driver source と local debug binary の不一致は、少なくとも build provenance の差でも説明できる。ただし、これだけで installed /opt/rocm/bin/MIOpenDriver の provenance を確定することはできない。

Installed package provenance follow-up: On this host, the installed /opt/rocm/bin/MIOpenDriver is owned by the distro package miopen-hip 7.2.0-1. The stripped binary and libMIOpen.so both embed debug/source paths under /usr/src/debug/miopen-hip/rocm-libraries/projects/miopen/..., while the installed miopen/version.h reports MIOpen 3.5.1.5b515cf1bca-dirty. This narrows the installed lineage further: on this machine the driver is not read most naturally as a direct build of the current standalone clone under ROCm-repos/MIOpen, but as something closer to a cast-aware rocm-libraries/projects/miopen family. However, the exact source commit behind this packaged binary is still not established from local evidence alone, and this host-specific Arch package observation should not be generalized into a claim about every ROCm distribution. installed package provenance の追補: この host では、installed /opt/rocm/bin/MIOpenDriver は distro package の miopen-hip 7.2.0-1 に属している。stripped binary と libMIOpen.so の両方には /usr/src/debug/miopen-hip/rocm-libraries/projects/miopen/... という debug/source path が埋め込まれており、installed の miopen/version.h は MIOpen 3.5.1.5b515cf1bca-dirty を示す。したがって、この host 上の installed driver は current standalone clone の ROCm-repos/MIOpen の直接 build と読むより、cast-aware な rocm-libraries/projects/miopen family に近いものとして読む方が自然である。ただし、この packaged binary の exact source commit は local evidence だけではまだ確定できず、この Arch package 上の host-specific 観測を ROCm 配布物一般へ拡張すべきでもない。

Backend-only follow-up: The same day, standalone rocblas-bench -f gemm_ex INT8 probes on gfx900 succeeded for both a small 128x128x128 case and a conv-shaped 64x100352x64 case with norm_error_1 = 0. So the public evidence now distinguishes three layers: a visible source candidate, a blocked MIOpen convolution route on the tested case, and a standalone INT8 GEMM backend that does run on gfx900. backend 単体の追補: 同日、standalone の rocblas-bench -f gemm_ex INT8 probe は gfx900 上で、小さな 128x128x128 case と conv-shaped な 64x100352x64 case の両方に成功し、norm_error_1 = 0 を返した。したがって公開側からは、source 上に candidate が見えること、tested case では MIOpen convolution route が閉じていること、しかし standalone INT8 GEMM backend 自体は gfx900 で動くこと、この 3 層を分けて読む必要がある。

8b) Shipped Artifacts — gfx900 is distributed with tuned data and pre-compiled kernels 8b) 出荷成果物 — gfx900 はチューニング済みデータとプリコンパイル済みカーネル付きで配布されている

Key finding (2026-03-15): The ROCm 7.2 package ships pre-compiled artifacts and tuning data for gfx900 — not just source code, but actual build outputs targeting gfx900. In several metrics, gfx900 receives more shipped artifacts than RDNA 3 (gfx1100) and RDNA 2 (gfx1030). 重要発見（2026-03-15）: ROCm 7.2 パッケージには gfx900 向けのプリコンパイル済み成果物とチューニングデータが含まれている。ソースコードだけでなく、gfx900 をターゲットに指定してビルドされた実際の成果物である。複数の指標で gfx900 は RDNA 3（gfx1100）や RDNA 2（gfx1030）より多くの成果物を受けている。

Metric指標	gfx900	gfx1100	gfx1030	gfx942
MIOpen Perf DB (total lines)MIOpen Perf DB（合計行数）	169,182	none	111,296	470,080
rocBLAS pre-compiled filesrocBLAS プリコンパイルファイル数	128	96	88	242
Firmware blobsファームウェア	16	(separate)（別途）	(separate)（別途）	(separate)（別途）

# /opt/rocm/share/miopen/db/ — gfx900 Perf DB files shipped gfx900_56.HIP.fdb.txt 64,583 lines (Vega 56, 56 CU) gfx900_56.db.txt 41,835 lines gfx900_56.OpenCL.fdb.txt 1,711 lines gfx900_64.HIP.fdb.txt 59,336 lines (Vega 64, 64 CU) gfx900_64.OpenCL.fdb.txt 1,717 lines # /opt/rocm/lib/rocblas/library/ — gfx900 kernel types HH(28) ZZ(17) CC(17) SS(16) HS(16) DD(16) I8I(4) BS(4) BB(4) 4xi8I(4) # 54 of 128 files carry "fallback" in filename; 74 are tuned kernels # Note: gfx803 (Fiji) also ships Perf DB (112K lines per variant) # Note: gfx1100 (RDNA3) and gfx1200 (RDNA4) ship NO Perf DB at all

Implication: This is not residual code that "happens to still compile." A Perf DB is the result of architecture-specific tuning runs. Pre-compiled .hsaco and .co files require an explicit --offload-arch=gfx900 compilation step. Their presence in the shipped package shows that gfx900 is integrated into AMD's build-and-packaging pipeline. Whether this reflects intentional support or target-list inertia cannot be determined from outside, but the practical effect is that users receive ready-to-run gfx900 artifacts out of the box. 含意: これは「たまたままだコンパイルが通る」残留コードではない。Perf DB はアーキテクチャ固有のチューニング実行の結果である。プリコンパイル済み .hsaco / .co ファイルは明示的な --offload-arch=gfx900 コンパイル工程を必要とする。出荷パッケージへの収録は、gfx900 が AMD のビルド・パッケージングパイプラインに組み込まれていることを示す。これが意図的なサポートなのかターゲットリストの慣性なのかは外部からは判別できないが、実際の効果としてユーザは箱から出してすぐ使える gfx900 成果物を受け取っている。

rocBLAS Pre-compiled Files by Architecture (ROCm 7.2, selected) rocBLAS プリコンパイル済みファイル数（ROCm 7.2・選抜比較）

A unified bar style is used across this page: blue marks the gfx900 row, slate marks comparison architectures, and striped rows indicate zero shipped data. このページの棒グラフは共通デザインに統一した。青は gfx900 行、濃いスレートは比較対象、ストライプは出荷データ 0 を示す。

gfx900 focusgfx900 注目行 comparison arch比較対象 zero shipped data出荷データ 0

gfx900 (Vega)

128

gfx1030 (RDNA2)

gfx1100 (RDNA3)

gfx942 (MI300X)

242

Scale: 0-260 files. Values are the shipped rocblas/library pre-compiled file counts observed in the ROCm 7.2 package. スケール: 0-260 ファイル。値は ROCm 7.2 パッケージ中の rocblas/library に観測されたプリコンパイル済みファイル数。

MIOpen Perf DB Lines by Architecture (ROCm 7.2, selected) MIOpen Perf DB 行数（ROCm 7.2・選抜比較）

The same color convention is preserved here so that the gfx900 row remains easy to spot and zero-data rows are immediately visible. ここでも同じ配色ルールを使い、gfx900 行とゼロ値行が一目で読めるようにした。

gfx900 focusgfx900 注目行 comparison arch比較対象 zero shipped data出荷データ 0

gfx900 (Vega)

169k

gfx1030 (RDNA2)

111k

gfx1100 (RDNA3)

gfx942 (MI300X)

470k

Scale: 0-500 thousand lines. Values are the top observed shipped Perf DB line counts per selected architecture in ROCm 7.2. スケール: 0-500 千行。値は ROCm 7.2 で観測した、選抜アーキテクチャごとの出荷 Perf DB 行数。

9) Runtime verification summary — Vega64 (gfx900:xnack-) 9) 動的検証サマリ — Vega64（gfx900:xnack-）

Caseケース	dtype	Solver / resultソルバ / 結果	Status状態
FP32 3×3 s1 NCHW	FP32	`ConvBinWinograd3x3U`	✅ runtime_verified
FP32 1×1 s1 NCHW	FP32	`ConvAsm1x1U`	✅ runtime_verified
FP32 3×3 s2 NCHW	FP32	`ConvHipImplicitGemmV4R1Fwd`	✅ runtime_verified
FP32 3×3 s1 group=2 NCHW	FP32	`ConvBinWinogradRxSf2x3`	✅ runtime_verified
FP16 3×3 s1 NCHW	FP16	`ConvOclDirectFwd`	✅ runtime_verified
BFP16 3×3 s1 NCHW	BFP16	`GemmFwdRest`	✅ runtime_verified
Local Debug MIOpen (MLIR Off)	FP32	build + forward conv OK	✅ build_verified / runtime_verified
INT8 (all 9+ shapes/layouts)	INT8	`ConvDirectNaiveConvFwd` only	⚠ naive only — non-naive path not found
`GemmFwd1x1_0_1_int8` (natural / search / forced)	INT8 1x1	natural and `-s 1` stay naive; forced / only-solver search return not applicable	❌ candidate visible in source, not practical on this tested case
MIOpen direct solution query (`y=int32`)	INT8 1x1	`GetSolution` returns `id=89` + `id=85`; `Workspace(id=89)=200704`	✅ query-level visibility with INT32 output
MIOpen direct immediate (`y=int32`)	INT8 1x1	`CompileSolution(89)` and `ForwardImmediate(89)` succeed; first outputs are `64`	✅ execution confirmed outside default `convint8` route
`MIOpenDriver convint8 --out_cast_type fp32`	INT8 1x1	accepted as a legacy cast token, but problem becomes `...INT8-F_coFP32`; GEMM family says casted tensors not supported	❌ not equivalent to the direct `y=int32` route
`MIOpenDriver convint8` CLI surface	INT8 1x1	`--out_data_type` is not recognized; `--out_cast_type int32` is rejected at parse time; only legacy cast-type flags are visible	⚠ no obvious direct CLI syntax for the explicit `y=int32` route
MIOpen standard `Find + Forward` (`y=int32`)	INT8 1x1	`FindConvolutionForwardAlgorithm` returns GEMM as fastest for `exhaustive=0` and as the sole result for `exhaustive=1`; `ConvolutionForward` succeeds	✅ higher-level API route confirmed with explicit INT32 output
rocBLAS `gemm_ex` (standalone)	INT8 → INT32	`128x128x128` and `64x100352x64` succeed on gfx900	✅ backend-only INT8 GEMM confirmed
INT8 naive kernel disassembly	INT8	no `v_dot4*` / `sdot4`	✅ dot4-absent fallback confirmed
MLIR iGEMM (forced)	INT8 / FP32	`MIIR_INVALID_PARAM` or `Perf Db: record not found` → `boost::optional::get()`	❌ downstream failure after guard bypass
ASM v4r1 1×1 (forced, INT8)	INT8	GPU memory access fault	❌ fault
CK DLOPS (forced, 15 cases)	INT8	not applicable rc=0x3	❌ not applicable
HipImplicitGemm Xdlops (forced)	INT8/FP16/BFP16	assertion abort / compile fail	❌ fail (mfma not present)

Solver Trace — gfx900 Computation Paths in ROCm Solver トレース — ROCm における gfx900 計算経路