Glossary

A dictionary of key terms from ROCm, GPU architecture, and the software stack.
Start with the green "In plain terms" boxes and you can follow the meaning with no prior expertise.
Read in order: key point → 8 essential terms → detailed dictionary → deep-dive terms.

STEP 1: Start here

Start here: The most important point of this investigation is that gfx900 often cannot use newer fast features, but ROCm's capability-based selection and fallback (backup route) mechanisms mean it can still compute on older paths in some cases.
In other words, gfx900 is neither "fully alive" nor "fully dead": only the paths it can still use remain.

STEP 2: The 8 essential terms

The 8 most important terms on this page

Understanding these 8 terms gives you the backbone of the investigation.

Essential

ROCm

The big toolbox for running AMD GPUs

In one sentence: ROCm is the software stack for AMD GPUs.
It bundles drivers, runtimes, libraries, and compiler tools.

Relation to this investigation: We are studying which parts of ROCm still work for gfx900.

Essential

gfx900

The chip code for Radeon RX Vega 56 / 64

In one sentence: The slightly older AMD GPU at the center of this investigation.
ROCm identifies it by the code gfx900, not by the product name "Vega."

Key point: It lacks newer fast features, but some older execution paths still work.

Essential

MIOpen

The deep-learning "convolution engine"

In one sentence: A library that accelerates common deep-learning operations on GPUs.
Its convolution implementations are central to this investigation.

Relation to gfx900: Some solvers still work on gfx900, while others do not.

Essential

rocBLAS

The matrix-computation engine

In one sentence: A GPU library for fast matrix and vector operations.
It is heavily used in AI and scientific computing.

Relation to gfx900: ROCm 7.2 ships precompiled files for gfx900.

Essential

solver

A "way of solving" an operation

In one sentence: The same operation can be implemented in more than one way.
MIOpen keeps several solvers and picks the one that fits the current GPU.

Examples: Winograd / ASM / MLIR iGEMM / DirectNaive. Winograd and ASM are the workhorses on gfx900.

Essential

capability-based selection

Choosing by what the GPU can do, not by its name

In one sentence: Instead of saying "this GPU is Vega, so it fails,"
ROCm often picks the usable paths from the capabilities the GPU has, such as whether xdlops or dot4 is present.

Key point: This design makes it more likely that some paths remain usable on gfx900.

Essential

fallback

The backup path when the fast one is unavailable

In one sentence: When a newer fast path is unavailable, ROCm falls back to a more general one.
It may be slower, but it keeps the computation running.

Relation to gfx900: gfx900 survives in part because these fallback paths still exist.

Essential

Perf DB

A table that remembers which solver is fastest

In one sentence: Data recording the best solver for each combination of GPU and operation shape.
MIOpen consults it to select fast solvers.

Key point: That gfx900 Perf DB entries still ship in ROCm packages is very strong evidence that gfx900 remains in AMD's build and tuning pipeline.

Fast path vs. fallback: a simple map

flowchart LR
    A["Your code"] --> B["ROCm stack"]
    B --> C["Fast modern path"]
    B --> D["Older fallback path"]
    C --> E["Needs newer GPU features\n(xdlops, dot4 …)"]
    D --> F["Can still work on gfx900"]
    style A fill:#ddeeff,stroke:#3366aa
    style C fill:#E3F2FD,stroke:#1565C0
    style D fill:#FFF3E0,stroke:#E65100
    style E fill:#BBDEFB,stroke:#1565C0
    style F fill:#C8E6C9,stroke:#2E7D32

What this means: ROCm often has both a fast path built for newer GPUs and a slower path that older GPUs can still get through. gfx900 tends to stop on the former, but can still run on the latter.

STEP 3: Detailed dictionary

GPU architecture generations at a glance

gfx numbers are chip codes. ROCm identifies GPUs by these codes, not by marketing names.

timeline
    title AMD GPU Generations in ROCm context
    section GCN (Graphics Core Next)
        2017 : gfx900 - Vega10, Radeon RX Vega 56/64
        2018 : gfx906 - Vega20, Radeon VII, adds FP16 + dot4
    section CDNA (datacenter compute)
        2020 : gfx908 - MI100, first xdlops matrix units
        2021 : gfx90a - MI200 (CDNA2)
        2023 : gfx942 - MI300X (CDNA3)
    section RDNA (gaming / workstation)
        2021 : gfx1030 - RDNA2, Radeon RX 6000 series
        2022 : gfx1100 - RDNA3, Radeon RX 7000 series
        2024 : gfx1200 - RDNA4, Radeon RX 9000 series

The subject of this investigation: gfx900 is a GCN-generation GPU from 2017. ROCm officially does not support it, but reading the source code confirms that it still works on many paths.

Hardware terms

Hardware

gfx900

Also: Vega10 / Radeon RX Vega 56, 64

Architecture code for AMD's 5th-generation GCN (Graphics Core Next). The Radeon RX Vega series, released in 2017, is the main representative. It has been officially unsupported since ROCm 5.x, but many paths still work as a by-product of capability-based design.

Capabilities: FP32 ✓ / FP16 partial / dot4 (INT8) ✗ / xdlops ✗

Hardware

gfx906

Also: Vega20 / Radeon VII / Instinct MI50

In plain terms: gfx900's close sibling. Compared with gfx900, its half-precision floating-point (FP16) math is faster, and it gains an integer dot-product instruction (dot4). Some MIOpen solvers allow gfx900 and gfx906 together, treating them as siblings.

The successor to gfx900, with strengthened FP16 support and added dot4 (INT8 dot-product) instructions. MIOpen's ASM v4r1 solver explicitly allows gfx900 and gfx906 as a pair.

Additions over gfx900: dot4 / dp4a (INT8), better FP16

Hardware

gfx908

Also: CDNA1 / Instinct MI100

In plain terms: The first GPU to carry a dedicated "matrix-computation engine" for AI. It contains special circuitry called xdlops, which gfx900 lacks, and can process large batches of the matrix multiplications that matter for AI training. Many ROCm optimizations were designed with this GPU as their starting point.

The first generation of CDNA (Compute DNA), the architecture AMD created for datacenter compute. It was the first to include xdlops (MFMA, matrix fused multiply-accumulate instructions), which accelerate matrix multiplication. Many ROCm optimization paths target gfx908 and later.

Key: First xdlops (MFMA). MIOpen's MLIR/XDLops solver families target gfx908 and later.

Hardware

xdlops

Also: MFMA (Matrix Fused Multiply-Accumulate)

In plain terms: Ultra-fast circuitry dedicated to matrix multiplication. Separate from the ordinary compute units, it processes the matrix products common in AI in bulk. Vega (gfx900) has no such circuitry, and that is the main reason many fast solvers are unavailable to it.

A specialized AMD GPU instruction family that performs a matrix multiply-accumulate in a single instruction, greatly accelerating the large matrix operations in deep learning. It is available only on CDNA architectures from gfx908 (MI100) onward.

Relation to gfx900: Not present on gfx900. This is the main reason all of MIOpen's XDLops-family solvers are non-applicable on gfx900.

Hardware

dot4 / dp4a

Also: INT8 dot-product instruction

In plain terms: An instruction dedicated to multiplying four small integers pairwise and summing the results in one go. It is used to speed up AI inference (producing answers with an already-trained model). Vega (gfx900) lacks this instruction, so INT8 inference there always ends up on the slowest path.

An instruction that computes the dot product of four 8-bit integers in one step. It is used to accelerate INT8 inference and is available from gfx906 (Vega20) onward. Because gfx900 lacks it, natural selection for INT8 on gfx900 only ever lands on DirectNaive, the slowest solver.

Finding: INT8 natural selection on gfx900 yields ConvDirectNaiveConvFwd in every case (runtime_verified)

ROCm stack terms

flowchart TD
    USER["User code\n(PyTorch / ONNX / custom)"]
    USER --> FW["ML Framework\n(PyTorch-ROCm, etc.)"]
    FW --> MIOPEN["MIOpen\nconv / pool / BN"]
    FW --> ROCBLAS["rocBLAS / hipBLASLt\nmatrix multiply"]
    MIOPEN --> HIP["HIP\nportability layer"]
    ROCBLAS --> HIP
    HIP --> ROCR["ROCr (HSA runtime)"]
    ROCR --> GPU["GPU Hardware\n(gfx900, gfx908, ...)"]
    ROCBLAS -.->|"uses"| TENSILE["Tensile\nGEMM kernel tuner"]
    MIOPEN -.->|"optional"| ROCMLIR["rocMLIR\nMLIR backend"]
    MIOPEN -.->|"optional"| CK["Composable Kernel\nhigh-perf templates"]
    style USER fill:#ddeeff,stroke:#3366aa
    style GPU fill:#ffeecc,stroke:#cc6600

Stack

ROCm

Radeon Open Compute Platform

In plain terms: The name of the "big toolbox" holding all the software needed to run AMD GPUs. It gathers the driver (which connects the GPU to the PC), compute libraries, the compiler (which translates programs), and more. It is not one giant program but a collection of many pieces of software.

The umbrella name for AMD's open-source GPU software stack: an ecosystem of drivers, runtimes, libraries, and compiler toolchains. It is not a single repository; the "ROCm repository" is a manifest (a management file) that coordinates many repositories.

Includes: drivers / HIP / MIOpen / rocBLAS / Tensile / CK / rocMLIR / ROCr, etc.

Stack

HIP

Heterogeneous-computing Interface for Portability

In plain terms: The "standard front desk" for programming AMD GPUs. It is the common interface that higher-level software uses to ask the GPU for a computation, and it absorbs the differences between GPU generations.

A portability layer and API for programming AMD GPUs. It abstracts away differences between GPU generations and implementations so that higher-level libraries and applications are less tightly tied to one specific GPU. Many ROCm compute requests pass through this layer on their way to the lower runtime.

Role: A "translation layer" that absorbs the differences between higher-level code and GPU generations

Stack

MIOpen

Machine Intelligence Open library

A library for high-performance execution of deep-learning primitives (convolution, pooling, batch normalization, etc.) on GPUs. PyTorch-ROCm uses it internally. It is the primary subject of analysis in this investigation.

Key: Keeps multiple "solvers" for the same operation and filters them at runtime by GPU capability

Stack

rocBLAS

ROCm Basic Linear Algebra Subprograms

A GPU library for matrix and vector operations (GEMM and others). Internally it uses kernels generated by Tensile. ROCm 7.2 has been confirmed to ship 128 precompiled kernel files targeting gfx900.

Confirmed: 128 precompiled files for gfx900 (ROCm 7.2)

Stack

Tensile

Auto-tuning system for rocBLAS kernels

In plain terms: rocBLAS's "kernel factory." It automatically optimizes and mass-produces matrix-multiplication (GEMM) programs for each type of GPU. The gfx900 kernels in ROCm 7.2 are also files generated by Tensile.

A system that generates and tunes matrix-multiplication (GEMM) kernels per GPU architecture for inclusion in rocBLAS. It is written in Python, and community contributors have added arch parsing for gfx900.

Investigation note: An external contributor added a fallback extension, which an AMD-affiliated contributor later reverted (PR#1862→#1879)

Stack

rocMLIR

MLIR-based AMD GPU compiler backend

In plain terms: MIOpen's "new-style compiler path." It builds GPU programs with a newer technology (MLIR), but gfx900 is blocked by a guard at the entrance, so gfx900 never gets here.

A compiler backend that uses MLIR (Multi-Level Intermediate Representation) to compile GPU kernels. MIOpen's MLIR iGEMM solvers use it. gfx900 is excluded at the IsApplicable gate and never reaches this backend.

Relation to gfx900: Explicitly excluded. MIOpen PR #1328 (commit 2407d2f) is the origin.

Concept and mechanism terms

Concept

solver

A "solution strategy" in MIOpen

In plain terms: MIOpen's "menu of ways to compute." Even for the same convolution, the best-suited way changes with the GPU and the shape of the data.

An implementation strategy MIOpen uses to execute operations such as convolution. For the same "FP32 convolution," multiple solvers are registered (Winograd, ASM v4r1, MLIR iGEMM, DirectNaive, and others), and the best one is chosen for the combination of GPU, dtype, and shape.

Selection: Filter with IsApplicable(), then pick the fastest using the Perf DB
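
The two-step selection described above (filter by applicability, then rank with Perf DB data) can be modeled in a few lines. The following is a minimal Python sketch, not MIOpen's actual C++ code: the `Solver` class, the `select_solver` helper, the applicability rules for Winograd and DirectNaive, and the timing numbers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Solver:
    name: str
    # Hypothetical stand-in for a solver's IsApplicable() check.
    is_applicable: Callable[[str], bool]


def select_solver(solvers, device_name: str, perf_db: Dict[str, float]) -> Optional[Solver]:
    """Keep applicable solvers, then pick the one with the lowest recorded time."""
    candidates = [s for s in solvers if s.is_applicable(device_name)]
    if not candidates:
        return None
    # Solvers missing from the (made-up) perf table are ranked last.
    return min(candidates, key=lambda s: perf_db.get(s.name, float("inf")))


SOLVERS = [
    # Mirrors the name-prefix gate quoted in this glossary for MLIR iGEMM.
    Solver("MLIR iGEMM", lambda dev: not dev.startswith("gfx900")),
    # Assumed to be applicable everywhere, purely for this example.
    Solver("Winograd", lambda dev: True),
    Solver("DirectNaive", lambda dev: True),
]

# Illustrative timings in milliseconds; not measured data.
PERF_DB = {"MLIR iGEMM": 0.8, "Winograd": 1.2, "DirectNaive": 9.5}

print(select_solver(SOLVERS, "gfx900", PERF_DB).name)  # Winograd
print(select_solver(SOLVERS, "gfx908", PERF_DB).name)  # MLIR iGEMM
```

On gfx900 the fastest entry in the table is filtered out before ranking, so the choice falls to the next-fastest applicable solver.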

Concept

IsApplicable()

Solver applicability check function

In plain terms: The check each solver uses to answer "can I be used on the current GPU?" It acts like a guard at the entrance: when gfx900 tries to use an MLIR-family solver, the answer is "NO (not usable)." Only solvers that pass this check are actually used.

A function each solver implements to decide whether it can handle the current combination of GPU, dtype, and shape. For gfx900, the MLIR iGEMM solver is implemented to return false explicitly (not applicable).

Key code: if(StartsWith(device_name, "gfx900")) return false; (from conv_mlir_igemm_fwd.cpp)

Concept

capability-based selection

capability-based selection

In plain terms: Choosing a path by asking "does this GPU have the feature?" rather than "is this GPU new or old?"

A design policy of selecting paths by the capabilities a GPU has (does it have xdlops? dot4?) rather than by "this GPU model works / does not work." Because ROCm adopts this design, older GPUs such as gfx900 still work within the limits of their capabilities.

How this investigation reads it (basis for Hypothesis B): gfx900 still working in places can be read as a natural by-product of capability-based design.
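
To make the contrast with name-based checks concrete, here is a small Python sketch. The capability flags follow the hardware entries in this glossary (gfx900 has neither dot4 nor xdlops, gfx906 adds dot4, gfx908 adds xdlops); the path names and the `usable_paths` helper are hypothetical and do not correspond to real ROCm APIs.

```python
# Capability sets taken from this glossary's hardware entries.
CAPABILITIES = {
    "gfx900": {"fp32"},
    "gfx906": {"fp32", "dot4"},
    "gfx908": {"fp32", "dot4", "xdlops"},
}

# Hypothetical execution paths and the capabilities each one requires.
PATH_REQUIREMENTS = {
    "xdlops_matrix_path": {"fp32", "xdlops"},
    "int8_dot4_path": {"dot4"},
    "generic_fp32_path": {"fp32"},
}


def usable_paths(arch: str) -> list:
    """Return the paths whose required capabilities the GPU provides."""
    caps = CAPABILITIES[arch]
    # A path is usable when its requirements are a subset of the GPU's capabilities.
    return sorted(p for p, req in PATH_REQUIREMENTS.items() if req <= caps)


for arch in CAPABILITIES:
    print(arch, usable_paths(arch))
```

No branch mentions gfx900 by name, yet gfx900 keeps the generic path because it satisfies that path's requirements.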

Concept

fallback

Alternate path / fallback

In plain terms: Taking a somewhat slower ordinary road when the fast dedicated road is closed. You still get a result, but it tends to take extra time.

The mechanism that automatically switches to a more general (but slower) path when a fast one is unavailable. rocBLAS has a multi-stage fallback: hipBLASLt → Tensile → FP32. Older GPUs such as gfx900 can in many cases complete their computation through these paths.

Confirmed: gfx900 FP32 conv → ConvBinWinograd3x3U and ConvAsm1x1U are selected naturally (runtime_verified)
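
A fallback chain amounts to trying backends in priority order and using the first one that accepts the request. The Python sketch below illustrates only that idea; the backend names, the `make_backend` and `run_with_fallback` helpers, and the capability table are hypothetical and are not rocBLAS or MIOpen internals.

```python
# Extra capabilities per architecture, following this glossary's hardware entries.
CAPS = {"gfx900": set(), "gfx906": {"dot4"}, "gfx908": {"dot4", "xdlops"}}


class Unsupported(Exception):
    """Raised by a backend that cannot handle the request."""


def make_backend(name: str, required: set):
    """Build a hypothetical backend that refuses GPUs lacking `required`."""
    def backend(arch: str) -> str:
        if not required <= CAPS[arch]:
            raise Unsupported(f"{name} needs {sorted(required)}")
        return name
    return backend


def run_with_fallback(arch: str, chain) -> str:
    """Try each backend in order and return the first result that succeeds."""
    for backend in chain:
        try:
            return backend(arch)
        except Unsupported:
            continue  # fall through to the next, more general backend
    raise RuntimeError(f"no backend could handle {arch}")


# Ordered from fastest and most demanding to slowest and most general.
CHAIN = [
    make_backend("xdlops_path", {"xdlops"}),
    make_backend("dot4_path", {"dot4"}),
    make_backend("generic", set()),
]

print(run_with_fallback("gfx900", CHAIN))  # generic
print(run_with_fallback("gfx908", CHAIN))  # xdlops_path
```

Because the last backend requires nothing, every architecture in the table completes the request; only the speed of the chosen path differs.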

Concept

Perf DB

Performance Database

In plain terms: Something like a report card that remembers "for this GPU and this shape, this approach is fast."

A database in which MIOpen stores the optimal tuning parameters for each combination of GPU and operation shape. It ships inside ROCm packages. The gfx900 Perf DB in ROCm 7.2 has been confirmed at 169,182 lines, more than RDNA2 (gfx1030: 111,296), while no Perf DB exists for RDNA3 (gfx1100).

Implication: The presence of a Perf DB is evidence of inclusion in AMD's build and tuning pipeline
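
Line counts like these can be gathered with a short script. The sketch below is one way to do it in Python. It assumes, purely for illustration, that the Perf DB is a set of text files whose names begin with the architecture code; that layout and the `count_perf_db_lines` helper are hypothetical rather than MIOpen's documented format.

```python
from pathlib import Path


def count_perf_db_lines(db_dir, arch: str) -> int:
    """Sum the line counts of every file in db_dir whose name starts with arch."""
    total = 0
    for path in sorted(Path(db_dir).iterdir()):
        if path.is_file() and path.name.startswith(arch):
            with path.open("r", encoding="utf-8", errors="replace") as fh:
                total += sum(1 for _ in fh)
    return total


def line_ratio(lines_a: int, lines_b: int) -> float:
    """How many times larger the first count is than the second."""
    return lines_a / lines_b


# Using the counts reported in this glossary (gfx900 vs. gfx1030):
print(round(line_ratio(169_182, 111_296), 2))  # 1.52
```

By these figures the gfx900 Perf DB is roughly one and a half times the size of the gfx1030 one.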

Concept

staged retreat

Gradual, stepwise withdrawal

In plain terms: Paths for an old GPU do not all vanish one day; they dwindle a little each year.

The process of shrinking support for older hardware bit by bit, component by component and layer by layer, rather than cutting it off all at once. In ROCm, hipCUB removed gfx900 from its default build while rocBLAS continues to build for it, and MIOpen's Winograd paths received fixes as recently as 2025.

How this investigation reads it (basis for Hypothesis 5): The timeline reads as a gradual thinning of paths, not a single cutoff.

Concept

HSACO

HSA Code Object / AMD GPU executable binary

In plain terms: An "executable file" for AMD GPUs, the GPU counterpart of a ".exe file" on a PC. That HSACOs built for gfx900 are in ROCm packages can be used as evidence that AMD was deliberately building for gfx900.

A compiled kernel binary format for AMD GPUs (.hsaco extension). A gfx900-targeted HSACO is generated only when the code is compiled with --offload-arch=gfx900. The presence of gfx900 HSACOs in the ROCm 7.2 packages is evidence of an active build pipeline.

Disassembly: Readable with llvm-objdump. This investigation used it to check for the presence or absence of v_dot4_* instructions.
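
The dot4 check boils down to searching disassembly text for `v_dot4_*` mnemonics. The Python sketch below does this on a string of disassembly. How that text is produced (for example with llvm-objdump) is left out, and the `find_dot4_instructions` helper and the sample lines are illustrative assumptions, not output captured from a real HSACO.

```python
import re

# Matches mnemonics such as v_dot4_i32_i8 at a word boundary.
DOT4_PATTERN = re.compile(r"\bv_dot4_\w+")


def find_dot4_instructions(disassembly: str) -> list:
    """Return the distinct v_dot4_* mnemonics found, in sorted order."""
    return sorted(set(DOT4_PATTERN.findall(disassembly)))


# Hand-written sample lines for illustration only.
sample_with_dot4 = """
    v_mov_b32 v0, 0
    v_dot4_i32_i8 v0, v1, v2, v0
    s_endpgm
"""
sample_without_dot4 = """
    v_mov_b32 v0, 0
    v_fma_f32 v0, v1, v2, v0
    s_endpgm
"""

print(find_dot4_instructions(sample_with_dot4))     # ['v_dot4_i32_i8']
print(find_dot4_instructions(sample_without_dot4))  # []
```

An empty result for a gfx900-targeted code object would be consistent with the glossary's statement that gfx900 lacks dot4.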

STEP 4: Deep-dive terms

Deep-dive terms

Supplementary terms for reading deeper into the investigation.

Deep-dive

GCN

Graphics Core Next: the architecture family gfx900 belongs to

In one sentence: AMD's foundational GPU design from roughly 2012 to 2018.
gfx900 (Vega) belongs to its final generation; the line then split into its successors, CDNA (compute) and RDNA (gaming).

Key: GCN lacks xdlops, but FP32 operations and Winograd kernels remain functional.

Deep-dive

CDNA

Compute DNA: the newer architecture for datacenter compute

In one sentence: AMD's GPU design for AI and HPC, branched off from GCN.
It includes gfx908 (MI100), gfx90a (MI200), and gfx942 (MI300X), and it has xdlops.

Difference from gfx900: It has xdlops, and most of MIOpen's fast solvers target CDNA and later.

Deep-dive

RDNA

Radeon DNA: the newer architecture for gaming and workstations

In one sentence: AMD's gaming-oriented GPU design, also branched off from GCN.
It includes gfx1030 (RDNA2), gfx1100 (RDNA3), and gfx1200 (RDNA4). ROCm compute support is less complete than for CDNA.

Notable: gfx1100 (RDNA3) has 0 Perf DB lines, while gfx900 has 169,182.

Deep-dive

legacy path

An old execution path that still exists

In one sentence: A path built for an older generation that was never deleted, even though newer architectures have their own optimized paths.
It is one of the main reasons gfx900 can still run.

Examples: ConvAsm1x1U, ConvBinWinograd3x3U. Both are runtime_verified on gfx900.

Deep-dive

default build target

The difference between "in the source" and "built by default"

In one sentence: Even when the source code contains gfx900 support, it does not reach the regular packages unless gfx900 is in the default build configuration.
The distinction "support code exists" ≠ "officially built" matters.

Example: hipCUB excluded gfx900 from its default build, while rocBLAS is still built for it by default.

Deep-dive

runtime support

The difference between build-time and run-time support

In one sentence: A GPU can be among the build targets and still not work if no suitable kernel is selected at run time.
Conversely, it can work at run time through a fallback path even when it was not included in the build.

Central to the investigation: gfx900 is "officially unsupported" yet works at run time on many paths. This distinction matters.

Deep-dive

lazy loading

Loading kernels only when they are needed

In one sentence: rocBLAS / Tensile do not load every kernel at startup; they load only the kernels needed at run time.
If gfx900 kernels are included in the package, this mechanism may load them.

Key: gfx900 files present in the package can be loaded lazily.
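
Lazy loading can be pictured as a cache that reads a kernel file only on first use. The Python sketch below models that behavior under stated assumptions: the `LazyKernelStore` class, the file names, and the loader function are hypothetical and do not reflect how rocBLAS or Tensile actually organize their files.

```python
class LazyKernelStore:
    """Loads each kernel at most once, and only when it is first requested."""

    def __init__(self, available_files, loader):
        self._available = set(available_files)
        self._loader = loader
        self._loaded = {}

    def get(self, name: str):
        if name not in self._available:
            raise FileNotFoundError(name)  # not shipped in the package
        if name not in self._loaded:
            # First use: pay the loading cost now rather than at startup.
            self._loaded[name] = self._loader(name)
        return self._loaded[name]

    @property
    def loaded_names(self):
        return sorted(self._loaded)


load_calls = []


def fake_loader(name: str) -> str:
    # Stand-in for reading a code object from disk.
    load_calls.append(name)
    return f"<kernel {name}>"


# Hypothetical file names, used only to drive the example.
store = LazyKernelStore(["gemm_gfx900.co", "gemm_gfx908.co"], fake_loader)
print(store.loaded_names)      # [] (nothing is loaded at startup)
store.get("gemm_gfx900.co")
store.get("gemm_gfx900.co")    # second call reuses the cached kernel
print(store.loaded_names)      # ['gemm_gfx900.co']
print(len(load_calls))         # 1
```

The point for gfx900 is the first check in `get`: a kernel can only be loaded lazily if its file was shipped in the first place.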

Deep-dive

gating

Stopping specific GPUs at the entrance to an execution path

In one sentence: Putting a check in the code that says "if it is this GPU, do not let it use this path."
In MIOpen, IsApplicable() plays this role.

Example: if(StartsWith(name, "gfx900")) return false; (gating in the MLIR iGEMM solver)

Deep-dive

deprecated

"Deprecated": not the same as removed

In one sentence: "Deprecated" means "support is planned to be reduced," not "stops working right away."
gfx900 has been deprecated since ROCm 5.x, but it still functions in many parts of the code.

Key distinction: deprecated → removed from default build → code deleted are separate stages.

Deep-dive

private issue

Non-public bug reports and discussions

In one sentence: Issues on GitHub that are set to non-public.
Even when code says a change was made because of such a private issue, outsiders cannot read the issue text.

Limitation: From the public side, only reference relationships and traces of gating can be confirmed. The contents cannot be asserted.