Note / 注記: This presentation organizes observations from publicly available sources and local repository clones only. It does not assert the contents of private issues or internal decision-making processes. / 本資料は、公開一次資料およびローカル clone から観測可能な範囲を整理したものであり、非公開 issue や社内意思決定の内容を断定するものではない。

Backend-Dependent Behavior on Legacy HBM GPUs

Advanced Evidence Edition: ROCm/HIP vs Vulkan on Vega/gfx900 — Within a Single Tested Setup

Akira Ito | AETS (Akatsuki Enterprise Technology Solutions) | aets-giken@hiroshima-aktk.com

IEICE GNW-68 | Kyushu Sangyo University | March 9, 2026

1

Background & Motivation

Scope is explicitly this environment + these runs. No general claim about Vega across all setups is made.
2

Problem Statement

Backend path isolation — not hardware condemnation — is the diagnostic question.
3

Evidence from Investigation: gfx900 Gate Matrix

Layer | Current handling of gfx900 | Blocks / Allows
Official matrix (ROCm 7.2) | gfx900 not listed in GPU target list | Blocks (official scope)
CMake filter (ollama) | CMakeLists.txt:127 default regex ^gfx(94[012]|101[02]|1030|110[012]|120[01])$ — gfx900 excluded; manual AMDGPU_TARGETS=gfx900 override possible | Blocks default / Allows manual
rocBLAS artifacts | Kernels.so-000-gfx900.hsaco found in /usr/lib/ollama/rocblas/library/ | Allows (this env)
ROCm 7.2 package | MIOpen Perf DB: gfx900 = 169 K lines (gfx1100: none); rocBLAS: 128 files (gfx1100: 96); firmware: 16× vega10 blobs | Shipped (exceeds RDNA 3)
Runner init validation | Journal: "Device 0: AMD Radeon RX Vega, gfx900:xnack- (0x900), VMM: no, Wave Size: 64" — recognized as valid agent | Allows (this env)
Source macros | common.cuh:65 GGML_CUDA_CC_VEGA=0x1000900; __gfx900__ dp4a path retained (software emulation via inline asm) | Allows
Runtime execution | In the matched runs, ROCm/HIP (override-assisted): all tested num_gpu values OK; Vulkan: SIGSEGV for num_gpu >= 1 | Backend-dependent (tested scope)

Source: work_log/investigations/2026-03-07_gfx900_gate_matrix.md | ollama/CMakeLists.txt:127 | ggml-cuda/common.cuh:65

Six of seven examined layers retained paths allowing execution in this environment. Official non-listing ≠ immediate impossibility — but scope is restricted to this setup.
4

CMake Default Filter: Exact Source Evidence

Source: ollama/CMakeLists.txt:121-128

check_language(HIP)
if(CMAKE_HIP_COMPILER)
    if(NOT AMDGPU_TARGETS)
        find_package(hip REQUIRED)
        list(FILTER AMDGPU_TARGETS INCLUDE REGEX
            "^gfx(94[012]|101[02]|1030|110[012]|120[01])$")
    endif()
    if(AMDGPU_TARGETS)
        add_subdirectory(.../ggml-hip)

Regex group | Targets included | Architecture
94[012] | gfx940, gfx941, gfx942 | CDNA3 (MI300)
101[02] | gfx1010, gfx1012 | RDNA1 (RX 5000)
1030 | gfx1030 | RDNA2 (RX 6000)
110[012] | gfx1100, gfx1101, gfx1102 | RDNA3 (RX 7000)
120[01] | gfx1200, gfx1201 | RDNA4 (RX 9000)
(absent) | gfx900 — NOT included | GCN5 (Vega 56/64)
gfx900 is excluded from ollama's default HIP build. The installed libggml-hip.so required an explicit AMDGPU_TARGETS override or pre-compiled artifact.
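The filter's effect can be reproduced outside CMake. A minimal Python sketch using the regex quoted above (CMake's `list(FILTER ... INCLUDE REGEX ...)` keeps only matching entries; for this anchored pattern, Python's `re.fullmatch` behaves the same way):

```python
import re

# Default HIP target filter quoted from ollama/CMakeLists.txt:127.
TARGET_FILTER = re.compile(r"^gfx(94[012]|101[02]|1030|110[012]|120[01])$")

def passes_default_filter(target: str) -> bool:
    """True if a gfx target survives ollama's default HIP build filter."""
    return TARGET_FILTER.fullmatch(target) is not None

print(passes_default_filter("gfx1100"))  # True  (RDNA3)
print(passes_default_filter("gfx900"))   # False (Vega/GCN5 is filtered out)
```

Running this over any candidate target list shows at a glance which architectures the default build retains.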
5

Code Trace: num_gpu Semantics (Exact Source Lines)

Stage | Code location | Exact content / interpretation
Python client | ollama/_types.py:104-110 | class Options: num_gpu: Optional[int] = None — raw int, no semantic annotation
API call | ollama/_client.py:281-305 | Passed as options=options in the /api/generate POST body
Go type | api/types.go:604 | NumGPU int `json:"num_gpu,omitempty"` in the Runner struct
Default value | api/types.go:1075 | NumGPU: -1, // -1 here indicates that NumGPU should be set dynamically
CLI help text | cmd/interactive.go:112 | "/set parameter num_gpu <int> The number of layers to send to the GPU"
Server | llm/server.go:992, 1063-1076 | assignLayers(..., s.options.NumGPU, ...); requestedLayers upper-bounded by len(layers)
Runner | runner/llamarunner/runner.go:906-924 | numGPU += len(layers.Layers); NumGpuLayers: numGPU
Bridge | llama/llama.go:264-267 | cparams.n_gpu_layers = C.int(params.NumGpuLayers)
llama.cpp API | include/llama.h:289 | int32_t n_gpu_layers; // number of layers to store in VRAM, a negative value means all layers
num_gpu = offloaded layer count, not GPU device count. Journal confirms: "offloaded 2/25 layers to GPU" when num_gpu=2. Do not treat num_gpu>=2 as multi-GPU evidence.
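The traced semantics condense into a small model. This is a deliberate simplification of the server.go / llama.h behavior above, not the actual implementation:

```python
def resolve_offload_layers(num_gpu: int, total_layers: int) -> int:
    """Simplified model of num_gpu resolution along the traced path:
    a negative value offloads all layers (llama.h: 'a negative value
    means all layers'); otherwise the request is capped at the model's
    layer count (server.go upper-bounds requestedLayers by len(layers))."""
    if num_gpu < 0:
        return total_layers
    return min(num_gpu, total_layers)

# The journal reports 25 layers for qwen3.5:2b ("offloaded 2/25 layers").
for requested in (0, 1, 2, -1, 100):
    print(requested, "->", resolve_offload_layers(requested, 25))
```

Note that `resolve_offload_layers(2, 25)` yields 2 offloaded layers on one GPU, matching the journal line and ruling out any multi-GPU reading of num_gpu=2.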
6

HIP Backend: gfx900 Architecture Constants & dp4a Emulation

Constant / Macro | Value | Comment (ggml-cuda/common.cuh)
GGML_CUDA_CC_GCN4 | 0x1000803 | Tonga/Fiji/Polaris — minimum for fast fp16 (line 64)
GGML_CUDA_CC_VEGA | 0x1000900 | Vega 56/64 — minimum for fp16 dual issue (line 65)
GGML_CUDA_CC_VEGA20 | 0x1000906 | MI50/Radeon VII — minimum for dp4a hardware (line 66)
GGML_CUDA_CC_IS_GCN(cc) | OFFSET_AMD < cc < CDNA1 | gfx900 falls in the GCN range, NOT CDNA (line 86)

dp4a dispatch in ggml_cuda_dp4a() (common.cuh:666–704):

Condition | Implementation | Cost
CDNA / RDNA2 / __gfx906__ | __builtin_amdgcn_sdot4(a, b, c, false) | 1 instruction (hardware)
RDNA3 / RDNA4 | __builtin_amdgcn_sudot4(...) | 1 instruction (hardware)
RDNA1 or __gfx900__ | 6-insn inline asm: v_mul_i32_i24 ×4 + v_add3_u32 ×2 | Software emulation (lines 672–685)

V_DOT2_F32_F16_AVAILABLE (common.cuh:715) also excludes gfx900 — no hardware fp16 dot product instruction.

gfx900 executes in the HIP backend but uses software dp4a emulation — a documented architectural limitation, not a bug.
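The emulated operation itself is simple to state: a dot product over four packed signed 8-bit lanes, accumulated into an int32. A reference sketch of the semantics (not the inline-asm implementation in common.cuh):

```python
import struct

def dp4a_ref(a: int, b: int, c: int) -> int:
    """Reference semantics of signed dp4a: unpack each 32-bit word into
    four signed 8-bit lanes, multiply pairwise, and accumulate into c.
    On gfx906+ this is one hardware instruction; on gfx900 ggml emulates
    it in software (v_mul_i32_i24 x4 + v_add3_u32 x2)."""
    a_lanes = struct.unpack("4b", struct.pack("<i", a))
    b_lanes = struct.unpack("4b", struct.pack("<i", b))
    return c + sum(x * y for x, y in zip(a_lanes, b_lanes))

# Example: lanes (1,2,3,4) . (4,3,2,1) + 10 = 4 + 6 + 6 + 4 + 10 = 30
a = struct.unpack("<i", struct.pack("4b", 1, 2, 3, 4))[0]
b = struct.unpack("<i", struct.pack("4b", 4, 3, 2, 1))[0]
print(dp4a_ref(a, b, 10))  # 30
```

The six-instruction emulation computes exactly this result; the cost difference is throughput, not correctness, which is why the slide calls it a documented architectural limitation rather than a bug.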
7

Engineering Challenges

The three independent variables — support policy, backend choice, and override configuration — must be held separate in any causal claim.
8

Evidence-First Investigation Strategy

Layer | Intervention | Effect
L1: Distribution / build | Inspect support matrix, presets, target filters, installed artifacts | Locate where gfx900 is blocked/allowed at the build layer
L2: API semantics | Trace num_gpu from client → server → runner → llama.cpp | Establish the correct meaning: offloaded layers, not device count
L3: Runtime comparison | Matched ROCm (:11435) vs Vulkan (:11434) — same hardware, same model, same workload | Isolate the backend-specific failure in these runs
L4: Evidence capture | result.json, journal, backend_probe, rocm-smi, ollama ps | Reproducible claims with run_id and file-path linkage

Principle: falsifiable diagnosis over optimistic one-off success.

Each layer answers a distinct question. Collapsing them leads to misattribution.
9

Experimental Setup

Component | Specification
GPU | AMD Radeon RX Vega 56 (gfx900), 8 GB HBM2
OS / kernel | EndeavourOS, kernel 6.12.74-1-lts
ROCm endpoint | Ollama 0.17.5 (03-07) / 0.17.6 (03-08) via :11435, libggml-hip.so
Vulkan endpoint | Ollama 0.17.4 via :11434, libggml-vulkan.so
Model / workload | qwen3.5:2b, NUM_PREDICT=512, num_gpu = 0, 1, 2, -1
ROCm override | HSA_OVERRIDE_GFX_VERSION=9.0.0 (ROCm service only; no override applied to the Vulkan service)
Override note: The ROCm result was obtained under an override-assisted service configuration (HSA_OVERRIDE_GFX_VERSION=9.0.0). This supports conditional executability in this setup — not official support restoration. The Vulkan service ran without any override.
Two services, same hardware. Matched one-epoch runs were repeated across two days; ROCm result is override-dependent, Vulkan result is not.
10

Matched Results — Reproduced Across Two Days

num_gpu | ROCm (:11435) run_20260307_012643 [override-assisted] | Vulkan (:11434) run_20260307_013050 [no override]
0 | OK (46.689s, eval_count=512) | OK (45.002s, eval_count=512)
1 | OK (48.742s) | HTTP 500 + SIGSEGV (2.672s)
2 | OK (47.748s) | HTTP 500 + SIGSEGV (7.640s)
-1 | OK (44.265s) | HTTP 500 + SIGSEGV (9.816s)
Easy reading guide (non-specialist)
  • num_gpu means how many "thinking blocks" are sent to GPU, not how many GPU cards exist.
  • OK means the answer finished normally. HTTP 500 + SIGSEGV means the program crashed during calculation.
  • When num_gpu=0, both backends work. When num_gpu is 1 or more, only Vulkan crashes in this qwen test.
  • Seconds (s) are total time; in error rows this is mostly "time until crash".

Matched rerun on 2026-03-08: ROCm run_20260308_201305 (0.17.6, override-assisted) again passed all tested values; Vulkan run_20260308_201641 again failed for num_gpu=1/2/-1 with nearly identical crash times (2.70 / 7.64 / 9.81s).

Observation (in these matched qwen3.5:2b runs): Vulkan failure is conditional on num_gpu >= 1, not universal. ROCm/HIP passed all tested values under override-assisted configuration on both days.

The ROCm/HIP vs Vulkan difference for qwen3.5:2b is no longer a one-off matched run; the same pattern was reproduced on 2026-03-07 and 2026-03-08.
11

Run-Level Data Snapshot

Run | num_gpu | status | elapsed_sec | eval_duration (ns) | response_chars
run_20260307_012643 (ROCm) | 0 | ok | 46.689 | 42016540155 | 0
run_20260307_012643 (ROCm) | 1 | ok | 48.742 | 43853670072 | 1809
run_20260307_012643 (ROCm) | 2 | ok | 47.748 | 42307474936 | 1951
run_20260307_012643 (ROCm) | -1 | ok | 44.265 | 42084598410 | 1849
run_20260307_013050 (Vulkan) | 0 | ok | 45.002 | 41778665097 | 0
run_20260307_013050 (Vulkan) | 1/2/-1 | error | 2.672 / 7.640 / 9.816
run_20260308_201305 (ROCm) | 0/1/2/-1 | all ok | 47.98 / 49.18 / 48.30 / 45.17 | 512 each | normal completion
run_20260308_201641 (Vulkan) | 0 | ok | 45.53 | 512 | normal completion
run_20260308_201641 (Vulkan) | 1/2/-1 | error | 2.70 / 7.64 / 9.81
run_20260308_201924 (Vulkan) | 0 × 5 epochs | 5/5 ok | 50.03 / 42.46 / 43.18 / 42.67 / 43.07 | 512 each | no crash
What each column tells us (simple words)
  • Run: one experiment batch ID (like a notebook page number).
  • num_gpu: how many model layers are moved to GPU.
  • status: success or failure.
  • elapsed_sec: wall-clock seconds until finish (or until crash).
  • eval_duration(ns): model calculation time only; bigger number means slower compute.
  • response_chars: output text length. Zero does not always mean crash.

Data source: vega_work_log/*/result.json | 16 matched-record rows extracted from the full log set.

Raw numbers now include one matched rerun and a 5-epoch Vulkan num_gpu=0 stability check. Error elapsed time reflects time-to-crash, not inference time.
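The per-num_gpu status pivot used throughout this deck can be built with a few lines. The record shape below is assumed from the snapshot columns (the actual result.json schema in vega_work_log/ may differ); the sample values are the observed Vulkan run:

```python
import json

# Hypothetical record shape, populated with observed values from
# run_20260307_013050 (Vulkan).
records = [
    {"run": "run_20260307_013050", "num_gpu": 0, "status": "ok", "elapsed_sec": 45.002},
    {"run": "run_20260307_013050", "num_gpu": 1, "status": "error", "elapsed_sec": 2.672},
    {"run": "run_20260307_013050", "num_gpu": 2, "status": "error", "elapsed_sec": 7.640},
    {"run": "run_20260307_013050", "num_gpu": -1, "status": "error", "elapsed_sec": 9.816},
]

def status_by_num_gpu(rows):
    """Pivot num_gpu -> list of statuses: the shape that exposes the
    'fails only when offloading' pattern."""
    pivot = {}
    for r in rows:
        pivot.setdefault(r["num_gpu"], []).append(r["status"])
    return pivot

print(json.dumps(status_by_num_gpu(records), indent=2))
```

The same pivot, run over all 119 records, is what analyze_runs.py aggregates into analysis_summary.md.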
12

Vulkan Failure Timeline (Journal Evidence)

Timestamp (JST) | Event | Meaning
01:31:37 | load_backend: libggml-vulkan.so loaded; "int dot: 0 | matrix cores: none" | Backend init completed; hardware feature caps confirmed
01:31:38 | offloaded 1/25 layers to GPU (Vulkan0); compute graph Vulkan0: 257.1 MiB | Model load and GPU allocation completed normally
01:31:39 | llama runner started in 1.72 seconds | Runner startup succeeded
01:31:39 | SIGSEGV: segmentation violation (PC=0x7efbc5b801b5, addr=0x3a8000) | Runner process crash — first inference batch, not during load
01:31:39 | ggml_backend_sched_graph_compute_async → computeBatch (runner.go:716) | Crash in the compute-graph scheduling path, via cgo
01:31:39 | server.go: post predict … EOF | API side sees runner termination as EOF → HTTP 500
01:31:47 / 01:31:57 | Same sequence repeats for num_gpu=2 and -1 | Consistent pattern within these runs for all GPU-offload cases
What each column tells us (simple words)
  • Timestamp: when each step happened.
  • Event: what the system said or did at that moment.
  • Meaning: why that step matters for our diagnosis.
  • The key story is: load finished first, then the crash happened when real calculation started.

Source: vega_work_log/run_20260307_013050/ollama_journal_since_start.txt

Crash occurs at first compute, not at load — consistent with a compute-path issue, not an allocation or init issue.
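The load-then-crash ordering can be checked mechanically rather than by eye. A minimal sketch over journal lines (the marker strings are taken from the excerpts above; the helper name is hypothetical):

```python
def crash_after_load(journal_lines):
    """Return True when the first SIGSEGV appears only after the model
    reported a successful layer offload -- the pattern observed above,
    which points at the compute path rather than load/allocation."""
    load_seen = False
    for line in journal_lines:
        if "offloaded" in line and "layers to GPU" in line:
            load_seen = True
        if "SIGSEGV" in line:
            return load_seen
    return False

sample = [
    "load_backend: libggml-vulkan.so loaded",
    "offloaded 1/25 layers to GPU (Vulkan0)",
    "llama runner started in 1.72 seconds",
    "SIGSEGV: segmentation violation",
]
print(crash_after_load(sample))  # True
```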
13

Vulkan Crash: Full Stack Trace (Journal, 01:31:39 JST)

From: vega_work_log/run_20260307_013050/ollama_journal_since_start.txt — immediately after runner reported ready

SIGSEGV: segmentation violation
PC=0x7efbc5b801b5 m=41 sigcode=1 addr=0x3a8000
signal arrived during cgo execution

goroutine 853 [syscall]:
runtime.cgocall(0x55e54a0e2710, 0xc0000bdaa0)
    runtime/cgocall.go:167
github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_graph_compute_async(...)
    _cgo_gotypes.go:979
github.com/ollama/ollama/ml/backend/ggml.(*Context).ComputeWithNotify(...)
    ml/backend/ggml/ggml.go:825 +0x1b2
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(...)
    runner/ollamarunner/runner.go:716 +0x862
github.com/ollama/ollama/runner/ollamarunner.(*Server).run.gowrap1()
    runner/ollamarunner/runner.go:459
How to read this block (simple words)
  • SIGSEGV means the program touched memory in a bad way and crashed.
  • cgo execution means the Go program was inside C/C++ backend code at the moment of failure.
  • computeBatch means this happened during real model calculation, not while just loading files.
  • This block does not tell us the exact Vulkan API call yet; it tells us which part of the software stack was active when it crashed.

Crash point: C function ggml_backend_sched_graph_compute_async called via cgo. Go runtime receives SIGSEGV and terminates the runner. The exact Vulkan API call within that C function is not resolved from this trace alone.

Stack trace is factual. Root cause within ggml_backend_sched_graph_compute_async (C/Vulkan side) remains unresolved — this is a stated limitation.
14

Targeted Follow-Up Tests: Model Dependence & Crash Timing

Test | Run / condition | Observed result | Interpretation
Model swap | run_20260308_212345, Vulkan, MODEL=phi4-mini:latest, NUM_GPU=0,1 | num_gpu=0: ok (18.82s, 318 tok); num_gpu=1: ok (31.55s, 512 tok) | The offload crash is not universal on gfx900 Vulkan; it is model/workload-dependent
Token-count control | run_20260308_212254, Vulkan, qwen3.5:2b, NUM_PREDICT=128, NUM_GPU=1 | error at 2.858s (vs mean 2.653s at NUM_PREDICT=512) | Crash timing is largely independent of requested output length
Timing statistics | 3 qwen3.5:2b Vulkan runs, num_gpu >= 1 | num_gpu=1: 2.653s ±0.059s; 2: 7.630s ±0.014s; -1: 9.819s ±0.016s | Failure timing is highly deterministic, not obviously a random race
What each column tells us (simple words)
  • Test: what extra question we asked.
  • Run / Condition: which model and settings we used.
  • Observed result: what actually happened.
  • Interpretation: the simplest lesson we can take from that result.
  • The big lesson here is: the crash is not universal, and it is not mainly caused by asking for more output tokens.

Source: run_20260308_212345, run_20260308_212254, and analysis_summary.md derived from 15 runs / 119 records in vega_work_log/.

Refined claim: qwen3.5:2b Q8_0 reproducibly crashes on Vulkan offload in this setup, but phi4-mini does not, and the crash timing is largely token-count independent. The failure is backend-sensitive and model-dependent.
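The ± figures above are plain mean and sample standard deviation over per-run crash times. A sketch of that computation — the three sample values here are hypothetical, since only the aggregates (e.g. 2.653s ±0.059s over 3 runs) are reported:

```python
from statistics import mean, stdev

def timing_stats(samples):
    """Mean and sample standard deviation, the form behind the +/- figures."""
    return mean(samples), stdev(samples)

# Hypothetical per-run crash times for num_gpu=1; real raw values live in
# the vega_work_log result.json files.
m, s = timing_stats([2.60, 2.65, 2.71])
print(f"{m:.3f}s +/- {s:.3f}s")
```

A sub-0.1s spread across runs is what justifies calling the failure timing deterministic rather than a race.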
15

Vulkan: gfx900 AMD_GCN Classification & Feature Impact

Code location | Content | Impact on gfx900
ggml-vulkan.cpp:250-252 | enum vk_device_architecture { OTHER, AMD_GCN, AMD_RDNA1, AMD_RDNA2, AMD_RDNA3 } | gfx900 → AMD_GCN branch
ggml-vulkan.cpp:296-297 | if (maxSubgroupSize==64 && minSubgroupSize==64) return AMD_GCN; | GCN is wave64-only → classified AMD_GCN
ggml-vulkan.cpp:3964 | use_subgroups = subgroup_arithmetic && arch != AMD_GCN | Subgroup arithmetic disabled for gfx900
ggml-vulkan.cpp:3241-3243 | AMD_GCN + open-source driver → altered warptile: {256,64,64,32,16,16,2,2,2,1,16} | Matrix-multiply tile parameters differ from RDNA
ggml-vulkan.cpp:3953-3957 | AMD_GCN: rm_stdq=2, rm_kq=4, rm_stdq_int=4 | Different row-reduce parameters

Journal evidence (01:31:37):

ggml_vulkan: 0 = AMD Radeon RX Vega (RADV VEGA10) (radv) |
  uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 |
  shared memory: 65536 | int dot: 0 | matrix cores: none

int dot: 0 = VK_KHR_shader_integer_dot_product not hardware-accelerated. matrix cores: none = no cooperative matrix. Both are consistent with source-level AMD_GCN code paths.

Vulkan correctly identifies gfx900 as AMD_GCN and applies GCN-specific code paths. Whether those paths are crash-safe for this GPU is what these runs expose.
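The classification rule at ggml-vulkan.cpp:296-297 reduces to a single wave-size check. A condensed sketch (the real function inspects further device properties; the second return label is a placeholder, not a ggml identifier):

```python
def classify_amd_arch(min_subgroup: int, max_subgroup: int) -> str:
    """Condensed sketch of the rule quoted above: a device that can only
    report subgroup size 64 (GCN wave64) is classified AMD_GCN. RDNA parts
    expose wave32/wave64 and so fall through."""
    if min_subgroup == 64 and max_subgroup == 64:
        return "AMD_GCN"
    return "OTHER_OR_RDNA"

print(classify_amd_arch(64, 64))  # gfx900: wave64-only -> AMD_GCN
print(classify_amd_arch(32, 64))  # RDNA-style wave32/64 -> not GCN
```

The journal's "warp size: 64" line is exactly the property this rule keys on, which is why classification itself is not in doubt — only the safety of the AMD_GCN code paths is.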
16

Deep Code Tracking Summary (11-Point Trace)

Point | File / line | What it establishes
1–2 | ollama/_types.py:104-110, _client.py:281-305 | num_gpu is a load-time option forwarded as-is to /api/generate; no GPU-count semantics in the client
3–4 | ollama/api/types.go:600-608, 1071-1076 | NumGPU belongs to model-load options; the default -1 means a dynamic policy, not "use all GPUs"
5 | ollama/cmd/interactive.go:112 | The CLI defines num_gpu explicitly as "the number of layers to send to the GPU"
6–7 | ollama/llm/server.go:992, 1063-1076 | NumGPU is passed as requestedLayers and capped at len(layers) — confirms layer semantics
8 | runner/llamarunner/runner.go:906-924 | NumGpuLayers is accumulated from per-device layer counts — not a device count
9–11 | llama/llama.go:264-267, include/llama.h:289 | n_gpu_layers: "number of layers to store in VRAM; a negative value means all layers"
Eleven trace points across nine distinct source locations converge on the same conclusion: num_gpu controls layer offload count, not GPU device selection.
17

Failure Diagnosis

Hypothesis | Expected evidence | Observed (in these runs)
Vega is generally unusable | Both backends fail on the same GPU | Contradicted: ROCm/HIP succeeded in these runs
num_gpu means GPU count | Failure implies multi-GPU entry | Contradicted: the code trace establishes layer-count semantics
Model-/workload-dependent Vulkan compute-path issue on AMD_GCN | qwen3.5:2b fails at the first compute batch via cgo, while phi4-mini completes at num_gpu=1 | Consistent with the observed evidence

Four-Tier Verdict (Based on Collected Evidence)

Tier | Statement
1. Official support | gfx900 is not officially supported in ROCm 7.2 or ollama's default build targets.
2. Conditional executability | In this environment, with HSA_OVERRIDE_GFX_VERSION=9.0.0 and pre-existing artifacts, ROCm/HIP ran successfully across all tested num_gpu values on two matched-run days.
3. Shipped artifacts | The ROCm 7.2 package ships gfx900 Perf DB (169 K lines), 128 rocBLAS pre-compiled kernels, and 16 firmware blobs — exceeding gfx1100 (RDNA 3) and gfx1030 (RDNA 2) on multiple metrics. This implies active build-pipeline inclusion.
4. Backend-dependent stability | Under matched qwen3.5:2b conditions across two days, Vulkan SIGSEGV'd for num_gpu >= 1 while ROCm/HIP did not; Vulkan remained stable at num_gpu=0, including a 5-epoch run. A phi4-mini follow-up completed at num_gpu=1, indicating model dependence.

rocBLAS Pre-compiled Files

gfx942 242
gfx906 156
gfx900 128
gfx1100 96
gfx1030 88

MIOpen Perf DB (top variant, line counts)

gfx942 470 K
gfx906 235 K
gfx900 169 K
gfx1030 111 K
gfx1100 0
The four tiers are independent claims. Tier 2 does not contradict Tier 1; Tier 3 shows that vendor packaging goes beyond code residue; Tier 4 is supported by two matched qwen runs, a 5-epoch Vulkan num_gpu=0 control, a phi4-mini counterexample, and deterministic crash-timing statistics.
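The per-architecture artifact counts above can be regenerated from an installed tree. A minimal sketch (helper name hypothetical; the directory path is the one observed on slide 4):

```python
from pathlib import Path

def count_arch_files(library_dir: str, arch: str) -> int:
    """Count pre-compiled rocBLAS artifacts that name a given gfx arch,
    e.g. Kernels.so-000-gfx900.hsaco under the rocblas library directory."""
    return sum(1 for p in Path(library_dir).iterdir() if arch in p.name)

# Example against the path observed in this environment (adjust as needed):
# print(count_arch_files("/usr/lib/ollama/rocblas/library", "gfx900"))  # 128 here
```

Regenerating the counts locally keeps the Tier 3 claim auditable rather than taken on faith.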
18

Reproducibility Package

Artifact | Path | Role
Primary metrics | vega_work_log/run_*/result.json | Status, latency, error_type per num_gpu case
Crash-phase logs | vega_work_log/run_*/ollama_journal_since_start.txt | Full stack trace and EOF timing
Backend identity | vega_work_log/run_*/backend_probe.txt | Confirms ROCm vs Vulkan service path and device recognition
Aggregate analysis | analyze_runs.py + analysis_summary.md | Full-log aggregation over 15 runs / 119 records; crash-timing statistics and pivots
Code semantics trace | work_log/investigations/2026-03-07_numgpu_semantics_trace.md | Line-level evidence for num_gpu semantics
Gate analysis | work_log/investigations/2026-03-07_gfx900_gate_matrix.md | Layered block/allow model for gfx900

Replay command: OLLAMA_HOST=http://127.0.0.1:11435|11434 EPOCHS=1 NUM_PREDICT=512 NUM_GPU=0,1,2,-1 python vega-loop_qwen_rocm.py

Stability check: OLLAMA_HOST=http://127.0.0.1:11434 EPOCHS=5 NUM_GPU=0 NUM_PREDICT=512 python vega-loop_qwen.py

Every claim in this deck is linked to a run_id + file path. The matched rerun, the Vulkan num_gpu=0 control, the targeted phi4-mini / NUM_PREDICT follow-ups, and the aggregate analysis are all reproducible from the current files.
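The request shape the replay loop sends to each endpoint can be sketched without a running server. The payload fields follow the public Ollama /api/generate options; the builder function and prompt text are illustrative, not the actual script contents:

```python
import json

def build_sweep(model: str, num_predict: int, num_gpu_values):
    """Build one /api/generate payload per num_gpu value -- the request
    shape sent to each endpoint (:11435 ROCm, :11434 Vulkan).
    stream=False makes each request return a single JSON object."""
    return [
        {
            "model": model,
            "prompt": "benchmark prompt",
            "stream": False,
            "options": {"num_predict": num_predict, "num_gpu": n},
        }
        for n in num_gpu_values
    ]

for payload in build_sweep("qwen3.5:2b", 512, [0, 1, 2, -1]):
    print(json.dumps(payload))
```

POSTing each payload to both ports and recording status plus timing reproduces the matched-run tables on slides 11–12.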
19

Limitations & Future Work

Limitations:

  • Backend and Ollama version are partially confounded: the Vulkan endpoint ran 0.17.4 while the ROCm endpoint ran 0.17.5/0.17.6.
  • The ROCm result depends on HSA_OVERRIDE_GFX_VERSION=9.0.0; no same-version override-off comparison exists yet.
  • The root cause inside ggml_backend_sched_graph_compute_async (C/Vulkan side) remains unresolved from the stack trace alone.
  • Evidence covers one hardware unit and two models; the qwen3.5:2b vs phi4-mini difference is observed but not explained.

Future work: Additional models and quantization formats; matched Ollama version pair; same-version override-off vs override-on comparison; deeper Vulkan trace (RADV validation layers); broader multi-epoch stress under GPU-offload settings.

The new rerun removes the strongest one-off criticism, but the center of gravity has shifted: the key unresolved variable is now model/workload specificity, alongside version mismatch and override dependence.
20

Takeaway & Key Messages

Not officially supported

gfx900 is absent from the ROCm 7.2 support matrix and ollama's default HIP build targets. This is a factual statement about vendor policy, not execution capability.

Conditionally executable

In this environment, under override-assisted configuration, ROCm/HIP ran across all tested num_gpu values on both matched-run days. Five of the six examined gate layers retained executable paths.

Shipped beyond code residue

ROCm 7.2 ships gfx900 Perf DB (169 K lines), 128 rocBLAS kernels, and 16 firmware blobs — exceeding RDNA 3/2 on multiple metrics. This implies active build-pipeline inclusion, not passive code survival.

Backend-dependent stability

For qwen3.5:2b across two matched runs, Vulkan SIGSEGV'd for num_gpu>=1 while ROCm/HIP did not; Vulkan stayed stable at num_gpu=0 for 5/5 epochs. A phi4-mini follow-up passed at num_gpu=1, so the failure is not universal to gfx900 Vulkan.

Based on the collected evidence: "not officially supported" and "not executable in any configuration" are distinct statements. The shipped-artifacts finding adds a fourth dimension: vendor packaging actively includes gfx900 in ways that surpass some supported architectures.

本資料が主張しないこと / What this presentation does not claim

  • AMD の社内意思決定過程を断定するものではない / It does not assert AMD's internal decision-making processes.
  • llvm-project-private#389 の内容を推定で補完するものではない / It does not fill in the content of the private issue llvm-project-private#389 by inference.
  • 本資料の事例が ROCm 全体の一般法則として確定しているとするものではない / It does not present this single case as an established general rule for ROCm as a whole.
  • AMD の support policy 全体を完全に代表するものではない / It does not claim to fully represent AMD's support policy as a whole.
  • AMD または特定個人への批判を意図するものではない / It does not intend criticism of AMD or any specific individual.