Akira Ito | AETS (Akatsuki Enterprise Technology Solutions) | aets-giken@hiroshima-aktk.com
IEICE GNW-68 | Kyushu Sangyo University | March 9, 2026
| Layer | Current handling of gfx900 | Blocks / Allows |
|---|---|---|
| Official matrix (ROCm 7.2) | gfx900 not listed in GPU target list | Blocks (official scope) |
| CMake filter (ollama) | CMakeLists.txt:127 default regex ^gfx(94[012]|101[02]|1030|110[012]|120[01])$ — gfx900 excluded; manual AMDGPU_TARGETS=gfx900 override possible | Blocks default / Allows manual |
| rocBLAS artifacts | Kernels.so-000-gfx900.hsaco found in /usr/lib/ollama/rocblas/library/ | Allows (this env) |
| ROCm 7.2 package | MIOpen Perf DB: gfx900 = 169 K lines (gfx1100: none); rocBLAS: 128 files (gfx1100: 96); firmware: 16× vega10 blobs | Shipped (exceeds RDNA 3) |
| Runner init validation | Journal: "Device 0: AMD Radeon RX Vega, gfx900:xnack- (0x900), VMM: no, Wave Size: 64" — recognized as valid agent | Allows (this env) |
| Source macros | common.cuh:65 GGML_CUDA_CC_VEGA=0x1000900; __gfx900__ dp4a path retained (software emulation via inline asm) | Allows |
| Runtime execution | In the matched runs, ROCm/HIP (override-assisted): all tested num_gpu values OK. Vulkan: SIGSEGV for num_gpu >= 1. | Backend-dependent (tested scope) |
Source: work_log/investigations/2026-03-07_gfx900_gate_matrix.md | ollama/CMakeLists.txt:127 | ggml-cuda/common.cuh:65
Source: ollama/CMakeLists.txt:121-128
```cmake
check_language(HIP)
if(CMAKE_HIP_COMPILER)
    if(NOT AMDGPU_TARGETS)
        find_package(hip REQUIRED)
        list(FILTER AMDGPU_TARGETS INCLUDE REGEX
             "^gfx(94[012]|101[02]|1030|110[012]|120[01])$")
    endif()
    if(AMDGPU_TARGETS)
        add_subdirectory(.../ggml-hip)
```
| Regex group | Targets included | Architecture |
|---|---|---|
| 94[012] | gfx940, gfx941, gfx942 | CDNA3 (MI300) |
| 101[02] | gfx1010, gfx1012 | RDNA1 (RX 5000) |
| 1030 | gfx1030 | RDNA2 (RX 6000) |
| 110[012] | gfx1100, gfx1101, gfx1102 | RDNA3 (RX 7000) |
| 120[01] | gfx1200, gfx1201 | RDNA4 (RX 9000) |
| (absent) | gfx900 — NOT included | GCN5 / Vega56/64 |
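The exclusion can be reproduced outside of CMake. A minimal Python sketch applies the same regex from CMakeLists.txt:127 to a set of sample targets:

```python
# Sketch: reproduce the CMakeLists.txt:127 default target filter in Python.
# The regex is copied verbatim from the table above; the target list is a
# sample chosen for illustration.
import re

AMDGPU_FILTER = re.compile(r"^gfx(94[012]|101[02]|1030|110[012]|120[01])$")

targets = ["gfx900", "gfx906", "gfx940", "gfx1010",
           "gfx1030", "gfx1100", "gfx1201"]
passed = [t for t in targets if AMDGPU_FILTER.match(t)]
print(passed)  # gfx900 (and gfx906) are filtered out; the rest pass
```

Because the filter only runs when AMDGPU_TARGETS is unset, passing `-DAMDGPU_TARGETS=gfx900` bypasses it entirely, which is the manual override noted in the gate matrix.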
| Stage | Code location | Exact content / Interpretation |
|---|---|---|
| Python Client | ollama/_types.py:104-110 | class Options: num_gpu: Optional[int] = None — raw int, no semantic annotation |
| API call | ollama/_client.py:281-305 | Passed as options=options in /api/generate POST body |
| Go type | api/types.go:604 | NumGPU int `json:"num_gpu,omitempty"` in Runner struct |
| Default value | api/types.go:1075 | NumGPU: -1, // -1 here indicates that NumGPU should be set dynamically |
| CLI help text | cmd/interactive.go:112 | "/set parameter num_gpu <int> The number of layers to send to the GPU" |
| Server | llm/server.go:992, 1063-1076 | assignLayers(..., s.options.NumGPU, ...); requestedLayers upper-bounded by len(layers) |
| Runner | runner/llamarunner/runner.go:906-924 | numGPU += len(layers.Layers); NumGpuLayers: numGPU |
| Bridge | llama/llama.go:264-267 | cparams.n_gpu_layers = C.int(params.NumGpuLayers) |
| llama.cpp API | include/llama.h:289 | int32_t n_gpu_layers; // number of layers to store in VRAM, a negative value means all layers |
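The semantics established by this trace can be condensed into a small helper. This is a hypothetical sketch, not ollama code: it mirrors the cap at len(layers) (llm/server.go) and the "negative means all layers" rule (llama.h:289):

```python
# Hypothetical sketch of num_gpu semantics: it counts *layers offloaded
# to VRAM*, never GPU devices.
def resolve_gpu_layers(num_gpu: int, total_layers: int) -> int:
    if num_gpu < 0:                      # default -1: offload all layers
        return total_layers
    return min(num_gpu, total_layers)    # upper-bounded by len(layers)

print(resolve_gpu_layers(-1, 25))  # 25 — all layers
print(resolve_gpu_layers(2, 25))   # 2  — two layers, regardless of GPU count
print(resolve_gpu_layers(99, 25))  # 25 — capped
```

This is why testing num_gpu=2 on a single-GPU system is a valid layer-offload case, not a request for a second device.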
| Constant / Macro | Value | Comment (ggml-cuda/common.cuh) |
|---|---|---|
| GGML_CUDA_CC_GCN4 | 0x1000803 | Tonga/Fiji/Polaris — minimum for fast fp16 (line 64) |
| GGML_CUDA_CC_VEGA | 0x1000900 | Vega56/64 — minimum for fp16 dual issue (line 65) |
| GGML_CUDA_CC_VEGA20 | 0x1000906 | MI50/Radeon VII — minimum for dp4a hardware (line 66) |
| GGML_CUDA_CC_IS_GCN(cc) | OFFSET_AMD < cc < CDNA1 | gfx900 falls in GCN range, NOT CDNA (line 86) |
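The encoding scheme behind these constants can be sketched in Python. The OFFSET_AMD base (0x1000000) is deducible from the table (GCN4 = offset + 0x803); the CDNA1 value (gfx908, MI100) is an assumption for illustration:

```python
# Sketch of the AMD compute-capability encoding in ggml-cuda/common.cuh.
# CC_OFFSET_AMD and CC_CDNA1 are assumed values; the three named macros
# match the table above.
CC_OFFSET_AMD = 0x1000000           # assumed AMD base offset
CC_GCN4   = CC_OFFSET_AMD + 0x803   # Tonga/Fiji/Polaris
CC_VEGA   = CC_OFFSET_AMD + 0x900   # Vega56/64 (gfx900)
CC_VEGA20 = CC_OFFSET_AMD + 0x906   # MI50/Radeon VII
CC_CDNA1  = CC_OFFSET_AMD + 0x908   # assumed: gfx908 (MI100)

def is_gcn(cc: int) -> bool:
    # mirrors GGML_CUDA_CC_IS_GCN: strictly between AMD offset and CDNA1
    return CC_OFFSET_AMD < cc < CC_CDNA1

print(hex(CC_VEGA), is_gcn(CC_VEGA))  # 0x1000900 True — gfx900 is GCN
print(is_gcn(CC_CDNA1))               # False — CDNA sits outside the range
```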
dp4a dispatch in ggml_cuda_dp4a() (common.cuh:666–704):
| Condition | Implementation | Cost |
|---|---|---|
| CDNA / RDNA2 / __gfx906__ | __builtin_amdgcn_sdot4(a, b, c, false) | 1 instruction (hardware) |
| RDNA3 / RDNA4 | __builtin_amdgcn_sudot4(...) | 1 instruction (hardware) |
| RDNA1 / __gfx900__ | 6-insn inline asm: v_mul_i32_i24 ×4 + v_add3_u32 ×2 | Software emulation (line 672–685) |
V_DOT2_F32_F16_AVAILABLE (common.cuh:715) also excludes gfx900 — no hardware fp16 dot product instruction.
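What the emulated path computes is easy to state in Python: dp4a treats each 32-bit operand as four packed signed 8-bit lanes and accumulates their products. This sketch reproduces the arithmetic (not the inline asm itself):

```python
# Sketch: the value computed by dp4a(a, b, c) — c + sum of the four
# lane-wise products of signed 8-bit values packed into a and b.
# On gfx900 this takes 6 instructions instead of 1 hardware sdot4.
import struct

def dp4a(a: int, b: int, c: int) -> int:
    xs = struct.unpack("4b", struct.pack("<i", a))  # four signed bytes of a
    ys = struct.unpack("4b", struct.pack("<i", b))  # four signed bytes of b
    return c + sum(x * y for x, y in zip(xs, ys))

# lanes (1,1,1,1) · (2,2,2,2) = 8, plus accumulator 10
print(dp4a(0x01010101, 0x02020202, 10))  # 18
```

The result is identical across the hardware and software paths; only the per-instruction cost differs, which matters because dp4a sits in the inner loop of quantized matrix multiplication.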
num_gpu is often misread as a device count, causing wrong failure attribution when num_gpu >= 2 is tested. All ROCm results here assume HSA_OVERRIDE_GFX_VERSION=9.0.0; this is an override-assisted configuration, not a standard deployment.
| Layer | Intervention | Effect |
|---|---|---|
| L1: Distribution / Build | Inspect support matrix, presets, target filters, installed artifacts | Locate where gfx900 is blocked/allowed at the build layer |
| L2: API Semantics | Trace num_gpu from client → server → runner → llama.cpp | Establish correct meaning: offloaded layers, not device count |
| L3: Runtime Comparison | Matched ROCm (:11435) vs Vulkan (:11434) — same hardware, same model, same workload | Isolate backend-specific failure in these runs |
| L4: Evidence Capture | result.json, journal, backend_probe, rocm-smi, ollama ps | Reproducible claims with run_id and file-path linkage |
Principle: falsifiable diagnosis over optimistic one-off success.
| Component | Specification |
|---|---|
| GPU | AMD Radeon RX Vega 56 (gfx900), 8 GB HBM2 |
| OS / Kernel | EndeavourOS, Kernel 6.12.74-1-lts |
| ROCm endpoint | Ollama 0.17.5 (03-07) / 0.17.6 (03-08) via :11435, libggml-hip.so |
| Vulkan endpoint | Ollama 0.17.4 via :11434, libggml-vulkan.so |
| Model / workload | qwen3.5:2b, NUM_PREDICT=512, num_gpu=0,1,2,-1 |
| ROCm override | HSA_OVERRIDE_GFX_VERSION=9.0.0 (ROCm service only; Vulkan service: no override applied) |
The ROCm endpoint ran with HSA_OVERRIDE_GFX_VERSION=9.0.0. This supports conditional executability in this setup — not official support restoration. The Vulkan service ran without any override.
| num_gpu | ROCm (:11435) run_20260307_012643 [override-assisted] | Vulkan (:11434) run_20260307_013050 [no override] |
|---|---|---|
| 0 | OK (46.689s, eval_count=512) | OK (45.002s, eval_count=512) |
| 1 | OK (48.742s) | HTTP 500 + SIGSEGV (2.672s) |
| 2 | OK (47.748s) | HTTP 500 + SIGSEGV (7.640s) |
| -1 | OK (44.265s) | HTTP 500 + SIGSEGV (9.816s) |
Matched rerun on 2026-03-08: ROCm run_20260308_201305 (0.17.6, override-assisted) again passed all tested values; Vulkan run_20260308_201641 again failed for num_gpu=1/2/-1 with nearly identical crash times (2.70 / 7.64 / 9.81s).
Observation (in these matched qwen3.5:2b runs): Vulkan failure is conditional on num_gpu >= 1, not universal. ROCm/HIP passed all tested values under override-assisted configuration on both days.
| Run | num_gpu | status | elapsed_sec | eval_duration(ns) | response_chars |
|---|---|---|---|---|---|
| run_20260307_012643 (ROCm) | 0 | ok | 46.689 | 42016540155 | 0 |
| run_20260307_012643 (ROCm) | 1 | ok | 48.742 | 43853670072 | 1809 |
| run_20260307_012643 (ROCm) | 2 | ok | 47.748 | 42307474936 | 1951 |
| run_20260307_012643 (ROCm) | -1 | ok | 44.265 | 42084598410 | 1849 |
| run_20260307_013050 (Vulkan) | 0 | ok | 45.002 | 41778665097 | 0 |
| run_20260307_013050 (Vulkan) | 1/2/-1 | error | 2.672 / 7.640 / 9.816 | — | — |
| run_20260308_201305 (ROCm) | 0/1/2/-1 | all ok | 47.98 / 49.18 / 48.30 / 45.17 | eval_count=512 each | normal completion |
| run_20260308_201641 (Vulkan) | 0 | ok | 45.53 | eval_count=512 | normal completion |
| run_20260308_201641 (Vulkan) | 1/2/-1 | error | 2.70 / 7.64 / 9.81 | — | — |
| run_20260308_201924 (Vulkan) | 0 × 5 epochs | 5/5 ok | 50.03 / 42.46 / 43.18 / 42.67 / 43.07 | eval_count=512 each | no crash |
Data source: vega_work_log/*/result.json | 16 matched-record rows extracted from the full log set.
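The aggregation behind these tables can be sketched in Python. The directory layout follows the Data source line above; the record field names (status, num_gpu, elapsed_sec) are assumptions matching the table columns, and result.json is assumed to hold a list of per-case records:

```python
# Sketch: group crash times of failed cases by num_gpu across runs.
# Field names are hypothetical, chosen to match the table columns above.
import glob
import json
import statistics
from collections import defaultdict

def aggregate_crashes(records):
    """Collect elapsed_sec of error cases, keyed by num_gpu."""
    crash = defaultdict(list)
    for rec in records:
        if rec.get("status") == "error":
            crash[rec["num_gpu"]].append(rec["elapsed_sec"])
    return crash

records = []
for path in glob.glob("vega_work_log/run_*/result.json"):
    with open(path) as f:
        records.extend(json.load(f))

for n, ts in sorted(aggregate_crashes(records).items(), key=str):
    print(f"num_gpu={n}: mean crash time {statistics.mean(ts):.3f}s (n={len(ts)})")
```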
| Timestamp (JST) | Event | Meaning |
|---|---|---|
| 01:31:37 | load_backend: libggml-vulkan.so loaded; "int dot: 0 | matrix cores: none" | Backend init completed; hardware feature caps confirmed |
| 01:31:38 | offloaded 1/25 layers to GPU (Vulkan0); compute graph Vulkan0: 257.1 MiB | Model load and GPU allocation completed normally |
| 01:31:39 | llama runner started in 1.72 seconds | Runner startup succeeded |
| 01:31:39 | SIGSEGV: segmentation violation (PC=0x7efbc5b801b5, addr=0x3a8000) | Runner process crash — first inference batch, not during load |
| 01:31:39 | ggml_backend_sched_graph_compute_async → computeBatch(runner.go:716) | Crash in compute graph scheduling path, via cgo |
| 01:31:39 | server.go: post predict … EOF | API side sees runner termination as EOF → HTTP 500 |
| 01:31:47 / 01:31:57 | Same sequence repeats for num_gpu=2 and -1 | Consistent pattern within these runs for all GPU-offload cases |
Source: vega_work_log/run_20260307_013050/ollama_journal_since_start.txt
From: vega_work_log/run_20260307_013050/ollama_journal_since_start.txt — immediately after runner reported ready
```text
SIGSEGV: segmentation violation
PC=0x7efbc5b801b5 m=41 sigcode=1 addr=0x3a8000
signal arrived during cgo execution

goroutine 853 [syscall]:
runtime.cgocall(0x55e54a0e2710, 0xc0000bdaa0)
        runtime/cgocall.go:167
github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_graph_compute_async(...)
        _cgo_gotypes.go:979
github.com/ollama/ollama/ml/backend/ggml.(*Context).ComputeWithNotify(...)
        ml/backend/ggml/ggml.go:825 +0x1b2
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(...)
        runner/ollamarunner/runner.go:716 +0x862
github.com/ollama/ollama/runner/ollamarunner.(*Server).run.gowrap1()
        runner/ollamarunner/runner.go:459
```
Crash point: C function ggml_backend_sched_graph_compute_async called via cgo. Go runtime receives SIGSEGV and terminates the runner. The exact Vulkan API call within that C function is not resolved from this trace alone.
| Test | Run / Condition | Observed result | Interpretation |
|---|---|---|---|
| Model swap | run_20260308_212345, Vulkan, MODEL=phi4-mini:latest, NUM_GPU=0,1 | num_gpu=0: ok (18.82s, 318 tok); num_gpu=1: ok (31.55s, 512 tok) | Offload crash is not universal on gfx900 Vulkan; it is model/workload-dependent. |
| Token-count control | run_20260308_212254, Vulkan, qwen3.5:2b, NUM_PREDICT=128, NUM_GPU=1 | error at 2.858s (vs mean 2.653s at NUM_PREDICT=512) | Crash timing is largely independent of requested output length. |
| Timing statistics | 3 qwen3.5:2b Vulkan runs, num_gpu>=1 | num_gpu=1: 2.653s ±0.059s; 2: 7.630s ±0.014s; -1: 9.819s ±0.016s | Failure timing is highly deterministic, not obviously a random race. |
Source: run_20260308_212345, run_20260308_212254, and analysis_summary.md derived from 15 runs / 119 records in vega_work_log/.
| Code location | Content | Impact on gfx900 |
|---|---|---|
| ggml-vulkan.cpp:250-252 | enum vk_device_architecture { OTHER, AMD_GCN, AMD_RDNA1, AMD_RDNA2, AMD_RDNA3 } | gfx900 → AMD_GCN branch |
| ggml-vulkan.cpp:296-297 | if (maxSubgroupSize==64 && minSubgroupSize==64) return AMD_GCN; | GCN wave64-only → classified AMD_GCN |
| ggml-vulkan.cpp:3964 | use_subgroups = subgroup_arithmetic && arch != AMD_GCN | Subgroup arithmetic disabled for gfx900 |
| ggml-vulkan.cpp:3241-3243 | AMD_GCN + open-source driver → altered warptile: {256,64,64,32,16,16,2,2,2,1,16} | Different matrix multiply tile parameters from RDNA |
| ggml-vulkan.cpp:3953-3957 | AMD_GCN: rm_stdq=2, rm_kq=4, rm_stdq_int=4 | Different row-reduce parameters |
Journal evidence (01:31:37):
ggml_vulkan: 0 = AMD Radeon RX Vega (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
int dot: 0 = VK_KHR_shader_integer_dot_product not hardware-accelerated. matrix cores: none = no cooperative matrix. Both are consistent with source-level AMD_GCN code paths.
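The classification predicate at ggml-vulkan.cpp:296-297 is simple enough to transcribe. This Python sketch simplifies the enum (RDNA1/2/3 collapsed into one member) and keeps only the subgroup-size logic shown above:

```python
# Sketch of the AMD architecture classification used by the Vulkan backend.
# The RDNA branch condition here is a simplified assumption for illustration;
# only the wave64-only → AMD_GCN rule is taken from the cited lines.
from enum import Enum

class VkArch(Enum):
    OTHER = 0
    AMD_GCN = 1
    AMD_RDNA = 2   # simplified: RDNA1/2/3 collapsed

def classify_amd(min_subgroup: int, max_subgroup: int) -> VkArch:
    if min_subgroup == 64 and max_subgroup == 64:
        return VkArch.AMD_GCN   # GCN is wave64-only (gfx900: warp size 64)
    if min_subgroup == 32:
        return VkArch.AMD_RDNA  # RDNA can run wave32
    return VkArch.OTHER

print(classify_amd(64, 64))  # VkArch.AMD_GCN — matches the RADV journal line
```

Once classified AMD_GCN, the device loses subgroup arithmetic (line 3964) and gets distinct warptile and row-reduce parameters, so gfx900 exercises shader paths that RDNA hardware never runs.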
| Point | File / Line | What it establishes |
|---|---|---|
| 1–2 | ollama/_types.py:104-110, _client.py:281-305 | num_gpu is a load-time option forwarded as-is to /api/generate; no GPU-count semantics in client |
| 3–4 | ollama/api/types.go:600-608, 1071-1076 | NumGPU belongs to model-load options; default -1 means dynamic policy, not "use all GPUs" |
| 5 | ollama/cmd/interactive.go:112 | CLI defines num_gpu explicitly as "the number of layers to send to the GPU" |
| 6–7 | ollama/llm/server.go:992, 1063-1076 | NumGPU is passed as requestedLayers; capped at len(layers) — confirms layer semantics |
| 8 | runner/llamarunner/runner.go:906-924 | NumGpuLayers accumulated from layer counts per device — not device count |
| 9–11 | llama/llama.go:264-267, include/llama.h:289 | n_gpu_layers: "number of layers to store in VRAM; negative value means all layers" |
| Hypothesis | Expected evidence | Observed (in these runs) |
|---|---|---|
| Vega is generally unusable | Both backends fail on same GPU | Contradicted: ROCm/HIP succeeded in these runs |
| num_gpu means GPU count | Failure implies multi-GPU entry | Contradicted: code trace establishes layer-count semantics |
| Model-/workload-dependent Vulkan compute-path issue on AMD_GCN | qwen3.5:2b fails at first compute batch via cgo, while phi4-mini completes at num_gpu=1 | Consistent with observed evidence |
| Tier | Statement |
|---|---|
| 1. Official support | gfx900 is not officially supported in ROCm 7.2 or ollama's default build targets. |
| 2. Conditional executability | In this environment, with HSA_OVERRIDE_GFX_VERSION=9.0.0 and pre-existing artifacts, ROCm/HIP ran successfully across all tested num_gpu values on two matched-run days. |
| 3. Shipped artifacts | The ROCm 7.2 package ships gfx900 Perf DB (169 K lines), 128 rocBLAS pre-compiled kernels, and 16 firmware blobs — exceeding gfx1100 (RDNA 3) and gfx1030 (RDNA 2) on multiple metrics. This implies active build-pipeline inclusion. |
| 4. Backend-dependent stability | Under matched qwen3.5:2b conditions across two days, Vulkan SIGSEGV'd for num_gpu>=1 while ROCm/HIP did not; Vulkan remained stable at num_gpu=0, including a 5-epoch run. A phi4-mini follow-up completed at num_gpu=1, indicating model dependence. |
Charts (omitted): rocBLAS pre-compiled file counts per GPU target; MIOpen Perf DB size (top variant, ×1 K lines) per GPU target.
| Artifact | Path | Role |
|---|---|---|
| Primary metrics | vega_work_log/run_*/result.json | Status, latency, error_type per num_gpu case |
| Crash phase logs | vega_work_log/run_*/ollama_journal_since_start.txt | Full stack trace and EOF timing |
| Backend identity | vega_work_log/run_*/backend_probe.txt | Confirms ROCm vs Vulkan service path, device recognition |
| Aggregate analysis | analyze_runs.py + analysis_summary.md | Full-log aggregation over 15 runs / 119 records; crash timing statistics and pivots |
| Code semantics trace | work_log/investigations/2026-03-07_numgpu_semantics_trace.md | Line-level evidence for num_gpu semantics |
| Gate analysis | work_log/investigations/2026-03-07_gfx900_gate_matrix.md | Layered block/allow model for gfx900 |
Replay command: OLLAMA_HOST=http://127.0.0.1:11435|11434 EPOCHS=1 NUM_PREDICT=512 NUM_GPU=0,1,2,-1 python vega-loop_qwen_rocm.py
Stability check: OLLAMA_HOST=http://127.0.0.1:11434 EPOCHS=5 NUM_GPU=0 NUM_PREDICT=512 python vega-loop_qwen.py
Limitations:
- All ROCm results depend on HSA_OVERRIDE_GFX_VERSION=9.0.0; the outcome without the override is not tested here. The override-off check was deferred because it requires service restart and recovery handling.
- The exact crash site inside ggml_backend_sched_graph_compute_async (C/Vulkan side) is not resolved from the Go stack trace alone.

Future work: additional models and quantization formats; a matched Ollama version pair; same-version override-off vs override-on comparison; deeper Vulkan tracing (RADV validation layers); broader multi-epoch stress under GPU-offload settings.
gfx900 is absent from the ROCm 7.2 support matrix and ollama's default HIP build targets. This is a factual statement about vendor policy, not execution capability.
In this environment, under override-assisted configuration, ROCm/HIP ran across all tested num_gpu values on both matched-run days. Five of six examined gate layers retained executable paths.
ROCm 7.2 ships gfx900 Perf DB (169 K lines), 128 rocBLAS kernels, and 16 firmware blobs — exceeding RDNA 3/2 on multiple metrics. This implies active build-pipeline inclusion, not passive code survival.
For qwen3.5:2b across two matched runs, Vulkan SIGSEGV'd for num_gpu>=1 while ROCm/HIP did not; Vulkan stayed stable at num_gpu=0 for 5/5 epochs. A phi4-mini follow-up passed at num_gpu=1, so the failure is not universal to gfx900 Vulkan.
Based on the collected evidence: "not officially supported" and "not executable in any configuration" are distinct statements. The shipped-artifacts finding adds a fourth dimension: vendor packaging actively includes gfx900 in ways that surpass some supported architectures.
The content of llvm-project-private#389 (a private issue) is not speculatively reconstructed here.