なぜ直接観測できないのか Why Direct Observation Is Blocked

このページで得られる理解：RX9070XT + Ollama の環境では「どの kernel が今動いているか」を直接読む手段が現時点で存在しない、という事実の確認と、その理由の記録。

What you'll gain here: the confirmed fact that no means currently exists to directly read which kernel is active in RX9070XT + Ollama, along with a record of why each approach fails.

このページが追う問いは一つ。RX9070XT 上の Ollama で、どの GPU kernel が使われているかを直接読める手段は存在するのか？
答えは「現時点では No」。wrap も attach も layer も、すべて機能しない。消去法の末に残るのは phase proxy（response JSON）だけ。
これは C14 portability pilot が「直接観測」ではなく「多層証拠の積み上げ」で進む理由の核心でもある。

One question drives this page: is there any means to directly read which GPU kernel is active on Ollama running on RX9070XT?
The answer is currently No. Wrap, attach, and layer logging all fail. What remains by elimination is only the phase proxy (response JSON).
This is also the core reason why the C14 portability pilot proceeds via multi-layer inference rather than direct observation.

驚きポイント： 通常の profiling 手段（wrap・attach・layer log）を使おうとすると、GPU 自体が動かなくなるか、OS に止められるか、見たいものが見えないかのいずれかになる。つまり 「何が動いているか」は、標準的な手段では読めない。 The surprising part: Every standard profiling approach (wrap, attach, layer log) either causes the GPU to stop being used, is blocked by the OS, or lacks sufficient granularity to see what matters. In short, there is no standard means to read "what is running."

gfx1201 C14 pilot 直接 observer: 全滅 Direct observers: all blocked phase proxy: 唯一の有効手段 phase proxy: only viable method

このページの結論 Page Conclusion

試した直接 observer はすべて機能しない（wrap → CPU fallback、--attach → ptrace ブロック、ROCBLAS_LAYER → 粒度不足）。
消去法の末に残るのは phase proxy（response JSON の prompt_eval_count）だけ。これは「優れた手段を選んだ」ではなく、「他がすべて機能しない状況で唯一生き残った手段」。
この事実が、C14 pilot が直接観測ではなく多層証拠の積み上げで進む理由を直接説明する。 All direct observer approaches tried are non-functional (wrap → CPU fallback, --attach → ptrace blocked, ROCBLAS_LAYER → insufficient granularity).
By elimination, only the phase proxy (prompt_eval_count from response JSON) remains. This was not chosen for its quality — it is the last approach standing after all others failed.
This is the direct explanation for why the C14 pilot proceeds via multi-layer inference rather than direct observation.

最近の source trace でさらに分かったこと：current Ollama path は llama.cpp の default LLAMA_SPLIT_MODE_LAYER を維持していて、ROW-mode split buffer を source 上 request していない。
そのため runtime の graph splits = 2 は split-buffer mode の直接証拠ではなく、 current phase-proxy 読みでは observer 制約 と source-side branch facts を分けて扱う必要がある。 One more thing recent source tracing clarified: the current Ollama path keeps llama.cpp at the default LLAMA_SPLIT_MODE_LAYER and does not request ROW-mode split buffers in source.
So runtime graph splits = 2 is not direct evidence of split-buffer mode, and the current phase-proxy reading must explicitly separate observer constraints from source-side branch facts.

やさしく言うと——たとえ話でまず理解する In Plain Terms — Start With an Analogy

GPU の中でどの「仕事」が動いているかを見たい。でも問題がある。のぞこうとすると、機械そのものが止まる。

We want to see which "job" is running inside the GPU. But there's a problem: if you try to look, the machine stops running.

いろんな方法を試した。結果を順に言う：

We tried several approaches. Here's what happened:

🔭 カメラを設置しようとした（rocprofv3 wrap）
→ カメラを置いた瞬間、GPU が「ない」と誤認されて止まった。カメラを置く前に終わった。
🚪 別の入口からこっそり入ろうとした（--attach）
→ 扉に鍵がかかっていた（OS のセキュリティ設定）。鍵を外せば入れるが、それは「普通の環境」ではなくなる。
🪟 外から小さな窓を通して見た（ROCBLAS_LAYER）
→ 「部屋が使われている」ことは分かった。でも「どの仕事をしているか」は窓が小さすぎて見えなかった。
📊 部屋から出てくる「数字のメーター」を読んだ（phase proxy）
→ 中には入れないが、外に出てくる数字を読んで「たぶんこっちの仕事をしている」と推理できた。
→ しかも、数字が変わる条件（プロンプトの長さ）を自分で操作できる。

🔭 Tried to set up a camera (rocprofv3 wrap)
→ The moment the camera was placed, the GPU was misdetected as absent and stopped. It ended before anything was recorded.
🚪 Tried to sneak in through another door (--attach)
→ The door was locked (OS security setting). Unlocking it would work, but that would no longer be a "standard environment."
🪟 Looked through a small window from outside (ROCBLAS_LAYER)
→ We could tell "the room is in use." But the window was too small to see which job was running.
📊 Read the "meter" coming out of the room (phase proxy)
→ Can't enter, but by reading numbers that come out, we can infer "it's probably doing this job."
→ And we can deliberately control the condition that changes the number (prompt length).

これは「あきらめた」ではなく「その環境で唯一有効な科学的手段を選んだ」。外から読める数字を手がかりに内部を推理する——これは実験科学として普通のやり方。地震計も、MRI も、温度計も、みんな「間接的に内部を読む」手段。 This is not "giving up" — it is "choosing the only scientifically valid approach available in that environment." Inferring the interior from externally readable signals is standard experimental science. Seismometers, MRIs, thermometers — all read the interior indirectly.

Observer 制約マップ（全手段の比較） Observer Constraint Map (all approaches)

Observer constraint map — rocprofv3 wrap Invalid, --attach Blocked, ROCBLAS_LAYER Partial, phase proxy Valid

クリックで拡大 · 2026-03-28 時点での各 observer の状態

Click to enlarge · Observer status as of 2026-03-28

手段① — 観測ツールをかぶせて起動する（rocprofv3 wrap）
結果：GPU が止まった（CPU fallback） Approach 1 — Launch with observation tool wrapped (rocprofv3 wrap)
Result: GPU stopped (CPU fallback)

観測ポイント： Ollama を起動するとき、外から GPU 観測ツールをかぶせて一緒に動かせば、「どの仕事が GPU に割り当てられたか」を記録できるか？

Observation target: If we launch Ollama with a GPU observation tool wrapped around it from the start, can we record which jobs get dispatched to the GPU?

最も直接的な手段として試みた。rocprofv3（AMD GPU の標準観測ツール。GPU にどんな仕事が投げられたかをログに残せる）を使って、ollama serve（Ollama をサーバーとして起動するコマンド）を「かぶせた状態で」起動した。つまり「Ollama を単独で動かす」のではなく、「観測ツールに見張らせながら起動する」やり方を試みた。

This was the most direct approach. We used rocprofv3 (AMD's standard GPU observation tool — it can log which jobs are sent to the GPU) and launched ollama serve (the command that starts Ollama as a server) with the observation tool wrapped around it. Rather than running Ollama standalone, we tried running it while being watched from the start.

何が起きたか：What happened: 観測ツールがかぶさっていることで、Ollama の「起動時に GPU を見つけて接続する準備処理（bootstrap discovery）」が失敗した。 GPU が認識されないまま、Ollama は GPU を使うのをあきらめて CPU で動き始めた（CPU fallback）。つまり、「GPU の仕事を記録する」より前に「GPU 自体が動かない」状態になってしまった。 The presence of the observation tool caused Ollama's "startup process of finding and connecting to the GPU (bootstrap discovery)" to fail. Without recognizing the GPU, Ollama gave up on GPU use and started running on CPU instead (CPU fallback). In other words, before any GPU job could be recorded, the GPU itself had stopped being used.

ここから分かること： この実行で得られたデータは「観測ツールが状況を変えてしまった」ものとして扱う。CPU で動いた記録は、RX9070XT での GPU 経路を観測したことにならない。
→ 次の問い： 「最初からかぶせる」のではなく、「起動後に後からくっつける」やり方なら経路を壊さずに記録できるか？

What this tells us: Data from this run is treated as "observation changed the situation." CPU-mode execution does not count as observing the GPU path on the RX9070XT.
→ Next question: Rather than wrapping from the start, can we attach after launch without disrupting the path?

手段② — 起動後に後からくっつける（rocprofv3 --attach）
結果：OS に止められた（ptrace ブロック） Approach 2 — Attach to already-running process (rocprofv3 --attach)
Result: Blocked by OS (ptrace blocked)

観測ポイント： Ollama を先に普通に起動し、動き始めたあとで観測ツールを後からつなげれば、起動処理を壊さずに GPU の仕事を記録できるか？

Observation target: Start Ollama normally first, then connect the observation tool after it's running — can we record GPU jobs without disrupting startup?

手段①では「最初からかぶせる」ことで起動処理が壊れた。ならば、Ollama を先に起動して GPU の準備が終わってから、後で観測ツールをくっつければいい——そう考えて試みた。
専門的に言うと：rocprofv3 --attach（動いているプロセスに後からアタッチするオプション）を使おうとした。

Approach 1 broke startup by wrapping from the beginning. So the idea was: start Ollama first, wait until GPU setup is done, then attach the observation tool afterward.
Technically: we tried using rocprofv3 --attach (the option to attach to an already-running process).

何が起きたか：What happened: OS のセキュリティ設定（ptrace_scope = 1）によって、「別のプログラムに後からくっつく」操作そのものが拒否された。
ptrace とは「OS が、別のプログラムの中をのぞいたり制御したりする許可」を管理する仕組み。この設定が「1（制限あり）」だと、無関係なプロセスからのくっつきは断られる。
設定を変更すれば動く可能性はあるが、それはシステムのセキュリティ条件を変えることを意味し、「普通の環境での観測」ではなくなる。 An OS security setting (ptrace_scope = 1) blocked the act of "attaching to another program after the fact" entirely.
ptrace is the mechanism the OS uses to manage permission for "looking inside or controlling another program." When set to 1 (restricted), attachments from unrelated processes are refused.
Changing this setting might allow it to work, but that means changing a system security condition — it would no longer be "observation in a normal environment."

ここから分かること： 後からくっつける方式は OS の設定に依存しており、標準的な環境では使えない。設定を変えて使うことは選択肢から外した。
→ 次の問い： rocprofv3 を使わずに、行列計算ライブラリの呼び出しだけを記録する方法はあるか？

What this tells us: The attach approach depends on OS settings and is unavailable in a standard environment. Changing the setting to make it work was ruled out.
→ Next question: Without using rocprofv3, is there a way to log just the matrix library calls?

手段③ — 行列計算ライブラリの呼び出しだけ記録する（ROCBLAS_LAYER=9）
結果：「使われた」は分かったが「何をしたか」は分からなかった（粒度不足） Approach 3 — Log just the matrix library calls (ROCBLAS_LAYER=9)
Result: Confirmed "it was used" but not "what it did" (insufficient granularity)

観測ポイント： GPU 全体を観測するのはあきらめて、行列計算専用ライブラリ（rocBLAS）への呼び出しだけをログに残せるか？

Observation target: Rather than observing the whole GPU, can we at least log calls made to the matrix computation library (rocBLAS)?

手段①②がどちらも動かせない状況で、もっと範囲を絞ることにした。LLM の推論では「行列のかけ算（GEMM）」が計算の大部分を占めており、そこを担当するのが rocBLAS（AMD 製の行列計算ライブラリ）。
専門的に言うと：ROCBLAS_LAYER=9 という環境変数を設定すると、rocBLAS への API 呼び出しをログとして出力できる。この方法はプロセス自体を壊さずに使える可能性があった。

With approaches 1 and 2 both unavailable, we narrowed the scope. In LLM inference, "matrix multiplication (GEMM)" dominates the computation, and rocBLAS (AMD's matrix computation library) handles that.
Technically: setting the environment variable ROCBLAS_LAYER=9 outputs a log of API calls made to rocBLAS, and this approach might work without disrupting the running process.

何が起きたか：What happened: 「rocBLAS が起動した（rocblas_create_handle が呼ばれた）」ことは確認できた。しかし、その後の行列計算ひとつひとつについてのログは出なかった。
つまり「この部屋は使われた」は分かったが、「誰が何の作業を何回やったか」は見えなかった——窓が小さすぎた。 We confirmed "rocBLAS started up (rocblas_create_handle was called)." But there was no log for each individual matrix computation afterward.
In short: we could tell "this room was in use," but not "who did what work, how many times" — the window was too small.

ここから分かること： 「行列計算ライブラリにアクセスがあった」ことは確認できた。ただし、LLM の Q4_K の処理は多くの場合、rocBLAS を経由しない独自カーネル（MMVQ/MMQ）が担っており、そのルートには ROCBLAS_LAYER の窓が届かない。rocBLAS が使われる「BLAS fallback」という別ルートに入らない限り、何も見えない。
→ 次の問い： この 3 つをすべて除いた後に、何か残るものはあるか？

What this tells us: We confirmed "the matrix library was accessed." However, Q4_K processing in LLM inference mostly takes a custom kernel path (MMVQ/MMQ) that doesn't go through rocBLAS, and ROCBLAS_LAYER can't see that route. It only sees anything when the "BLAS fallback" route is taken.
→ Next question: After ruling out all three of these, is there anything left?

手段④ — 「返ってきた数字」を手がかりに推理する（phase proxy）
消去法の末に残った唯一の手段 Approach 4 — Infer from the "number that comes back" (phase proxy)
The last approach standing

観測ポイント： Ollama の動作を一切壊さずに、「GPU の中でどの処理ルートを通ったか」をある程度推理できる手段があるか？

Observation target: Without touching Ollama's runtime at all, is there any way to roughly infer which processing route was taken inside the GPU?

手段①〜③がすべて使えない状況で、発想を変えた。
ソースコードを読み返すと、GPU の中での「どのカーネル（仕事の担当者）を使うか」の分岐は、行列の列数（ne11）で決まっている。そしてこれは、「プロンプトに含まれるトークン数（単語の数のようなもの）」とほぼ一致する。
ならば、Ollama が返す response JSON の中に含まれる prompt_eval_count（プロンプトを何トークン処理したか）を読むことで、ne11 の代わりに使えないか？
専門的に言うと：これが「phase proxy（間接的な近似指標）」の発想。

With approaches 1–3 all blocked, we changed our thinking.
Re-reading the source code: the branch that decides "which kernel (worker) to use" inside the GPU is determined by the number of matrix columns (ne11). And this corresponds closely to "the number of tokens (roughly, words) in the prompt."
So — could we read prompt_eval_count (how many prompt tokens were processed) from Ollama's response JSON and use it as a stand-in for ne11?
Technically: this is the idea behind "phase proxy (an indirect approximate indicator)."

何が起きたか（Phase F probe 結果）：What happened (Phase F probe results): prompt_eval_count = 7 / 8 → MMVQ 圏（ne11 ≤ 8 に対応）。 prompt_eval_count = 9 / 11 / 12 → MMQ-eligible 圏（8 < ne11 ≤ 256 に対応）。 runtime は壊れず、GPU 上で正常に実行された。 prompt_eval_count = 7 / 8 → MMVQ zone (maps to ne11 ≤ 8). prompt_eval_count = 9 / 11 / 12 → MMQ-eligible zone (maps to 8 < ne11 ≤ 256). The runtime was not disrupted; execution proceeded on GPU normally.

実際にやっていること（手順）：

① 単純な word ladder prompt を投げる（例：6語, 7語, 8語, 10語, 11語）
② response JSON の prompt_eval_count を読む → 7 / 8 / 9 / 11 / 12
③ 7 / 8 は ne11 ≤ 8 → MMVQ 圏 と読む
④ 9 / 11 / 12 は 8 < ne11 ≤ 256 → MMQ 圏 と読む

→ runtime は壊れず、GPU 上で正常実行。しかも `8 -> 9` の境界そのもの を外から踏み分けられた。 Concrete steps (what you actually do):

① Send a simple word-ladder prompt (e.g., 6, 7, 8, 10, 11 words)
② Read prompt_eval_count from response JSON → 7 / 8 / 9 / 11 / 12
③ Read 7 / 8 as MMVQ-side because ne11 ≤ 8
④ Read 9 / 11 / 12 as MMQ-side because 8 < ne11 ≤ 256

→ Runtime intact, GPU running normally. We can now straddle the `8 -> 9` transition itself.

さらに source から詰めると：
RDNA4 では `ggml_cuda_should_use_mmq(...)` が `Q4_K` で source-side true になり、 non-split の branch order は MMVQ -> MMQ -> BLAS fallback の順で評価される。
なので current best reading は、`9 / 11 / 12` は「MMQ-eligible」なだけでなく、source 上も `mul_mat_q` 側へ寄る。
ただしこれは direct dispatch proof ではなく、live case の split-buffer predicate 自体はまだ runtime から直接読めていない。 Tightening it from source as well:
On RDNA4, `ggml_cuda_should_use_mmq(...)` is source-side true for `Q4_K`, and the non-split branch order is evaluated as MMVQ -> MMQ -> BLAS fallback.
So the current best reading is that `9 / 11 / 12` is not merely MMQ-eligible, but also source-leaning toward `mul_mat_q`.
This is still not direct dispatch proof, because the live case's split-buffer predicate is not directly exported by runtime logs.

追加で binary 側から絞れたこと：
prompt-side の `Q4_K` owner 候補の local bundle neighborhood を spot check すると、 Q4_1 -> Q4_K -> Q5_0 -> Q5_1 -> Q5_K -> Q6_K -> Q8_0 -> Flash と並んで見える。
つまり少なくともこの近傍には、第2の nearby `Q4_K` MMQ owner は見えていない。
まだ direct dispatch proof ではないが、`9 / 11 / 12` を既知の `Q4_K` artifact に寄せて読む根拠は一段強くなった。 One more thing narrowed from the binary side:
A spot check of the local bundle neighborhood around the prompt-side `Q4_K` owner candidate reads as Q4_1 -> Q4_K -> Q5_0 -> Q5_1 -> Q5_K -> Q6_K -> Q8_0 -> Flash.
So at least in this local neighborhood, no second nearby `Q4_K` MMQ owner is visible.
This is still not direct dispatch proof, but it makes the `9 / 11 / 12` reading toward the known `Q4_K` artifact materially stronger.

なぜこれが研究として面白いのか：
phase proxy は「読む」だけでなく、「分岐を意図的に踏ませる操作手段」でもある。
プロンプトの長さを変えることで、GPU 内の分岐を自分でコントロールできる。
「見えないものを外から操作して反応を見る」——これは実験科学の基本。
direct observer が全滅したからこそ、この方法の「操作可能性」が光る。 Why this is interesting as research:
Phase proxy is not just "reading" — it is also "a controllable tool for deliberately triggering a branch."
By varying prompt length, we can directly control which internal GPU branch runs.
"Controlling the invisible from outside and watching its response" — this is the basis of experimental science.
Precisely because all direct observers failed, this approach's controllability stands out.

ここから分かること： phase proxy は「dispatch の直接確認」ではなく「dispatch 圏の近似推定」。それでも、runtime を壊さずに dispatch 圏を区別できる現時点で唯一の手段であることが確認できた。

What this tells us: The phase proxy is not direct dispatch confirmation — it is an approximate inference of the dispatch zone. However, it has been confirmed as the only currently available means to distinguish dispatch zones without disrupting the runtime.

この制約のもとで示せること / 示せないこと What Can and Cannot Be Shown Under These Constraints

示せることCan Show	示せないことCannot Show
rocprofv3 wrap が CPU fallback を引き起こすこと（observer 汚染の確認） rocprofv3 wrap causes CPU fallback (observer contamination confirmed)	CPU fallback 時の実行内容（計測対象が変わっている） What ran during CPU fallback (the measurement target has changed)
ptrace_scope=1 が --attach をブロックすること ptrace_scope=1 blocks --attach	ptrace_scope を下げた環境での attach の挙動（調査環境の前提を変えてしまう） Attach behavior with ptrace_scope lowered (alters the investigation baseline)
rocBLAS の handle が作成されていること（ROCBLAS_LAYER=9 で確認） rocBLAS handle is created (confirmed via ROCBLAS_LAYER=9)	per-GEMM の dispatch 帰属（ROCBLAS_LAYER では粒度不足） Per-GEMM dispatch attribution (ROCBLAS_LAYER lacks sufficient granularity)
phase proxy で dispatch 圏の近似を読めること Phase proxy can approximate the dispatch zone	kernel launch 時の正確な ne11 値（response JSON からは直接取れない） Exact ne11 value at kernel launch (not directly readable from response JSON)

調査フェーズ設計の意図（C14 pilot: Phase A–F） Investigation Phase Design Rationale (C14 Pilot: Phase A–F)

observer が軒並みブロックされる状況を受けて、調査は「直接観測」ではなく「多層証拠の積み上げ」で進めることになった。 Phase A–F はその積み上げの設計。各層で「何を固定したか」を確認できる。

Facing blocked direct observers, the investigation shifted to building multi-layer evidence rather than direct observation. Phase A–F is the design for that process — each phase fixes a specific layer.

Phase A

ソース footing。dispatch 分岐の閾値（ne11 ≤ 8 / ≤ 256）をソースで確定。

Source footing. Fix dispatch thresholds (ne11 ≤ 8 / ≤ 256) in source.

Phase B

呼び出しチェーン確定。Ollama → ggml → hipBLAS / custom の経路をソースで追跡。

Fix call chain. Trace Ollama → ggml → hipBLAS / custom path through source.

Phase C

ライブラリロード確認。/proc/maps で 12 ライブラリのロードを確認。

Library load confirmation. Verify 12 libraries loaded via /proc/maps.

Phase D

binary 目録。fatbin 内の gfx1201 hsaco を列挙し、MMVQ/MMQ/BLAS の binary が存在することを確認。

Binary inventory. Enumerate gfx1201 hsacos in fatbin; confirm MMVQ/MMQ/BLAS binaries exist.

Phase E

逆アセンブリ。Q4_K 候補 bundle を isolate し、register footprint・命令パターンを確定。

Disassembly. Isolate Q4_K candidate bundles; fix register footprints and instruction patterns.

Phase F

Phase proxy probe。runtime 相関を phase proxy で取得。現在の frontier。

Phase proxy probe. Obtain runtime correlation via phase proxy. Current frontier.

次の方法論ステップ — 量子化横断 dispatch map Next Method Step — Quantization-Agnostic Dispatch Map

direct observer がまだ blocked のままなので、次に広げるべきなのは workload を無秩序に増やすことではなく、 source 側で tensor type から dispatch family への地図を先に作ること。ここでは、いまの Q4_K_M case study と、量子化方式をまたぐ一般 map を意図的に分ける。

With direct observers still blocked, the next expansion is not to add workloads indiscriminately, but to build a source-side map from tensor type to dispatch family first. We intentionally separate the current Q4_K_M case study from the broader cross-quantization map.

ここで重要なのは、file type と tensor type が同じではないこと。 `Q4_K_M` というラベルを持つモデルでも、実際の tensor 側では `Q4_K` だけでなく `Q5_K` / `Q6_K` / `Q8_0` が混ざりうる。 The key point is that file type and tensor type are not the same. Even a model labeled Q4_K_M can contain tensor-side mixes such as `Q4_K`, `Q5_K`, `Q6_K`, and `Q8_0`.

見る層Layer to map	今回固定したことWhat is now fixed
量子化ラベルQuantization label	`Q4_K_M` のような file type は model-level label であり、tensor-level reality を全部は表さない。 A file type such as `Q4_K_M` is a model-level label and does not fully describe tensor-level reality.
matmul 分岐Matmul split	quantized matmul は source 上で `MMVQ (≤ 8)` / `MMQ` / BLAS fallback に分かれる。 Quantized matmul splits in source into `MMVQ (≤ 8)`, `MMQ`, and BLAS fallback.
隣接 opAdjacent ops	`get_rows`、dequant/convert、RoPE、attention は量子化 coverage が揃わないので、同一 map と見なさない。 `get_rows`, dequant/convert, RoPE, and attention do not share one uniform quantization coverage map.

この一般 map を広げる最初の別枝としては IQ1_S を先に置く。理由は、source 上で `MMVQ` / `MMQ` の両方に entry があり、しかも既存の RDNA4 fatbin には `bundle_0082` という `gfx1201` `mul_mat_q` anchor がすでに見えているから。さらに practical path を分解すると、`ollama create -q IQ1_S` のような requantize route は狭いが、existing GGUF の direct import path は別経路で残っている。`Q5_K_M` はその次に回す。

The first branch to broaden this general map with is IQ1_S. The reason is that it already has source-side entry into both `MMVQ` and `MMQ`, and the current RDNA4 fatbin already exposes a `gfx1201` `mul_mat_q` anchor in `bundle_0082`. Practical-path analysis also split further: the `ollama create -q IQ1_S` requantize route is narrow, while direct import of an existing GGUF appears to remain a separate path. `Q5_K_M` comes after that.

follow-up では、この practical split を実機で確認した。small single-file `IQ1_S` GGUF (`Llama-3.2-1B-Instruct-UD-IQ1_S.gguf`) を `Modelfile` から direct import すると、 `ollama create -> show -> api/generate -> ollama ps` まで current host で成立し、resident case では `100% GPU` が見えた。つまり current Ollama では、`IQ1_S` は「requantize helper は狭いが、existing GGUF direct import は通る」と書くのが正確。 A follow-up practical check confirmed this split on real hardware. A small single-file `IQ1_S` GGUF (`Llama-3.2-1B-Instruct-UD-IQ1_S.gguf`) successfully passed through `ollama create -> show -> api/generate -> ollama ps` on the current host, and the resident case showed `100% GPU`. So for current Ollama, the accurate statement is: `IQ1_S` has a narrow requantize helper path, but existing-GGUF direct import does work.

さらに bounded probe では、`IQ1_S` も `Q4_K` に近い二段構えで読めた。 binary 側では `bundle_0030` に `IQ1_S` MMVQ (`ggml_type19 1..8`)、`bundle_0082` に `IQ1_S` MMQ (`ggml_type19 8..128`) があり、 live prompt window でも `prompt_eval_count = 7` は `MMVQ-compatible`、`13` は `MMQ-eligible` と読むのが自然。まだ exact dispatch proof ではないが、`IQ1_S` も source / binary / phase proxy の三層で narrowing できるところまで来た。 A bounded probe also suggests that `IQ1_S` can be read in a two-stage pattern similar to `Q4_K`. On the binary side, `bundle_0030` carries `IQ1_S` MMVQ (`ggml_type19 1..8`) while `bundle_0082` carries `IQ1_S` MMQ (`ggml_type19 8..128`). On the live side, `prompt_eval_count = 7` reads as `MMVQ-compatible`, while `13` reads as `MMQ-eligible`. This is still not exact dispatch proof, but it means `IQ1_S` can now be narrowed across source, binary, and phase-proxy layers.

その次の近縁比較対象としての `Q5_K_M` も、もう footing だけではない。 small single-file `Q5_K_M` GGUF の direct import は current host で通り、 `ollama create -> show -> api/generate -> ollama ps` まで成立した。 bounded probe では short window が `prompt_eval_count = 7`、medium window が `13` で、resident runner も `100% GPU` だった。さらに `Q5_K` MMQ 側も `bundle_0098` として exact extracted `gfx1201` hsaco まで取れた。なので `13` window は source だけでなく exact object 側からも `MMQ-eligible` と読める。ただし extracted object の存在と live dispatch の実証は別なので、exact runtime ownership はまだ unresolved。 The next near-neighbor comparison target, `Q5_K_M`, is no longer just source/binary footing. A small single-file `Q5_K_M` GGUF now passes through `ollama create -> show -> api/generate -> ollama ps` on the current host. In the bounded probe, the short window stayed at `prompt_eval_count = 7`, the medium window reached `13`, and the resident runner remained `100% GPU`. The `Q5_K` MMQ side has now also been isolated as exact extracted `gfx1201` hsaco in `bundle_0098`. That means the `13` window is `MMQ-eligible` not only from source shape, but also from exact object presence. Even so, extracted-object presence and live dispatch proof are different stages, so exact runtime ownership remains unresolved.

ここまでで比較の三本柱も固まった。現在の整理は、 `Q4_K_M` = main case study、 `Q5_K_M` = もっとも近い near-neighbor、 `IQ1_S` = 最初の branch broadener、という役割分担である。 3 本とも source skeleton、exact extracted MMVQ/MMQ anchor、practical ingest/runtime viability、bounded phase proxy までは揃った。つまり次の未確定は「object があるか」ではなく、「live window をどれが支配したか」である。 At this point the three-way comparison frame is also fixed. The current division is: `Q4_K_M` = main case study, `Q5_K_M` = closest near-neighbor, `IQ1_S` = first branch broadener. All three now have source skeletons, exact extracted MMVQ/MMQ anchors, practical ingest/runtime viability, and bounded phase-proxy narrowing. So the next uncertainty is no longer "does the object exist?" but rather "which object owned the live window?"

次の観測点 Where to Look Next

dispatch-safe observer が得られたら： ptrace_scope の制約が解除された環境、または rocprofv3 の将来バージョンで bootstrap を壊さない attach が可能になった場合、per-dispatch kernel attribution を再試行する。
optimization フェーズへの移行条件： dispatch-safe observer の確保が前提。それなしに最適化を始めると、改善を証明できないだけでなく退化も見逃す。
observer が blocked な間の次善策： `Q4_K_M` ケース研究を維持しつつ、direct-import と bounded phase proxy まで進んだ `IQ1_S` / `Q5_K_M` を別枝比較対象として並べ、そのうえで `tensor type -> MMVQ / MMQ / BLAS fallback` の量子化横断 dispatch map を広げる。
このページを読んだ後は： 経路観測ページ（どの kernel が存在し、phase proxy で何を読んだか）と調査概観（C14 portability pilot 全体像）。

When a dispatch-safe observer becomes available: If ptrace_scope restrictions are lifted, or a future version of rocprofv3 allows bootstrap-safe attach, retry per-dispatch kernel attribution.
Prerequisite for optimization phase: A dispatch-safe observer must be secured first. Starting optimization without it means you cannot prove improvements — and cannot detect regressions.
Best next move while observers remain blocked: Keep the `Q4_K_M` case study, use the now-confirmed `IQ1_S` and `Q5_K_M` direct-import plus bounded-phase-proxy cases as alternative branches, then broaden the source-side `tensor type -> MMVQ / MMQ / BLAS fallback` map before adding more workloads.
After this page: See the path observation page (which kernels exist and what the phase proxy read) and the investigation overview (C14 portability pilot framing).

掲載情報は観測記録に基づきます。「示せないこと」欄は調査の欠如ではなく observer 制約による限界です。 Content is grounded in observation logs. "Cannot Show" entries reflect observer constraints, not gaps in investigation effort.

なぜ直接観測できないのか Why Direct Observation Is Blocked

このページの結論 Page Conclusion

やさしく言うと——たとえ話でまず理解する In Plain Terms — Start With an Analogy

Observer 制約マップ（全手段の比較） Observer Constraint Map (all approaches)

手段① — 観測ツールをかぶせて起動する（rocprofv3 wrap）結果：GPU が止まった（CPU fallback） Approach 1 — Launch with observation tool wrapped (rocprofv3 wrap)Result: GPU stopped (CPU fallback)

手段② — 起動後に後からくっつける（rocprofv3 --attach）結果：OS に止められた（ptrace ブロック） Approach 2 — Attach to already-running process (rocprofv3 --attach)Result: Blocked by OS (ptrace blocked)

手段④ — 「返ってきた数字」を手がかりに推理する（phase proxy）消去法の末に残った唯一の手段 Approach 4 — Infer from the "number that comes back" (phase proxy)The last approach standing

この制約のもとで示せること / 示せないこと What Can and Cannot Be Shown Under These Constraints

調査フェーズ設計の意図（C14 pilot: Phase A–F） Investigation Phase Design Rationale (C14 Pilot: Phase A–F)

次の方法論ステップ — 量子化横断 dispatch map Next Method Step — Quantization-Agnostic Dispatch Map

次の観測点 Where to Look Next

手段① — 観測ツールをかぶせて起動する（rocprofv3 wrap）
結果：GPU が止まった（CPU fallback） Approach 1 — Launch with observation tool wrapped (rocprofv3 wrap)
Result: GPU stopped (CPU fallback)

手段② — 起動後に後からくっつける（rocprofv3 --attach）
結果：OS に止められた（ptrace ブロック） Approach 2 — Attach to already-running process (rocprofv3 --attach)
Result: Blocked by OS (ptrace blocked)

手段④ — 「返ってきた数字」を手がかりに推理する（phase proxy）
消去法の末に残った唯一の手段 Approach 4 — Infer from the "number that comes back" (phase proxy)
The last approach standing