Code Trace: What `num_gpu` Really Means Code Trace: `num_gpu` の本当の意味

Standalone supplementary deck with exact line references 厳密な行番号付きの単体補助資料

11 pages total: 1 overview + 9 source points + 1 conclusion 全11ページ: 概要1 + ソース根拠9 + まとめ1

Question: does `num_gpu` mean GPU count, or layer-offload count? 問い: `num_gpu` は GPU 枚数なのか、それとも offload 層数なのか？

1

Trace Map and Final Answerトレース全体図と最終結論

Trace chain 追跡チェーン

Pointポイント	File / linesファイル / 行	What it says何を言っているか
1	_types.py:104-110	`num_gpu` exists as a load-time option`num_gpu` は load-time option として存在
2	_client.py:281-305	client forwards `options` unchangedclient は `options` を意味変換せず転送
3	api/types.go:600-608, 1071-1076	Go side stores `NumGPU`; default `-1` means dynamicGo 側で `NumGPU` を保持; 既定 `-1` は動的決定
4	cmd/interactive.go:108-114	CLI literally says “number of layers”CLI が文字通り「layers」と説明
5	llm/server.go:992, 1063-1078	server treats it as `requestedLayers`server は `requestedLayers` として扱う
6	runner.go:906-925	runner sums layer countsrunner は層数を合計する
7	llama.go:260-267	bridge writes it into `n_gpu_layers`bridge が `n_gpu_layers` に書き込む
8	llama.h:286-291	public API defines it as layers in VRAM公開 API が VRAM に置く層数と定義
9	common.h:378-384	supplemental semantics: `-1 auto`, `<= -2 all`補足意味: `-1 auto`, `<= -2 all`

flowchart TD P1["① _types.py\nnum_gpu as load-time option"] P2["② _client.py\nforwards options unchanged"] P3["③ api/types.go\nNumGPU int, default -1"] P4["④ CLI help text\n'number of layers to offload to GPU'"] P5["⑤ server.go\ntreats it as requestedLayers"] P6["⑥ runner.go\nsums layer counts"] P7["⑦ llama.go\nwrites to n_gpu_layers"] P8["⑧ llama.h public header\nlayers to store in VRAM"] P9["⑨ common.h semantics\n-1=auto, le-2=all layers"] P1 --> P2 --> P3 --> P4 --> P5 --> P6 --> P7 --> P8 --> P9 P9 --> CONCL["num_gpu = layer count to offload\nNOT number of GPU devices"] style P4 fill:#fff8cc,stroke:#aaaa44 style P8 fill:#fff8cc,stroke:#aaaa44 style CONCL fill:#ddeeff,stroke:#3366aa,color:#003

Final answer: `num_gpu` is not the number of GPUs. It is the number of model layers to offload to GPU/VRAM. Therefore, `num_gpu=2` does not mean “two GPUs were used.” 最終結論: `num_gpu` は GPU 枚数ではない。GPU/VRAM に offload するモデル層数である。したがって、`num_gpu=2` は「GPUを2枚使った」ことを意味しない。

Easy readingやさしい読み方

Think of `num_gpu` as “how many thinking blocks move to GPU.”`num_gpu` は「何個の考えるブロックを GPU に移すか」と考える。
This deck proves the meaning by reading the code from client to llama.cpp.この資料は client から llama.cpp までコードを順に読んで意味を証明する。

2

Point 1 — Python type definitionポイント1 — Python 型定義

tank/docs-ref/llama/ollama-python/ollama/_types.py:104-110

104 | class Options(SubscriptableBaseModel):
105 |   # load time options
106 |   numa: Optional[bool] = None
107 |   num_ctx: Optional[int] = None
108 |   num_batch: Optional[int] = None
109 |   num_gpu: Optional[int] = None
110 |   main_gpu: Optional[int] = None

Strict reading: `num_gpu` appears inside `Options`, under the comment `# load time options`. At this point it is just an integer field. No text here says “GPU count.” 厳密な読み: `num_gpu` は `Options` 内、`# load time options` の下に置かれている。この時点では単なる整数フィールドであり、「GPU枚数」とは書かれていない。

Simple wordsやさしい言い方

The Python client has a box named `num_gpu`.Python クライアントには `num_gpu` という箱がある。
But the box label does not yet tell us what the number means.ただし、その数字の意味はまだここだけでは分からない。

This page only proves existence and category: `num_gpu` is a load-time option. このページで証明できるのは存在と分類だけ: `num_gpu` は load-time option である。

3

Point 2 — Python client sends it unchangedポイント2 — Python クライアントは意味変換せず送る

tank/docs-ref/llama/ollama-python/ollama/_client.py:281-305

281 | return self._request(
282 |   GenerateResponse,
283 |   'POST',
284 |   '/api/generate',
285 |   json=GenerateRequest(
... |
296 |     images=list(_copy_images(images)) if images else None,
297 |     options=options,
298 |     keep_alive=keep_alive,
299 |     width=width,
300 |     height=height,
301 |     steps=steps,
302 |   ).model_dump(exclude_none=True),

Strict reading: the client inserts `options=options` into the JSON request body for `/api/generate`. No remapping like `num_gpu -> gpu_count` happens here. 厳密な読み: client は `/api/generate` の JSON 本文に `options=options` をそのまま入れている。ここで `num_gpu -> gpu_count` のような意味変換は起きていない。

Simple wordsやさしい言い方

The client is just a messenger.client はただの配達人。
So if we want the true meaning, we must keep reading the server side.だから本当の意味を知るには server 側を読み続ける必要がある。

Client code does not define semantics; it only forwards the field. client code は意味を定義しない。フィールドを転送するだけである。

4

Point 3 — Go API type and default valueポイント3 — Go API の型と既定値

tank/docs-ref/llama/ollama/api/types.go:600-608, 1071-1076

600 | // Runner options which must be set when the model is loaded into memory
601 | type Runner struct {
602 |     NumCtx    int   `json:"num_ctx,omitempty"`
603 |     NumBatch  int   `json:"num_batch,omitempty"`
604 |     NumGPU    int   `json:"num_gpu,omitempty"`
605 |     MainGPU   int   `json:"main_gpu,omitempty"`
...
1071 | Runner: Runner{
1072 |   // options set when the model is loaded
1073 |   NumCtx:    int(envconfig.ContextLength()),
1074 |   NumBatch:  512,
1075 |   NumGPU:    -1, // -1 here indicates that NumGPU should be set dynamically

Strict reading: `NumGPU` belongs to `Runner` options used when the model is loaded into memory. Its default is `-1`, and the comment explicitly says this means dynamic setting, not a fixed GPU-device count. 厳密な読み: `NumGPU` はモデルをメモリにロードするときの `Runner` option に属する。既定値は `-1` で、コメントはこれを「動的に設定する」意味だと明示している。固定GPU台数ではない。

Simple wordsやさしい言い方

The server keeps the field as `NumGPU`.server 側でも名前は `NumGPU` のまま残る。
`-1` means “decide automatically,” so the number behaves like a policy/control value.`-1` は「自動で決める」なので、この値は方針や配分を決めるための値として振る舞う。

The default already points toward offload policy semantics, not GPU-count semantics. 既定値の意味だけでも、GPU枚数より offload 方針の意味に近いことが分かる。

5

Point 4 — CLI help gives the clearest wordingポイント4 — CLI ヘルプが最も明確に言い切る

tank/docs-ref/llama/ollama/cmd/interactive.go:108-114

108 | fmt.Fprintln(os.Stderr, "  /set parameter num_ctx           Set the context size")
109 | fmt.Fprintln(os.Stderr, "  /set parameter temperature     Set creativity level")
110 | fmt.Fprintln(os.Stderr, "  /set parameter repeat_penalty  How strongly to penalize repetitions")
111 | fmt.Fprintln(os.Stderr, "  /set parameter repeat_last_n     Set how far back to look for repetitions")
112 | fmt.Fprintln(os.Stderr, "  /set parameter num_gpu           The number of layers to send to the GPU")
113 | fmt.Fprintln(os.Stderr, "  /set parameter stop   ...   Set the stop parameters")

Strict reading: the product's own CLI help text explicitly defines `num_gpu` as “The number of layers to send to the GPU.” This is direct semantic evidence. 厳密な読み: 製品自身の CLI ヘルプ文言が `num_gpu` を「GPUへ送る層数」と明示している。これは直接的な意味証拠である。

Simple wordsやさしい言い方

This is the strongest plain-English clue.これは一番分かりやすい直接証拠。
If the program authors say “layers,” we should not reinterpret it as “GPUs.”作者が「layers」と言っているなら、「GPUs」と読み替えてはいけない。

This page alone almost settles the question. このページだけでも問いはほぼ決着する。

6

Point 5 — Server code uses `requestedLayers`ポイント5 — Server code は `requestedLayers` として使う

tank/docs-ref/llama/ollama/llm/server.go:992, 1063-1078

992  | libraryGpuLayers := assignLayers(layers, gl, requireFull, s.options.NumGPU, lastUsedGPU)
...
1063 | func assignLayers(layers []uint64, gpus []ml.DeviceInfo, requireFull bool, requestedLayers int, lastUsedGPU int) (gpuLayers ml.GPULayersList) {
1064 |   // If the user is manually overriding parameters, treat all GPUs equally so they split according to VRAM
1065 |   if requestedLayers >= 0 || envconfig.SchedSpread() {
...
1073 |   // requestedLayers may be -1 if nothing was requested
1074 |   requestedLayers = min(len(layers), requestedLayers)

Strict reading: `s.options.NumGPU` is passed into `assignLayers()` and immediately named `requestedLayers`. Then it is capped by `min(len(layers), requestedLayers)`. This is unmistakably layer-count logic. 厳密な読み: `s.options.NumGPU` は `assignLayers()` に渡され、直ちに `requestedLayers` と名付けられる。その後 `min(len(layers), requestedLayers)` で上限が掛けられる。これは明らかに層数ロジックである。

Simple wordsやさしい言い方

The server treats the number as “how many layers are requested.”server はこの数を「何層ほしいか」として扱っている。
If the model has only so many layers, the number is clipped to that limit.モデルの層数を超える場合は、その最大値までに切り詰める。

This is implementation-level proof, not just help text. これはヘルプ文言ではなく、実装レベルの証拠である。

7

Point 6 — Runner sums layer countsポイント6 — Runner は層数を合計する

tank/docs-ref/llama/ollama/runner/llamarunner/runner.go:906-925

906 | numGPU := 0
907 | var tensorSplit []float32
908 | var llamaIDs []uint64
...
912 | for _, layers := range req.GPULayers {
913 |   for i := range gpuIDs {
914 |     if gpuIDs[i].DeviceID == layers.DeviceID {
915 |       numGPU += len(layers.Layers)
916 |       tensorSplit = append(tensorSplit, float32(len(layers.Layers)))
917 |       llamaIDs = append(llamaIDs, gpuIDs[i].LlamaID)
...
922 | params := llama.ModelParams{
923 |   Devices:      llamaIDs,
924 |   NumGpuLayers: numGPU,

Strict reading: `numGPU` is calculated by `numGPU += len(layers.Layers)`. That means the variable is an accumulated layer count. It is then written into `NumGpuLayers`. 厳密な読み: `numGPU` は `numGPU += len(layers.Layers)` で計算されている。つまりこの変数は累積層数である。その値が `NumGpuLayers` に書かれる。

Simple wordsやさしい言い方

Here the program literally counts layers, not cards.ここでは文字通り、GPUカード枚数ではなく層を数えている。
The variable name `numGPU` is a bit misleading, but the math shows the real meaning.変数名 `numGPU` は少し紛らわしいが、計算内容が本当の意味を示している。

Behavior matters more than the variable name: the behavior is layer counting. 変数名より実際の振る舞いが重要であり、その振る舞いは層数カウントである。

8

Point 7 — Go/C bridge writes to `n_gpu_layers`ポイント7 — Go/C ブリッジは `n_gpu_layers` に書き込む

tank/docs-ref/llama/ollama/llama/llama.go:260-267

260 | callback(float32(progress))
261 | return true
262 | }
263 |
264 | func LoadModelFromFile(modelPath string, params ModelParams) (*Model, error) {
265 |   cparams := C.llama_model_default_params()
266 |   cparams.n_gpu_layers = C.int(params.NumGpuLayers)
267 |   cparams.main_gpu = C.int32_t(params.MainGpu)

Strict reading: the runner-side value `params.NumGpuLayers` is copied directly into the C struct field `n_gpu_layers`. No reinterpretation happens at the bridge. 厳密な読み: runner 側の `params.NumGpuLayers` は C 構造体の `n_gpu_layers` に直接コピーされる。ブリッジ地点で意味の読み替えは起きない。

Simple wordsやさしい言い方

The baton is passed without changing the label.バトンは名前も意味も変えずに次へ渡される。
So the final meaning must be checked in llama.cpp itself.だから最後の意味は llama.cpp 本体で確認すればよい。

Bridge code preserves the layer-count semantics. ブリッジコードは層数の意味を保ったまま渡している。

9

Point 8 — llama.cpp public header is explicitポイント8 — llama.cpp 公開ヘッダが明示する

tank/docs-ref/llama/llama.cpp/include/llama.h:286-291

286 | // NULL-terminated list of buffer types to use for tensors that match a pattern
287 | const struct llama_model_tensor_buft_override * tensor_buft_overrides;
288 |
289 | int32_t n_gpu_layers; // number of layers to store in VRAM, a negative value means all layers
290 | enum llama_split_mode split_mode; // how to split the model across multiple GPUs

Strict reading: the official llama.cpp header defines `n_gpu_layers` as “number of layers to store in VRAM.” The next field, `split_mode`, is the one that talks about multiple GPUs. 厳密な読み: llama.cpp の公式ヘッダは `n_gpu_layers` を「VRAMに置く層数」と定義している。複数GPUについて語るのは、その次の `split_mode` フィールドである。

Simple wordsやさしい言い方

This is the clearest contract in the whole chain.これはチェーン全体で最も強い定義文。
Layers and multi-GPU are written as different things in different fields.層数と multi-GPU は、別のフィールドに別の概念として書かれている。

Public API documentation closes the argument: `num_gpu` maps to layer count. 公開API文書が議論を閉じる: `num_gpu` は層数に対応する。

10

Point 9 — Supplemental semantics for negative valuesポイント9 — 負値の意味を補足する実装コメント

tank/docs-ref/llama/llama.cpp/common/common.h:378-384

378 | // offload params
379 | std::vector<ggml_backend_dev_t> devices; // devices to use for offloading
380 |
381 | int32_t n_gpu_layers       = -1;   // number of layers to store in VRAM, -1 is auto, <= -2 is all
382 | int32_t main_gpu           = 0;    // the GPU that is used for scratch and small tensors
383 | float   tensor_split[128]  = {0};  // how split tensors should be distributed across GPUs

Strict reading: this common parameter struct repeats the same meaning and adds the policy for negative values: `-1` means auto, `<= -2` means all layers. Again, this is layer-offload policy, not device count. 厳密な読み: この共通パラメータ構造体は同じ意味を繰り返し、さらに負値の方針を補う: `-1` は auto、`<= -2` は全層。これも GPU 枚数ではなく、層 offload 方針の意味である。

Simple wordsやさしい言い方

Negative values are special commands, not “minus GPUs.”負の値は特別な指示であって、「マイナス枚のGPU」ではない。
`-1` says “decide automatically,” and stronger negative values can mean “put all layers there.”`-1` は「自動で決める」、さらに小さい負値は「全層を載せる」を意味しうる。

The negative-value behavior only makes sense if the parameter means layer count. 負値の振る舞いは、このパラメータが層数を意味すると考えたときにだけ自然に理解できる。

11

Conclusion — Why this matters for the experimentまとめ — これが実験解釈で重要な理由

Misreading誤読	Why it is wrongなぜ誤りか	Correct reading正しい読み
`num_gpu=2` means two GPUs`num_gpu=2` は GPU 2枚	all traced code points speak about layers / VRAM, not GPU count追跡した全コード点が layers / VRAM を語っており、GPU枚数ではない	offload two layers to GPU2層をGPUへ offload
`num_gpu=-1` means use all GPUs`num_gpu=-1` は全GPU使用	comments define `-1` as dynamic / autoコメントが `-1` を dynamic / auto と定義	automatic offload decisionoffload 方針を自動決定

Experimental consequence: when Vulkan crashes at `num_gpu>=1`, the careful claim is not “multi-GPU failed.” The careful claim is: “Vulkan failed when one or more model layers were offloaded to GPU in this tested setup.” 実験上の帰結: Vulkan が `num_gpu>=1` で落ちたとき、慎重な主張は「multi-GPU が失敗した」ではない。正しくは「この実験条件では、1層以上をGPUへ offload したときに Vulkan が失敗した」である。

Easy explanationやさしい解説

If you misread the knob, you will misread the crash.つまみの意味を読み違えると、クラッシュの意味も読み違える。
This trace protects the poster claim from that mistake.このトレースは、その誤解からポスターの主張を守るための補助資料である。

Nine independent source points converge on one conclusion: `num_gpu` controls layer offload count, not GPU device count. 9つの独立したソース根拠は1つの結論に収束する: `num_gpu` は GPU 枚数ではなく、offload 層数を制御する。