11 pages total: 1 overview + 9 source points + 1 conclusion
Question: does `num_gpu` mean GPU count, or layer-offload count?
| Point | File / lines | What it says |
|---|---|---|
| 1 | _types.py:104-110 | `num_gpu` exists as a load-time option |
| 2 | _client.py:281-305 | the client forwards `options` unchanged |
| 3 | api/types.go:600-608, 1071-1076 | the Go side stores `NumGPU`; the default `-1` means "decide dynamically" |
| 4 | cmd/interactive.go:108-114 | the CLI help literally says "number of layers" |
| 5 | llm/server.go:992, 1063-1078 | the server treats it as `requestedLayers` |
| 6 | runner.go:906-925 | the runner sums layer counts |
| 7 | llama.go:260-267 | the cgo bridge writes it into `n_gpu_layers` |
| 8 | llama.h:286-291 | the public API defines it as the number of layers to store in VRAM |
| 9 | common.h:378-384 | supplemental semantics: `-1` auto, `<= -2` all |
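Read end to end, the nine points form one pipeline: the Python client forwards `num_gpu` untouched, the Go server clamps it against the model's layer count, and llama.cpp receives it as `n_gpu_layers`. A minimal sketch of that flow (the helper name `pipeline` is hypothetical, not the real call chain):

```python
def pipeline(num_gpu, total_layers):
    """Follow `num_gpu` from client option to llama.cpp's n_gpu_layers (sketch)."""
    options = {"num_gpu": num_gpu}            # _types.py: a load-time option
    forwarded = options["num_gpu"]            # _client.py: sent as-is, no reinterpretation
    requested = min(total_layers, forwarded)  # server.go:1074 clamp; -1 passes through min()
    return requested                          # llama.go: written into n_gpu_layers

assert pipeline(2, 32) == 2    # offload two *layers*, not two GPUs
assert pipeline(-1, 32) == -1  # -1 survives: llama.cpp decides dynamically
```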
```python
104 | class Options(SubscriptableBaseModel):
105 |     # load time options
106 |     numa: Optional[bool] = None
107 |     num_ctx: Optional[int] = None
108 |     num_batch: Optional[int] = None
109 |     num_gpu: Optional[int] = None
110 |     main_gpu: Optional[int] = None
```
```python
281 | return self._request(
282 |     GenerateResponse,
283 |     'POST',
284 |     '/api/generate',
285 |     json=GenerateRequest(
...
296 |         images=list(_copy_images(images)) if images else None,
297 |         options=options,
298 |         keep_alive=keep_alive,
299 |         width=width,
300 |         height=height,
301 |         steps=steps,
302 |     ).model_dump(exclude_none=True),
```
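Point 2 in miniature: the client serializes whatever `options` it was given and never reinterprets `num_gpu`. A self-contained mimic of the `model_dump(exclude_none=True)` step, using plain dicts instead of the real pydantic model:

```python
def dump_request(options=None, keep_alive=None):
    """Build a JSON body the way GenerateRequest.model_dump(exclude_none=True) would:
    unset (None) fields are dropped, everything else is passed through verbatim."""
    body = {"options": options, "keep_alive": keep_alive}
    return {k: v for k, v in body.items() if v is not None}

# `num_gpu` rides through as an opaque number; its meaning is decided server-side.
assert dump_request(options={"num_gpu": 2}) == {"options": {"num_gpu": 2}}
```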
```go
 600 | // Runner options which must be set when the model is loaded into memory
 601 | type Runner struct {
 602 |     NumCtx   int `json:"num_ctx,omitempty"`
 603 |     NumBatch int `json:"num_batch,omitempty"`
 604 |     NumGPU   int `json:"num_gpu,omitempty"`
 605 |     MainGPU  int `json:"main_gpu,omitempty"`
...
1071 | Runner: Runner{
1072 |     // options set when the model is loaded
1073 |     NumCtx:   int(envconfig.ContextLength()),
1074 |     NumBatch: 512,
1075 |     NumGPU:   -1, // -1 here indicates that NumGPU should be set dynamically
```
```go
108 | fmt.Fprintln(os.Stderr, "  /set parameter num_ctx          Set the context size")
109 | fmt.Fprintln(os.Stderr, "  /set parameter temperature      Set creativity level")
110 | fmt.Fprintln(os.Stderr, "  /set parameter repeat_penalty   How strongly to penalize repetitions")
111 | fmt.Fprintln(os.Stderr, "  /set parameter repeat_last_n    Set how far back to look for repetitions")
112 | fmt.Fprintln(os.Stderr, "  /set parameter num_gpu          The number of layers to send to the GPU")
113 | fmt.Fprintln(os.Stderr, "  /set parameter stop ...         Set the stop parameters")
```
```go
 992 | libraryGpuLayers := assignLayers(layers, gl, requireFull, s.options.NumGPU, lastUsedGPU)
...
1063 | func assignLayers(layers []uint64, gpus []ml.DeviceInfo, requireFull bool, requestedLayers int, lastUsedGPU int) (gpuLayers ml.GPULayersList) {
1064 |     // If the user is manually overriding parameters, treat all GPUs equally so they split according to VRAM
1065 |     if requestedLayers >= 0 || envconfig.SchedSpread() {
...
1073 |     // requestedLayers may be -1 if nothing was requested
1074 |     requestedLayers = min(len(layers), requestedLayers)
```
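Point 5's clamp is the only transformation the server applies, and notably `min` lets `-1` pass through untouched. A Python rendering of that one line of Go:

```python
def clamp_requested_layers(requested, model_layers):
    """server.go:1074 — requestedLayers = min(len(layers), requestedLayers)."""
    return min(model_layers, requested)

assert clamp_requested_layers(99, 32) == 32   # more than the model has: use them all
assert clamp_requested_layers(2, 32) == 2     # a layer budget, not a GPU count
assert clamp_requested_layers(-1, 32) == -1   # "nothing requested" survives as -1
```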
```go
906 | numGPU := 0
907 | var tensorSplit []float32
908 | var llamaIDs []uint64
...
912 | for _, layers := range req.GPULayers {
913 |     for i := range gpuIDs {
914 |         if gpuIDs[i].DeviceID == layers.DeviceID {
915 |             numGPU += len(layers.Layers)
916 |             tensorSplit = append(tensorSplit, float32(len(layers.Layers)))
917 |             llamaIDs = append(llamaIDs, gpuIDs[i].LlamaID)
...
922 | params := llama.ModelParams{
923 |     Devices:      llamaIDs,
924 |     NumGpuLayers: numGPU,
```
```go
260 |     callback(float32(progress))
261 |     return true
262 | }
263 |
264 | func LoadModelFromFile(modelPath string, params ModelParams) (*Model, error) {
265 |     cparams := C.llama_model_default_params()
266 |     cparams.n_gpu_layers = C.int(params.NumGpuLayers)
267 |     cparams.main_gpu = C.int32_t(params.MainGpu)
```
```c
286 | // NULL-terminated list of buffer types to use for tensors that match a pattern
287 | const struct llama_model_tensor_buft_override * tensor_buft_overrides;
288 |
289 | int32_t n_gpu_layers; // number of layers to store in VRAM, a negative value means all layers
290 | enum llama_split_mode split_mode; // how to split the model across multiple GPUs
```
```cpp
378 | // offload params
379 | std::vector<ggml_backend_dev_t> devices; // devices to use for offloading
380 |
381 | int32_t n_gpu_layers = -1; // number of layers to store in VRAM, -1 is auto, <= -2 is all
382 | int32_t main_gpu = 0; // the GPU that is used for scratch and small tensors
383 | float tensor_split[128] = {0}; // how split tensors should be distributed across GPUs
```
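Points 8-9 pin down the value's final meaning. A small resolver following the `common.h` comment (`-1` auto, `<= -2` all, otherwise a layer count); the auto heuristic itself is llama.cpp-internal, so it is left symbolic here:

```python
def resolve_n_gpu_layers(n_gpu_layers, total_layers):
    """common.h:381 — -1 is auto, <= -2 is all, >= 0 is a VRAM layer count."""
    if n_gpu_layers == -1:
        return "auto"                          # decided dynamically inside llama.cpp
    if n_gpu_layers <= -2:
        return total_layers                    # offload every layer
    return min(n_gpu_layers, total_layers)     # offload at most this many layers

assert resolve_n_gpu_layers(-1, 32) == "auto"
assert resolve_n_gpu_layers(-2, 32) == 32
assert resolve_n_gpu_layers(2, 32) == 2        # two layers in VRAM, not two GPUs
```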
| Misreading | Why it is wrong | Correct reading |
|---|---|---|
| `num_gpu=2` means two GPUs | every traced code point speaks of layers / VRAM, never of GPU count | offload two layers to the GPU |
| `num_gpu=-1` means use all GPUs | the comments define `-1` as dynamic / auto | decide the offload split automatically |