FAQ

How do I upgrade Ollama?

Ollama on macOS and Windows will automatically download updates. Click on the taskbar or menu bar item, then click "Restart to update" to apply the update. Updates can also be installed by downloading the latest version manually.

On Linux, re-run the install script:

curl -fsSL https://ollama.com/install.sh | sh

How can I view the logs?

Review the Troubleshooting documentation for more information about using logs.

Is my GPU compatible with Ollama?

Please refer to the GPU documentation.

How can I specify the context window size?

By default, Ollama uses a context window size of 2048 tokens.

To change this when using ollama run, use /set parameter:

/set parameter num_ctx 4096

When using the API, specify the num_ctx parameter:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "options": {
    "num_ctx": 4096
  }
}'

How can I tell if my model was loaded onto the GPU?

Use the ollama ps command to see which models are currently loaded into memory.

ollama ps

Output:

NAME          ID              SIZE     PROCESSOR    UNTIL
llama3:70b    bcfb190ca3a7    42 GB    100% GPU     4 minutes from now

The Processor column shows which memory the model was loaded into:

  • 100% GPU means the model was loaded entirely into the GPU
  • 100% CPU means the model was loaded entirely into system memory
  • 48%/52% CPU/GPU means the model was loaded partially onto both the GPU and into system memory

How do I configure the Ollama server?

The Ollama server can be configured with environment variables.

Setting environment variables on Mac

If Ollama is run as a macOS application, environment variables should be set using launchctl:

  1. For each environment variable, call launchctl setenv.

     launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
    
  2. Restart the Ollama application.

Setting environment variables on Linux

If Ollama is run as a systemd service, environment variables should be set using systemctl:

  1. Edit the systemd service by calling systemctl edit ollama.service. This will open an editor.

  2. For each environment variable, add a line Environment under the [Service] section:

     [Service]
     Environment="OLLAMA_HOST=0.0.0.0:11434"
    
  3. Save and exit.

  4. Reload systemd and restart Ollama:

    systemctl daemon-reload
    systemctl restart ollama
    

Setting environment variables on Windows

On Windows, Ollama inherits your user and system environment variables. A command-line alternative is sketched after the steps below.

  1. First, quit Ollama by clicking on it in the taskbar.

  2. Start Settings (Windows 11) or Control Panel (Windows 10) and search for environment variables.

  3. Click Edit environment variables for your account.

  4. Edit or create a new variable for your user account for OLLAMA_HOST, OLLAMA_MODELS, etc.

  5. Click OK/Apply to save.

  6. Start the Ollama application from the Windows Start menu.
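
As an alternative to the GUI steps above, a user-level variable can also be persisted from a PowerShell or Command Prompt window with setx. This is a minimal sketch; the value shown is only an example:

setx OLLAMA_HOST "0.0.0.0:11434"

setx only affects processes started afterwards, so quit and restart Ollama for the change to take effect.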

How do I use Ollama behind a proxy?

Ollama pulls models from the Internet and may require a proxy server to access those models. Use HTTPS_PROXY to redirect outbound requests through the proxy. Ensure the proxy certificate is installed as a system certificate. Refer to the sections above for how to set environment variables on your platform.

[!NOTE] Avoid setting HTTP_PROXY. Ollama does not use HTTP for model pulls, only HTTPS. Setting HTTP_PROXY may break client connections to the server.
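
For example, when starting the server manually from a Linux or macOS shell, the variable can be set inline (a sketch; proxy.example.com is a placeholder address):

HTTPS_PROXY=https://proxy.example.com ollama serve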

How do I use Ollama in Docker behind a proxy?

The Ollama Docker container image can be configured to use a proxy by passing -e HTTPS_PROXY=https://proxy.example.com when starting the container.

Alternatively, the Docker daemon can be configured to use a proxy. Instructions are available for Docker Desktop on macOS, Windows, and Linux, and for the Docker daemon with systemd.

Ensure the certificate is installed as a system certificate when using HTTPS. This may require a new Docker image when using a self-signed certificate.

FROM ollama/ollama
COPY my-ca.pem /usr/local/share/ca-certificates/my-ca.crt
RUN update-ca-certificates

Build and run this image:

docker build -t ollama-with-ca .
docker run -d -e HTTPS_PROXY=https://my.proxy.example.com -p 11434:11434 ollama-with-ca

Does Ollama send my prompts and answers back to ollama.com?

No. Ollama runs locally, and conversation data does not leave your machine.

How can I expose Ollama on my network?

Ollama binds 127.0.0.1 port 11434 by default. Change the bind address with the OLLAMA_HOST environment variable.

Refer to the section above for how to set environment variables on your platform.
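
As a minimal sketch for a server started manually from a shell, the bind address can be set inline:

OLLAMA_HOST=0.0.0.0:11434 ollama serve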

How can I use Ollama with a proxy server?

Ollama runs an HTTP server and can be exposed using a proxy server such as Nginx. To do so, configure the proxy to forward requests and optionally set required headers (if not exposing Ollama on the network). For example, with Nginx:

server {
    listen 80;
    server_name example.com;  # Replace with your domain or IP
    location / {
        proxy_pass http://localhost:11434;
        proxy_set_header Host localhost:11434;
    }
}

How can I use Ollama with ngrok?

Ollama can be accessed using a range of tunneling tools. For example, with Ngrok:

ngrok http 11434 --host-header="localhost:11434"

How can I use Ollama with Cloudflare Tunnel?

To use Ollama with Cloudflare Tunnel, use the --url and --http-host-header flags:

cloudflared tunnel --url http://localhost:11434 --http-host-header="localhost:11434"

How can I allow additional web origins to access Ollama?

Ollama allows cross-origin requests from 127.0.0.1 and 0.0.0.0 by default. Additional origins can be configured with OLLAMA_ORIGINS.

Refer to the section above for how to set environment variables on your platform.
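
For example, to also allow requests from a web app served from other origins, a comma-separated list can be passed when starting the server manually (a sketch; the origins below are placeholders):

OLLAMA_ORIGINS="https://app.example.com,http://10.0.0.5:3000" ollama serve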

Where are models stored?

  • macOS: ~/.ollama/models
  • Linux: /usr/share/ollama/.ollama/models
  • Windows: C:\Users\%username%\.ollama\models

How do I set them to a different location?

If a different directory needs to be used, set the environment variable OLLAMA_MODELS to the chosen directory.

Note: on Linux using the standard installer, the ollama user needs read and write access to the specified directory. To assign the directory to the ollama user run sudo chown -R ollama:ollama <directory>.

Refer to the section above for how to set environment variables on your platform.
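
As a sketch for Linux with the systemd service, add the variable to the override created by systemctl edit ollama.service (/data/ollama/models is a placeholder path):

[Service]
Environment="OLLAMA_MODELS=/data/ollama/models"

Then grant the ollama user access to the placeholder directory:

sudo chown -R ollama:ollama /data/ollama/models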

How can I use Ollama in Visual Studio Code?

There is already a large collection of plugins available for VSCode as well as other editors that leverage Ollama. See the list of extensions & plugins at the bottom of the main repository readme.

How do I use Ollama with GPU acceleration in Docker?

The Ollama Docker container can be configured with GPU acceleration in Linux or Windows (with WSL2). This requires the nvidia-container-toolkit. See ollama/ollama for more details.
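
A typical invocation looks like the following (a sketch, assuming the NVIDIA Container Toolkit is already installed):

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama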

GPU acceleration is not available for Docker Desktop in macOS due to the lack of GPU passthrough and emulation.

Why is networking slow in WSL2 on Windows 10?

This can impact both installing Ollama and downloading models.

Open Control Panel > Networking and Internet > View network status and tasks and click on Change adapter settings on the left panel. Find the vEthernet (WSL) adapter, right click and select Properties. Click on Configure and open the Advanced tab. Search through each of the properties until you find Large Send Offload Version 2 (IPv4) and Large Send Offload Version 2 (IPv6). Disable both of these properties.
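
Alternatively, the same change can be made from an elevated PowerShell prompt (a sketch; the adapter name may differ on your system):

# Disable Large Send Offload v2 for IPv4 and IPv6 on the WSL virtual adapter
Disable-NetAdapterLso -Name "vEthernet (WSL)" -IPv4 -IPv6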

How can I preload a model into Ollama to get faster response times?

If you are using the API you can preload a model by sending the Ollama server an empty request. This works with both the /api/generate and /api/chat API endpoints.

To preload the mistral model using the generate endpoint, use:

curl http://localhost:11434/api/generate -d '{"model": "mistral"}'

To use the chat completions endpoint, use:

curl http://localhost:11434/api/chat -d '{"model": "mistral"}'

To preload a model using the CLI, use the command:

ollama run llama3.2 ""

How do I keep a model loaded in memory or make it unload immediately?

By default models are kept in memory for 5 minutes before being unloaded. This allows for quicker response times if you're making numerous requests to the LLM. If you want to immediately unload a model from memory, use the ollama stop command:

ollama stop llama3.2

If you're using the API, use the keep_alive parameter with the /api/generate and /api/chat endpoints to set the amount of time that a model stays in memory. The keep_alive parameter can be set to:

  • a duration string (such as "10m" or "24h")
  • a number in seconds (such as 3600)
  • any negative number which will keep the model loaded in memory (e.g. -1 or "-1m")
  • '0' which will unload the model immediately after generating a response

For example, to preload a model and leave it in memory use:

curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": -1}'

To unload the model and free up memory use:

curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": 0}'

Alternatively, you can change the amount of time all models are loaded into memory by setting the OLLAMA_KEEP_ALIVE environment variable when starting the Ollama server. The OLLAMA_KEEP_ALIVE variable uses the same parameter types as the keep_alive parameter types mentioned above. Refer to the section explaining how to configure the Ollama server to correctly set the environment variable.

The keep_alive API parameter with the /api/generate and /api/chat API endpoints will override the OLLAMA_KEEP_ALIVE setting.

How do I manage the maximum number of requests the Ollama server can queue?

If too many requests are sent to the server, it will respond with a 503 error indicating the server is overloaded. You can adjust how many requests may be queued by setting OLLAMA_MAX_QUEUE.
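
For example, to raise the queue limit when starting the server manually (a sketch; 1024 is an illustrative value):

OLLAMA_MAX_QUEUE=1024 ollama serve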

How does Ollama handle concurrent requests?

Ollama supports two levels of concurrent processing. If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time. For a given model, if there is sufficient available memory when the model is loaded, it is configured to allow parallel request processing.

If there is insufficient available memory to load a new model request while one or more models are already loaded, all new requests will be queued until the new model can be loaded. As prior models become idle, one or more will be unloaded to make room for the new model. Queued requests will be processed in order. When using GPU inference new models must be able to completely fit in VRAM to allow concurrent model loads.

Parallel request processing for a given model results in increasing the context size by the number of parallel requests. For example, a 2K context with 4 parallel requests will result in an 8K context and additional memory allocation.

The following server settings may be used to adjust how Ollama handles concurrent requests on most platforms:

  • OLLAMA_MAX_LOADED_MODELS - The maximum number of models that can be loaded concurrently provided they fit in available memory. The default is 3 * the number of GPUs or 3 for CPU inference.
  • OLLAMA_NUM_PARALLEL - The maximum number of parallel requests each model will process at the same time. The default will auto-select either 4 or 1 based on available memory.
  • OLLAMA_MAX_QUEUE - The maximum number of requests Ollama will queue when busy before rejecting additional requests. The default is 512

Note: Windows with Radeon GPUs currently defaults to 1 model maximum due to limitations in ROCm v5.7 for available VRAM reporting. Once ROCm v6.2 is available, Windows Radeon will follow the defaults above. You may enable concurrent model loads on Radeon on Windows, but ensure you don't load more models than will fit into your GPUs' VRAM.
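
As a sketch, these settings can be combined when starting the server manually (the values below are illustrative, not recommendations):

OLLAMA_MAX_LOADED_MODELS=2 OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_QUEUE=512 ollama serve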

How does Ollama load models on multiple GPUs?

When loading a new model, Ollama evaluates the required VRAM for the model against what is currently available. If the model will entirely fit on any single GPU, Ollama will load the model on that GPU. This typically provides the best performance as it reduces the amount of data transferring across the PCI bus during inference. If the model does not fit entirely on one GPU, then it will be spread across all the available GPUs.

How can I enable Flash Attention?

Flash Attention is a feature of most modern models that can significantly reduce memory usage as the context size grows. To enable Flash Attention, set the OLLAMA_FLASH_ATTENTION environment variable to 1 when starting the Ollama server.
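
For example, when starting the server manually from a shell:

OLLAMA_FLASH_ATTENTION=1 ollama serve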

How can I set the quantization type for the K/V cache?

The K/V context cache can be quantized to significantly reduce memory usage when Flash Attention is enabled.

To use quantized K/V cache with Ollama you can set the following environment variable:

  • OLLAMA_KV_CACHE_TYPE - The quantization type for the K/V cache. Default is f16.

Note: Currently this is a global option - meaning all models will run with the specified quantization type.

The currently available K/V cache quantization types are:

  • f16 - high precision and memory usage (default).
  • q8_0 - 8-bit quantization, uses approximately 1/2 the memory of f16 with a very small loss in precision, this usually has no noticeable impact on the model's quality (recommended if not using f16).
  • q4_0 - 4-bit quantization, uses approximately 1/4 the memory of f16 with a small-medium loss in precision that may be more noticeable at higher context sizes.

How much the cache quantization impacts the model's response quality will depend on the model and the task. Models that have a high GQA count (e.g. Qwen2) may see a larger impact on precision from quantization than models with a low GQA count.

You may need to experiment with different quantization types to find the best balance between memory usage and quality.
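
For example, to run the server with Flash Attention enabled and an 8-bit K/V cache (a minimal sketch for a manually started server):

OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve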
