# LLM Providers
pytest-llm-report supports multiple LLM providers for generating test annotations.
## Provider: none (default)
No LLM calls. Reports contain test results and coverage only.
## Provider: ollama
Local LLM using Ollama.
### Setup

- Install Ollama
- Pull a model:

  ```bash
  ollama pull llama3.2
  ```

- Configure:

  ```toml
  [tool.pytest_llm_report]
  provider = "ollama"
  model = "llama3.2"
  ollama_host = "http://127.0.0.1:11434"
  ```
### Recommended models

| Model | Size | Speed | Quality |
|---|---|---|---|
| `llama3.2` | 2GB | Fast | Good |
| `qwen2.5-coder:7b` | 4GB | Medium | Better |
| `qwen2.5-coder:14b` | 8GB | Slow | Best |
> **Note:** Ollama automatically requests JSON-formatted output for compatible models, improving parsing reliability.
## Provider: litellm
Cloud LLMs via LiteLLM.
### Setup

- Set your API key.
- Configure the plugin.
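A minimal sketch of these two steps, assuming an OpenAI model routed through LiteLLM (the environment variable name depends on the upstream provider; `OPENAI_API_KEY` is shown as one common choice):

```bash
# Key name depends on the upstream provider (assumption: OpenAI)
export OPENAI_API_KEY="sk-..."
```

```toml
[tool.pytest_llm_report]
provider = "litellm"
model = "gpt-4o-mini"
```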
### Supported models

- OpenAI: `gpt-4o-mini`, `gpt-4o`
- Anthropic: `claude-3-haiku-20240307`, `claude-3-5-sonnet-20241022`
- Many more via LiteLLM
> **Note:** OpenAI-compatible models automatically use `response_format: json_object` for structured output.
### Using a LiteLLM Proxy Server
For corporate environments with a LiteLLM AI proxy:
```toml
[tool.pytest_llm_report]
provider = "litellm"
model = "gpt-4o-mini"
litellm_api_base = "https://proxy.corp.com/v1"
litellm_api_key = "your-static-key"  # Optional, if not using env var
```
### Dynamic Token Refresh
If your proxy requires tokens that expire (e.g., OIDC/Okta tokens):
```toml
[tool.pytest_llm_report]
provider = "litellm"
model = "gpt-4o-mini"
litellm_api_base = "https://proxy.corp.com/v1"
litellm_token_refresh_command = "your-token-cli get-token"
litellm_token_refresh_interval = 3300  # Refresh before 60-minute expiry
```
> [!WARNING]
> For security reasons, the refresh command is executed directly, not via a shell (`shell=False`). This means you cannot use pipes (`|`), redirection (`>`), or environment variable expansion (`$VAR`) directly in the command string.
>
> If you need complex logic (e.g., piping output), create a wrapper script:
>
> 1. Create `get_token.sh` (and `chmod +x` it).
> 2. Configure the plugin to use it.
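For example, a wrapper might look like the following sketch; `your-token-cli` is a placeholder, and the demo pipeline stands in for whatever command actually produces your token:

```shell
#!/usr/bin/env bash
# get_token.sh -- wraps a pipeline so the plugin can exec it directly
# (shell=False). Replace the demo line with your real pipeline, e.g.:
#   your-token-cli get-token --json | jq -r '.token'
set -euo pipefail

# Demo pipeline: extract the token field from a JSON blob on stdin.
echo '{"token": "demo-token"}' \
  | python3 -c 'import json, sys; print(json.load(sys.stdin)["token"])'
```

The config then points at the script:

```toml
litellm_token_refresh_command = "./get_token.sh"
```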
### Configuration Options

| Option | Default | Description |
|---|---|---|
| `litellm_api_base` | `None` | Custom proxy URL |
| `litellm_api_key` | `None` | Static API key override |
| `litellm_token_refresh_command` | `None` | CLI command to get a fresh token |
| `litellm_token_refresh_interval` | `3300` | Seconds before token refresh (55 min) |
| `litellm_token_output_format` | `"text"` | Output parsing: `"text"` or `"json"` |
| `litellm_token_json_key` | `"token"` | JSON key when format is `"json"` |
### Token Command Requirements

- Output the token to stdout (logs can go to stderr)
- Exit with code 0 on success
- For `"text"` format: the last non-empty line is the token
- For `"json"` format: stdout is JSON, and the token is extracted from the specified key
### Automatic 401 Retry

If a request fails with 401 (expired token), the plugin automatically:

1. Invalidates the cached token
2. Fetches a new token using the refresh command
3. Retries the request once
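The flow can be sketched like this (illustrative only; `send_request` and `fetch_token` are hypothetical stand-ins for the plugin's internals):

```python
class Unauthorized(Exception):
    """Stand-in for an HTTP 401 response."""

def request_with_refresh(send_request, fetch_token, cache):
    """Send a request, refreshing the token and retrying once on 401."""
    if "token" not in cache:
        cache["token"] = fetch_token()
    try:
        return send_request(cache["token"])
    except Unauthorized:
        cache.pop("token", None)             # 1. invalidate cached token
        cache["token"] = fetch_token()       # 2. fetch a new token
        return send_request(cache["token"])  # 3. retry once; errors propagate
```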
## Provider: gemini
Cloud LLMs via the Gemini API.
### Setup

- Set your API key.
- Configure the plugin.
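A minimal sketch (the `GEMINI_API_KEY` variable name is an assumption based on the Gemini API's common convention):

```bash
export GEMINI_API_KEY="..."
```

```toml
[tool.pytest_llm_report]
provider = "gemini"
model = "gemini-2.5-flash"
```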
### Supported models

- `gemini-2.5-flash`
- `gemini-2.5-pro`
- `gemini-2.0-flash-exp`
- `gemini-2.0-flash`
- `gemini-2.0-flash-001`
- `gemini-2.0-flash-exp-image-generation`
- `gemini-2.0-flash-lite-001`
- `gemini-2.0-flash-lite`
- `gemini-2.0-flash-lite-preview-02-05`
- `gemini-2.0-flash-lite-preview`
- `gemini-exp-1206`
- `gemini-2.5-flash-preview-tts`
- `gemini-2.5-pro-preview-tts`
- `gemini-flash-latest`
- `gemini-flash-lite-latest`
- `gemini-pro-latest`
- `gemini-2.5-flash-lite`
- `gemini-2.5-flash-image-preview`
- `gemini-2.5-flash-image`
- `gemini-2.5-flash-preview-09-2025`
- `gemini-2.5-flash-lite-preview-09-2025`
- `gemini-3-pro-preview`
- `gemini-3-flash-preview`
- `gemini-3-pro-image-preview`
- `nano-banana-pro-preview`
- `gemini-robotics-er-1.5-preview`
- `gemini-2.5-computer-use-preview-10-2025`
- `deep-research-pro-preview-12-2025`
- `gemma-3-1b-it`
- `gemma-3-4b-it`
- `gemma-3-12b-it`
- `gemma-3-27b-it`
- `gemma-3n-e4b-it`
- `gemma-3n-e2b-it`
> **Note:** Gemini models use `response_schema` with `response_mime_type: application/json` for guaranteed structured output.
### Rate limits
When Gemini is enabled, pytest-llm-report queries the Gemini model metadata to retrieve the requests-per-minute (RPM), tokens-per-minute (TPM), and requests-per-day (RPD) limits for the selected model. The plugin applies those limits automatically:
- RPM/TPM: waits until the next request is within the limit.
- RPD: skips annotation once the daily cap is reached (no waiting).
### Model rotation
If you specify model = "all" or a comma-separated list of models, the plugin
will automatically rotate between available models to maximize request throughput:
```toml
[tool.pytest_llm_report]
provider = "gemini"
model = "gemini-2.5-flash,gemini-2.0-flash,gemini-1.5-flash"
```
When a model reaches its rate limit, the plugin switches to the next available model. This is especially useful for exceeding free-tier daily limits by distributing requests across multiple models.
### Model recovery
For long-running test sessions (e.g., CI jobs spanning multiple days), models that hit their daily request limits will automatically recover after 24 hours. The plugin tracks when each model was exhausted and clears that state once the daily limit window has passed.
Additionally, the available model list is refreshed every 6 hours to pick up any new models that may have become available via the Gemini API.
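Rotation and daily-limit recovery together can be sketched as follows (hypothetical names; the plugin's bookkeeping may differ):

```python
import time

class ModelPool:
    """Rotate across models; exhausted models recover after 24 hours."""

    RECOVERY_SECONDS = 24 * 3600  # daily request-limit window

    def __init__(self, models):
        self.models = list(models)
        self.exhausted = {}  # model -> time it hit its daily cap

    def mark_exhausted(self, model, now=None):
        self.exhausted[model] = time.time() if now is None else now

    def pick(self, now=None):
        now = time.time() if now is None else now
        # Clear models whose daily-limit window has passed.
        for model, when in list(self.exhausted.items()):
            if now - when >= self.RECOVERY_SECONDS:
                del self.exhausted[model]
        for model in self.models:
            if model not in self.exhausted:
                return model
        return None  # every model is capped: annotation is skipped
```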
## Caching
LLM responses are cached to reduce API calls:
Clear cache: