# LLM Providers

pytest-llm-report supports multiple LLM providers for generating test annotations.

## Provider: none (default)

No LLM calls. Reports contain test results and coverage only.

```toml
[tool.pytest_llm_report]
provider = "none"
```

## Provider: ollama

Local LLM using Ollama.

### Setup

1. Install Ollama.
2. Pull a model: `ollama pull llama3.2`
3. Configure:

   ```toml
   [tool.pytest_llm_report]
   provider = "ollama"
   model = "llama3.2"
   ollama_host = "http://127.0.0.1:11434"
   ```

| Model | Size | Speed | Quality |
| --- | --- | --- | --- |
| `llama3.2` | 2GB | Fast | Good |
| `qwen2.5-coder:7b` | 4GB | Medium | Better |
| `qwen2.5-coder:14b` | 8GB | Slow | Best |
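
To check that the Ollama server is reachable at the configured host before running pytest, you can list the locally installed models via Ollama's REST API:

```bash
# List models available on the local Ollama server (default port 11434)
curl http://127.0.0.1:11434/api/tags
```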

> [!NOTE]
> Ollama automatically requests JSON-formatted output for compatible models, improving parsing reliability.

## Provider: litellm

Cloud LLMs via LiteLLM.

### Setup

1. Set API key:

   ```bash
   export OPENAI_API_KEY="sk-..."
   # or
   export ANTHROPIC_API_KEY="sk-ant-..."
   ```

2. Configure:

   ```toml
   [tool.pytest_llm_report]
   provider = "litellm"
   model = "gpt-4o-mini"
   ```

### Supported models

- OpenAI: `gpt-4o-mini`, `gpt-4o`
- Anthropic: `claude-3-haiku-20240307`, `claude-3-5-sonnet-20241022`
- Many more via LiteLLM
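
For example, switching to one of the Anthropic models listed above only requires a different model string (paired with the `ANTHROPIC_API_KEY` from step 1):

```toml
[tool.pytest_llm_report]
provider = "litellm"
model = "claude-3-5-sonnet-20241022"
```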

> [!NOTE]
> OpenAI-compatible models automatically use `response_format: json_object` for structured output.

### Using a LiteLLM Proxy Server

For corporate environments with a LiteLLM AI proxy:

```toml
[tool.pytest_llm_report]
provider = "litellm"
model = "gpt-4o-mini"
litellm_api_base = "https://proxy.corp.com/v1"
litellm_api_key = "your-static-key"  # Optional, if not using env var
```

#### Dynamic Token Refresh

If your proxy requires tokens that expire (e.g., OIDC/Okta tokens):

```toml
[tool.pytest_llm_report]
provider = "litellm"
model = "gpt-4o-mini"
litellm_api_base = "https://proxy.corp.com/v1"
litellm_token_refresh_command = "your-token-cli get-token"
litellm_token_refresh_interval = 3300  # Refresh before the 60-minute expiry
```

> [!WARNING]
> For security reasons, the refresh command is executed directly rather than via a shell (`shell=False`). This means you cannot use pipes (`|`), redirection (`>`), or environment variable expansion (`$VAR`) in the command string.

If you need complex logic (e.g., piping output), create a wrapper script:

1. Create `get_token.sh` (and `chmod +x` it):

   ```bash
   #!/bin/bash
   # Example: get a raw token and remove the trailing newline
   gcloud auth print-access-token | tr -d '\n'
   ```

2. Configure the plugin to use it:

   ```toml
   litellm_token_refresh_command = "./get_token.sh"
   ```

#### Configuration Options

| Option | Default | Description |
| --- | --- | --- |
| `litellm_api_base` | `None` | Custom proxy URL |
| `litellm_api_key` | `None` | Static API key override |
| `litellm_token_refresh_command` | `None` | CLI command to get a fresh token |
| `litellm_token_refresh_interval` | `3300` | Seconds before token refresh (55 min) |
| `litellm_token_output_format` | `"text"` | Output parsing: `"text"` or `"json"` |
| `litellm_token_json_key` | `"token"` | JSON key when format is `"json"` |

#### Token Command Requirements

- The token must be written to stdout (logs can go to stderr)
- The command must exit with code 0 on success
- For `"text"` format: the last non-empty line of stdout is the token
- For `"json"` format: stdout is parsed as JSON and the token is extracted from the configured key
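
As a sketch of the `"json"` mode, suppose your token CLI (the command name here is hypothetical) prints `{"token": "eyJ..."}` to stdout; the options from the table above would then be combined like this:

```toml
[tool.pytest_llm_report]
provider = "litellm"
model = "gpt-4o-mini"
litellm_api_base = "https://proxy.corp.com/v1"
# Hypothetical CLI that prints {"token": "..."} as JSON on stdout
litellm_token_refresh_command = "corp-auth issue-token --json"
litellm_token_output_format = "json"
litellm_token_json_key = "token"
```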

#### Automatic 401 Retry

If a request fails with 401 (expired token), the plugin automatically:

1. Invalidates the cached token
2. Fetches a new token using the refresh command
3. Retries the request once

## Provider: gemini

Cloud LLMs via the Gemini API.

### Setup

1. Set API key:

   ```bash
   export GEMINI_API_TOKEN="..."
   ```

2. Configure:

   ```toml
   [tool.pytest_llm_report]
   provider = "gemini"
   model = "gemini-2.5-flash"
   ```

### Supported models

- `gemini-2.5-flash`
- `gemini-2.5-pro`
- `gemini-2.0-flash-exp`
- `gemini-2.0-flash`
- `gemini-2.0-flash-001`
- `gemini-2.0-flash-exp-image-generation`
- `gemini-2.0-flash-lite-001`
- `gemini-2.0-flash-lite`
- `gemini-2.0-flash-lite-preview-02-05`
- `gemini-2.0-flash-lite-preview`
- `gemini-exp-1206`
- `gemini-2.5-flash-preview-tts`
- `gemini-2.5-pro-preview-tts`
- `gemini-flash-latest`
- `gemini-flash-lite-latest`
- `gemini-pro-latest`
- `gemini-2.5-flash-lite`
- `gemini-2.5-flash-image-preview`
- `gemini-2.5-flash-image`
- `gemini-2.5-flash-preview-09-2025`
- `gemini-2.5-flash-lite-preview-09-2025`
- `gemini-3-pro-preview`
- `gemini-3-flash-preview`
- `gemini-3-pro-image-preview`
- `nano-banana-pro-preview`
- `gemini-robotics-er-1.5-preview`
- `gemini-2.5-computer-use-preview-10-2025`
- `deep-research-pro-preview-12-2025`
- `gemma-3-1b-it`
- `gemma-3-4b-it`
- `gemma-3-12b-it`
- `gemma-3-27b-it`
- `gemma-3n-e4b-it`
- `gemma-3n-e2b-it`

> [!NOTE]
> Gemini models use `response_schema` with `response_mime_type: application/json` for guaranteed structured output.

### Rate limits

When Gemini is enabled, pytest-llm-report queries the Gemini model metadata to retrieve the requests-per-minute (RPM), tokens-per-minute (TPM), and requests-per-day (RPD) limits for the selected model. The plugin applies those limits automatically:

- RPM/TPM: waits until the next request is within the limit.
- RPD: skips annotation once the daily cap is reached (no waiting).

### Model rotation

If you specify `model = "all"` or a comma-separated list of models, the plugin will automatically rotate between available models to maximize request throughput:

```toml
[tool.pytest_llm_report]
provider = "gemini"
model = "gemini-2.5-flash,gemini-2.0-flash,gemini-2.0-flash-lite"
```

When a model reaches its rate limit, the plugin switches to the next available model. This is especially useful for exceeding free-tier daily limits by distributing requests across multiple models.
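
Alternatively, `model = "all"` rotates across every model the plugin discovers via the Gemini API, with no explicit list to maintain:

```toml
[tool.pytest_llm_report]
provider = "gemini"
model = "all"
```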

### Model recovery

For long-running test sessions (e.g., CI jobs spanning multiple days), models that hit their daily request limits will automatically recover after 24 hours. The plugin tracks when each model was exhausted and clears that state once the daily limit window has passed.

Additionally, the available model list is refreshed every 6 hours to pick up any new models that may have become available via the Gemini API.

## Caching

LLM responses are cached to reduce API calls:

```toml
[tool.pytest_llm_report]
cache_dir = ".pytest_llm_cache"
llm_cache_ttl_seconds = 86400  # 24 hours
```

Clear cache:

```bash
rm -rf .pytest_llm_cache
```