Skip to main content

vLLM

vLLM's OpenAI-compatible server implements /v1/chat/completions, /v1/completions, /v1/responses, and /v1/embeddings. Promptfoo connects to it through the OpenAI provider by changing apiBaseUrl.

Use this page when:

  • vLLM is the model under test
  • vLLM is the local LLM-as-a-judge provider for model-graded assertions such as llm-rubric, g-eval, factuality, answer-relevance, context-*, or select-best
  • your vLLM model returns a separate reasoning field and promptfoo should grade only the final content

Start a vLLM server

vllm serve Qwen/Qwen3-8B \
--host 0.0.0.0 \
--port 8000 \
--served-model-name qwen3-8b \
--api-key token-abc123

Then verify the server directly:

curl http://localhost:8000/v1/chat/completions \
-H 'Authorization: Bearer token-abc123' \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen3-8b",
"messages": [{"role": "user", "content": "Reply with OK"}],
"max_tokens": 8
}'

apiBaseUrl should be the /v1 root. Promptfoo appends /chat/completions, /completions, or /embeddings depending on the provider type.

For a wiring-only smoke test, a tiny reasoning model such as Qwen/Qwen3-0.6B can verify that promptfoo reaches vLLM and that showThinking behaves correctly. Do not use a tiny model as a real judge; use it only to test the endpoint, parser, and config shape:

vllm serve Qwen/Qwen3-0.6B \
--host 0.0.0.0 \
--port 8000 \
--served-model-name llm_judge \
--reasoning-parser qwen3 \
--max-model-len 4096 \
--api-key token-abc123

Example judge models

Keep --served-model-name short and stable; promptfoo uses that alias in openai:chat:<served-model-name>.

GPT-OSS

For GPT-OSS, use the Hugging Face model name as the vLLM model and expose a short served alias. The example below uses openai/gpt-oss-20b; use a larger GPT-OSS checkpoint the same way when your host has enough memory:

vllm serve openai/gpt-oss-20b \
--host 0.0.0.0 \
--port 8000 \
--served-model-name gpt-oss-20b \
--api-key token-abc123

Prefer a Linux CUDA or ROCm host for GPT-OSS with vLLM. If CPU or ARM serving fails, check the vLLM GPT-OSS recipe for the backend support notes that match your vLLM release.

GLM-4.7

For GLM-4.7, use a vLLM and Transformers combination that supports the exact GLM checkpoint you are serving. The vLLM GLM recipe keeps the install guidance for GLM releases:

vllm serve zai-org/GLM-4.7-FP8 \
--host 0.0.0.0 \
--port 8000 \
--served-model-name glm-4.7 \
--tensor-parallel-size 4 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--api-key token-abc123

For zai-org/GLM-4.7-Flash, use a served name such as glm-4.7-flash. If you do not need tool calling, the tool flags are optional for ordinary model-graded assertions; keep --reasoning-parser glm45 when you want vLLM to split reasoning from final content.

Use vLLM as the target model

promptfooconfig.yaml
prompts:
- '{{question}}'

providers:
- id: openai:chat:qwen3-8b
label: vLLM qwen3-8b
config:
apiBaseUrl: http://localhost:8000/v1
apiKey: token-abc123
temperature: 0.2
max_tokens: 512

tests:
- vars:
question: 'What is the capital of France?'
assert:
- type: contains
value: Paris

For completions models, use openai:completion:<served-model-name> instead of openai:chat:<served-model-name>.

You can also put the endpoint in an environment variable:

export OPENAI_BASE_URL=http://localhost:8000/v1
export OPENAI_API_KEY=token-abc123

Use vLLM as an LLM judge

Model-graded assertions call a separate grading provider. Configure that provider under defaultTest.options.provider when every model-graded assertion should use the same vLLM judge:

promptfooconfig.yaml
prompts:
- '{{answer}}'

providers:
# System under test. This can be any provider.
- echo

defaultTest:
options:
provider:
id: openai:chat:llm_judge
label: Judge: llm_judge @ vLLM
config:
apiBaseUrl: http://localhost:8000/v1
apiKey: token-abc123
temperature: 0
max_tokens: 10000
showThinking: false

tests:
- vars:
answer: 'Use the Forgot password link and verify by email or SMS.'
assert:
- type: llm-rubric
value: |
Pass if the answer explains how to reset a password and mentions a verification step.

Do not repeat provider: openai:chat:llm_judge on an assertion when the full provider object already lives in defaultTest.options.provider. An assertion-level provider overrides the default provider object, so the apiBaseUrl, apiKey, showThinking, and other config values above will not be used.

If only one assertion should use vLLM, put the full object on that assertion:

assert:
- type: llm-rubric
value: 'Answer gives correct password reset instructions'
provider:
id: openai:chat:llm_judge
config:
apiBaseUrl: http://localhost:8000/v1
apiKey: token-abc123
temperature: 0
max_tokens: 10000
showThinking: false

Thinking and reasoning models

When vLLM is started with a reasoning parser, responses may include:

  • message.reasoning_content or message.reasoning: hidden reasoning extracted by vLLM
  • message.content: final answer

Promptfoo's OpenAI-compatible chat provider includes reasoning in the returned output by default:

Thinking: <reasoning>

<content>

That is useful when vLLM is the target model, because assertions can inspect the full visible output. It is usually wrong when vLLM is the judge, because model-graded assertions consume the judge output as the material to parse, embed, classify, or score. If the reasoning text contains JSON-looking scratchpad content, attribution markers, candidate sentences, or numeric choices, promptfoo can use that scratchpad before the final answer.

Set showThinking: false on vLLM judge providers so promptfoo discards reasoning fields and parses only content.

This depends on vLLM successfully splitting the response. If the request stops before the model closes its thinking block, vLLM can return raw <think>... text in message.content instead of reasoning_content. In that case showThinking: false cannot distinguish scratchpad text from final content. Increase the server --max-model-len and provider max_tokens, or disable thinking for judge calls with chat_template_kwargs.enable_thinking: false.

search-rubric is special because it requires web search. A plain vLLM chat server is not a web-search-capable grader; promptfoo will prefer or load a search-capable provider instead. The showThinking guidance applies to the search provider that actually grades the assertion.

This applies to every model-graded assertion that consumes text from the judge. JSON-first metrics can parse scratchpad JSON, RAG metrics can score scratchpad sentences or attribution markers, answer-relevance can embed generated questions with Thinking: prepended, and select-best can read a scratchpad number as the winning index.

Disable thinking at the vLLM API level

showThinking: false only changes what promptfoo reads from the response; the model may still spend tokens thinking. For small local judges and CI smoke tests, disabling thinking is often faster and avoids truncated <think> content. Qwen3 and GLM chat templates support disabling thinking per request through chat_template_kwargs:

defaultTest:
options:
provider:
id: openai:chat:llm_judge
config:
apiBaseUrl: http://localhost:8000/v1
apiKey: token-abc123
showThinking: false
passthrough:
chat_template_kwargs:
enable_thinking: false

For GPT-OSS-style chat completions, request-level reasoning controls use different fields:

defaultTest:
options:
provider:
id: openai:chat:gpt-oss-20b
config:
apiBaseUrl: http://localhost:8000/v1
apiKey: token-abc123
showThinking: false
passthrough:
include_reasoning: false
reasoning_effort: low

Keep showThinking: false even when you pass model-specific controls such as include_reasoning: false or chat_template_kwargs.enable_thinking: false. Those controls save tokens when vLLM honors them; showThinking: false is the promptfoo-side guard.

Provider maps for text and embeddings

Some assertions need a text judge and an embedding model. Use a provider map when a single eval uses both text-graded assertions and embedding-based assertions such as answer-relevance or similar:

defaultTest:
options:
provider:
text:
id: openai:chat:llm_judge
config:
apiBaseUrl: http://localhost:8000/v1
apiKey: token-abc123
temperature: 0
showThinking: false
embedding:
id: openai:embedding:intfloat/e5-large-v2
config:
apiBaseUrl: http://localhost:8000/v1
apiKey: token-abc123

Troubleshooting

SymptomFix
API key is not setSet apiKey in provider config, or set OPENAI_API_KEY. If vLLM was started without --api-key, any placeholder such as empty is fine.
ECONNREFUSEDUse 127.0.0.1 instead of localhost, verify the vLLM port, and confirm Docker or a remote host exposes the port.
Promptfoo calls OpenAI instead of vLLMPut apiBaseUrl: http://.../v1 on the provider object, or set OPENAI_BASE_URL. Do not set apiBaseUrl to /v1/chat/completions.
Judge returns Could not extract JSON, wrong categories, odd RAG scores, or wrong select-best winnersSet showThinking: false on the judge provider and keep the full provider object in defaultTest.options.provider or assert.provider.
Judge output still starts with <think> even with showThinking: falseThe generation was truncated before vLLM split reasoning into reasoning_content. Increase --max-model-len / max_tokens, or disable thinking via passthrough.chat_template_kwargs.enable_thinking: false.
search-rubric uses a different provider than vLLMThis is expected unless the configured provider has web-search capability. Plain vLLM chat is not a search provider; configure a web-search-capable grader for search-rubric.
assert.provider appears to ignore defaultTest.options.provider.configThis is expected precedence. Use the full provider object at the assertion level, or remove assert.provider so the default provider object is inherited.

Run with --no-cache while debugging:

promptfoo eval -c promptfooconfig.yaml --no-cache -o results.json

Then inspect the judge result:

jq '.results.results[].gradingResult.componentResults[] | {pass, score, reason}' results.json