vLLM
vLLM's OpenAI-compatible server
implements /v1/chat/completions, /v1/completions, /v1/responses, and /v1/embeddings.
Promptfoo connects to it through the OpenAI provider by changing apiBaseUrl.
Use this page when:
- vLLM is the model under test
- vLLM is the local LLM-as-a-judge provider for model-graded assertions such as
llm-rubric, g-eval, factuality, answer-relevance, context-*, or select-best
- your vLLM model returns a separate reasoning field and promptfoo should grade only the final content
Start a vLLM server
vllm serve Qwen/Qwen3-8B \
--host 0.0.0.0 \
--port 8000 \
--served-model-name qwen3-8b \
--api-key token-abc123
Then verify the server directly:
curl http://localhost:8000/v1/chat/completions \
-H 'Authorization: Bearer token-abc123' \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen3-8b",
"messages": [{"role": "user", "content": "Reply with OK"}],
"max_tokens": 8
}'
apiBaseUrl should be the /v1 root. Promptfoo appends /chat/completions,
/completions, or /embeddings depending on the provider type.
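For example, with the server above, a chat provider id and the /v1 base combine as follows; the comments show the URL promptfoo would call (a sketch, assuming the default OpenAI-compatible path handling):

```yaml
providers:
  - id: openai:chat:qwen3-8b # promptfoo POSTs to http://localhost:8000/v1/chat/completions
    config:
      apiBaseUrl: http://localhost:8000/v1 # the /v1 root only, no endpoint path
      apiKey: token-abc123
# openai:completion:... and openai:embedding:... providers append /completions and /embeddings instead
```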
For a wiring-only smoke test, a tiny reasoning model such as Qwen/Qwen3-0.6B can verify that
promptfoo reaches vLLM and that showThinking behaves correctly. Do not use a tiny model as a real
judge; use it only to test the endpoint, parser, and config shape:
vllm serve Qwen/Qwen3-0.6B \
--host 0.0.0.0 \
--port 8000 \
--served-model-name llm_judge \
--reasoning-parser qwen3 \
--max-model-len 4096 \
--api-key token-abc123
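A minimal wiring check against that smoke-test server could look like the following; the echo target and the trivial rubric exist only to confirm that promptfoo reaches vLLM, that the reasoning parser splits the output, and that showThinking: false leaves only the final content:

```yaml
# promptfooconfig.yaml -- wiring check only, not a real judge setup
prompts:
  - '{{answer}}'
providers:
  - echo
defaultTest:
  options:
    provider:
      id: openai:chat:llm_judge
      config:
        apiBaseUrl: http://localhost:8000/v1
        apiKey: token-abc123
        showThinking: false
tests:
  - vars:
      answer: 'The sky is blue.'
    assert:
      - type: llm-rubric
        value: 'Pass if the answer mentions a color.'
```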
Example judge models
Keep --served-model-name short and stable; promptfoo uses that alias in
openai:chat:<served-model-name>.
GPT-OSS
For GPT-OSS, use the Hugging Face model name as the vLLM model and expose a short served alias.
The example below uses openai/gpt-oss-20b; use a larger GPT-OSS checkpoint the same way when your
host has enough memory:
vllm serve openai/gpt-oss-20b \
--host 0.0.0.0 \
--port 8000 \
--served-model-name gpt-oss-20b \
--api-key token-abc123
Prefer a Linux CUDA or ROCm host for GPT-OSS with vLLM. If CPU or ARM serving fails, check the vLLM GPT-OSS recipe for the backend support notes that match your vLLM release.
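Assuming that server is running, pointing promptfoo at the alias follows the same pattern as any other vLLM target; only the model segment of the provider id changes:

```yaml
providers:
  - id: openai:chat:gpt-oss-20b
    label: vLLM gpt-oss-20b
    config:
      apiBaseUrl: http://localhost:8000/v1
      apiKey: token-abc123
```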
GLM-4.7
For GLM-4.7, use a vLLM and Transformers combination that supports the exact GLM checkpoint you are serving. The vLLM GLM recipe keeps the install guidance for GLM releases:
vllm serve zai-org/GLM-4.7-FP8 \
--host 0.0.0.0 \
--port 8000 \
--served-model-name glm-4.7 \
--tensor-parallel-size 4 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--api-key token-abc123
For zai-org/GLM-4.7-Flash, use a served name such as glm-4.7-flash. The tool-calling flags (--tool-call-parser glm47 and --enable-auto-tool-choice) are not needed for ordinary model-graded assertions; keep --reasoning-parser glm45 when you want vLLM to split reasoning from final content.
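As a sketch, a judge entry for the GLM server above then references the served alias and keeps reasoning out of the graded text (the defaultTest judge wiring is explained below):

```yaml
defaultTest:
  options:
    provider:
      id: openai:chat:glm-4.7
      config:
        apiBaseUrl: http://localhost:8000/v1
        apiKey: token-abc123
        temperature: 0
        showThinking: false
```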
Use vLLM as the target model
prompts:
- '{{question}}'
providers:
- id: openai:chat:qwen3-8b
label: vLLM qwen3-8b
config:
apiBaseUrl: http://localhost:8000/v1
apiKey: token-abc123
temperature: 0.2
max_tokens: 512
tests:
- vars:
question: 'What is the capital of France?'
assert:
- type: contains
value: Paris
For completions models, use openai:completion:<served-model-name> instead of
openai:chat:<served-model-name>.
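For instance, a completion-style target against the same server could be declared like this (a sketch; the served model must behave sensibly without a chat template):

```yaml
providers:
  - id: openai:completion:qwen3-8b # promptfoo POSTs to http://localhost:8000/v1/completions
    config:
      apiBaseUrl: http://localhost:8000/v1
      apiKey: token-abc123
      max_tokens: 256
```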
You can also put the endpoint in an environment variable:
export OPENAI_BASE_URL=http://localhost:8000/v1
export OPENAI_API_KEY=token-abc123
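With those variables exported, the provider entry can shrink to just the id, a sketch:

```yaml
providers:
  - openai:chat:qwen3-8b # apiBaseUrl and apiKey are read from OPENAI_BASE_URL and OPENAI_API_KEY
```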
Use vLLM as an LLM judge
Model-graded assertions call a separate grading provider. Configure that provider under
defaultTest.options.provider when every model-graded assertion should use the same vLLM judge:
prompts:
- '{{answer}}'
providers:
# System under test. This can be any provider.
- echo
defaultTest:
options:
provider:
id: openai:chat:llm_judge
label: 'Judge: llm_judge @ vLLM'
config:
apiBaseUrl: http://localhost:8000/v1
apiKey: token-abc123
temperature: 0
max_tokens: 10000
showThinking: false
tests:
- vars:
answer: 'Use the Forgot password link and verify by email or SMS.'
assert:
- type: llm-rubric
value: |
Pass if the answer explains how to reset a password and mentions a verification step.
Do not repeat provider: openai:chat:llm_judge on an assertion when the full provider object
already lives in defaultTest.options.provider. An assertion-level provider overrides the default
provider object, so the apiBaseUrl, apiKey, showThinking, and other config values above will
not be used.
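For contrast, this is the shape to avoid: a bare assertion-level provider id replaces the entire default provider object, so the judge call loses the vLLM apiBaseUrl, apiKey, and showThinking settings:

```yaml
# Anti-pattern: overrides defaultTest.options.provider and drops its config
assert:
  - type: llm-rubric
    value: 'Answer gives correct password reset instructions'
    provider: openai:chat:llm_judge # no apiBaseUrl, apiKey, or showThinking here
```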
If only one assertion should use vLLM, put the full object on that assertion:
assert:
- type: llm-rubric
value: 'Answer gives correct password reset instructions'
provider:
id: openai:chat:llm_judge
config:
apiBaseUrl: http://localhost:8000/v1
apiKey: token-abc123
temperature: 0
max_tokens: 10000
showThinking: false
Thinking and reasoning models
When vLLM is started with a reasoning parser, responses may include:
- message.reasoning_content or message.reasoning: hidden reasoning extracted by vLLM
- message.content: final answer
Promptfoo's OpenAI-compatible chat provider includes reasoning in the returned output by default:
Thinking: <reasoning>
<content>
That is useful when vLLM is the target model, because assertions can inspect the full visible output. It is usually wrong when vLLM is the judge, because model-graded assertions consume the judge output as the material to parse, embed, classify, or score. If the reasoning text contains JSON-looking scratchpad content, attribution markers, candidate sentences, or numeric choices, promptfoo may grade that scratchpad instead of the final answer.
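For reference, the relevant part of a vLLM chat response with a reasoning parser active looks roughly like this; the exact shape depends on the model and parser:

```yaml
# Trimmed sketch of one choice from /v1/chat/completions when --reasoning-parser is set
choices:
  - message:
      role: assistant
      reasoning_content: 'The user asks for the capital of France; that is Paris...' # dropped when showThinking: false
      content: 'Paris' # the only text promptfoo should grade
```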
Set showThinking: false on vLLM judge providers so promptfoo discards reasoning fields and parses
only content.
This depends on vLLM successfully splitting the response. If the request stops before the model
closes its thinking block, vLLM can return raw <think>... text in message.content instead of
reasoning_content. In that case showThinking: false cannot distinguish scratchpad text from
final content. Increase the server --max-model-len and provider max_tokens, or disable thinking
for judge calls with chat_template_kwargs.enable_thinking: false.
search-rubric is special because it requires web search. A plain vLLM chat server is not a
web-search-capable grader; promptfoo will prefer or load a search-capable provider instead. The
showThinking guidance applies to the search provider that actually grades the assertion.
This applies to every model-graded assertion that consumes text from the judge. JSON-first metrics
can parse scratchpad JSON, RAG metrics can score scratchpad sentences or attribution markers,
answer-relevance can embed generated questions with Thinking: prepended, and select-best can
read a scratchpad number as the winning index.
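Concretely, one eval can mix several of these assertion types and let them all inherit the vLLM judge from defaultTest; answer-relevance additionally needs an embedding provider, covered in the provider-map section below:

```yaml
tests:
  - vars:
      answer: 'Use the Forgot password link, then confirm the reset by email.'
    assert:
      - type: llm-rubric
        value: 'Mentions the Forgot password link and a verification step'
      - type: factuality
        value: 'Password resets are confirmed by email.'
      - type: answer-relevance
        threshold: 0.7
```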
Disable thinking at the vLLM API level
showThinking: false only changes what promptfoo reads from the response; the model may still spend
tokens thinking. For small local judges and CI smoke tests, disabling thinking is often faster and
avoids truncated <think> content. Qwen3 and GLM chat templates support disabling thinking per
request through chat_template_kwargs:
defaultTest:
options:
provider:
id: openai:chat:llm_judge
config:
apiBaseUrl: http://localhost:8000/v1
apiKey: token-abc123
showThinking: false
passthrough:
chat_template_kwargs:
enable_thinking: false
For GPT-OSS-style chat completions, request-level reasoning controls use different fields:
defaultTest:
options:
provider:
id: openai:chat:gpt-oss-20b
config:
apiBaseUrl: http://localhost:8000/v1
apiKey: token-abc123
showThinking: false
passthrough:
include_reasoning: false
reasoning_effort: low
Keep showThinking: false even when you pass model-specific controls such as
include_reasoning: false or chat_template_kwargs.enable_thinking: false. Those controls save
tokens when vLLM honors them; showThinking: false is the promptfoo-side guard.
Provider maps for text and embeddings
Some assertions need a text judge and an embedding model. Because each vllm serve process serves a single model, run the embedding model as a second vLLM instance on its own port and point the embedding provider at it. Use a provider map when a single eval uses both text-graded assertions and embedding-based assertions such as answer-relevance or similar:
defaultTest:
options:
provider:
text:
id: openai:chat:llm_judge
config:
apiBaseUrl: http://localhost:8000/v1
apiKey: token-abc123
temperature: 0
showThinking: false
embedding:
id: openai:embedding:intfloat/e5-large-v2
config:
# Assumes a second vllm serve instance hosting the embedding model on port 8001
apiBaseUrl: http://localhost:8001/v1
apiKey: token-abc123
Troubleshooting
| Symptom | Fix |
|---|---|
| API key is not set | Set apiKey in provider config, or set OPENAI_API_KEY. If vLLM was started without --api-key, any placeholder such as empty is fine. |
| ECONNREFUSED | Use 127.0.0.1 instead of localhost, verify the vLLM port, and confirm Docker or a remote host exposes the port. |
| Promptfoo calls OpenAI instead of vLLM | Put apiBaseUrl: http://.../v1 on the provider object, or set OPENAI_BASE_URL. Do not set apiBaseUrl to /v1/chat/completions. |
| Judge returns Could not extract JSON, wrong categories, odd RAG scores, or wrong select-best winners | Set showThinking: false on the judge provider and keep the full provider object in defaultTest.options.provider or assert.provider. |
| Judge output still starts with <think> even with showThinking: false | The generation was truncated before vLLM split reasoning into reasoning_content. Increase --max-model-len / max_tokens, or disable thinking via passthrough.chat_template_kwargs.enable_thinking: false. |
| search-rubric uses a different provider than vLLM | This is expected unless the configured provider has web-search capability. Plain vLLM chat is not a search provider; configure a web-search-capable grader for search-rubric. |
| assert.provider appears to ignore defaultTest.options.provider.config | This is expected precedence. Use the full provider object at the assertion level, or remove assert.provider so the default provider object is inherited. |
Run with --no-cache while debugging:
promptfoo eval -c promptfooconfig.yaml --no-cache -o results.json
Then inspect the judge result:
jq '.results.results[].gradingResult.componentResults[] | {pass, score, reason}' results.json