Google Vertex
The `vertex` provider enables integration with Google's Vertex AI platform, which provides access to foundation models including Gemini, PaLM (Bison), Llama, Claude, and specialized models for text, code, and embeddings.
Available Models
Latest Gemini Models
- `vertex:gemini-2.5-pro` - Latest stable Gemini 2.5 Pro model with enhanced reasoning, coding, and multimodal understanding (2M context)
- `vertex:gemini-2.5-flash` - Latest stable Flash model with enhanced reasoning and thinking capabilities
- `vertex:gemini-2.5-flash-lite` - Most cost-efficient and fastest 2.5 model yet, optimized for high-volume, latency-sensitive tasks
- `vertex:gemini-2.5-flash-preview-04-17` - Previous Flash preview with thinking capabilities for enhanced reasoning
- `vertex:gemini-2.0-pro` - Strongest model quality, especially for code and world knowledge, with a 2M-token context window
- `vertex:gemini-2.0-flash-thinking-exp` - Enhanced reasoning capabilities with the thinking process included in responses
- `vertex:gemini-2.0-flash-001` - Workhorse model for daily tasks with strong overall performance and real-time streaming
- `vertex:gemini-2.0-flash-lite-preview-02-05` - Cost-effective offering for high throughput
- `vertex:gemini-1.5-flash` - Fast and efficient for high-volume, cost-effective applications
- `vertex:gemini-1.5-pro` - Strong performance for text/chat with long-context understanding
- `vertex:gemini-1.5-pro-latest` - Latest Gemini 1.5 Pro model with the same capabilities as gemini-1.5-pro
- `vertex:gemini-1.5-flash-8b` - Small model optimized for high-volume, lower-complexity tasks
Claude Models
Anthropic's Claude models are available with the following versions:
- `vertex:claude-3-haiku@20240307` - Fast Claude 3 Haiku
- `vertex:claude-3-sonnet@20240229` - Claude 3 Sonnet
- `vertex:claude-3-opus@20240229` - Claude 3 Opus (Public Preview)
- `vertex:claude-3-5-haiku@20241022` - Claude 3.5 Haiku
- `vertex:claude-3-5-sonnet-v2@20241022` - Claude 3.5 Sonnet
Claude models require explicit access enablement through the Vertex AI Model Garden. Navigate to the Model Garden, search for "Claude", and enable the specific models you need.
Note: Claude models support up to 200,000 tokens context length and include built-in safety features.
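Once enabled, Claude models use the same provider syntax as other Vertex models; a minimal sketch mirroring the Claude settings shown elsewhere on this page:

```yaml
providers:
  - id: vertex:claude-3-5-haiku@20241022
    config:
      region: us-east5 # Claude models are only served in certain regions
      anthropic_version: 'vertex-2023-10-16'
      max_tokens: 1024
```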
Llama Models (Preview)
Meta's Llama models are available through Vertex AI with the following versions:
- `vertex:llama4-scout-instruct-maas` - Llama 4 Scout 17B (16 experts) with 10M context
- `vertex:llama4-maverick-instruct-maas` - Llama 4 Maverick 17B (128 experts) with 1M context
- `vertex:llama-3.3-70b-instruct-maas` - Latest Llama 3.3 70B model (Preview)
- `vertex:llama-3.2-90b-vision-instruct-maas` - Vision-capable Llama 3.2 90B (Preview)
- `vertex:llama-3.1-405b-instruct-maas` - Llama 3.1 405B (GA)
- `vertex:llama-3.1-70b-instruct-maas` - Llama 3.1 70B (Preview)
- `vertex:llama-3.1-8b-instruct-maas` - Llama 3.1 8B (Preview)
Note: Llama models include built-in safety features through Llama Guard. Llama 4 models support context lengths of up to 10M tokens (Scout) and 1M tokens (Maverick) and are natively multimodal, accepting both text and image inputs.
Llama Configuration Example

```yaml
providers:
  - id: vertex:llama-3.3-70b-instruct-maas
    config:
      region: us-central1 # Llama models are only available in this region
      temperature: 0.7
      maxOutputTokens: 1024
      llamaConfig:
        safetySettings:
          enabled: true # Llama Guard is enabled by default
          llama_guard_settings: {} # Optional custom settings
  - id: vertex:llama4-scout-instruct-maas
    config:
      region: us-central1
      temperature: 0.7
      maxOutputTokens: 2048
      llamaConfig:
        safetySettings:
          enabled: true
```
By default, Llama models use Llama Guard for content safety. You can disable it by setting `enabled: false`, but this is not recommended for production use.
Gemma Models (Open Models)
- `vertex:gemma` - Lightweight open text model for generation, summarization, and extraction
- `vertex:codegemma` - Lightweight code generation and completion model
- `vertex:paligemma` - Lightweight vision-language model for image tasks
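Open models can be referenced with the same provider syntax; a minimal sketch, assuming the model is available to your project (open models may first need to be enabled or deployed via the Model Garden):

```yaml
providers:
  - id: vertex:gemma
    config:
      region: us-central1
      generationConfig:
        temperature: 0.7
        maxOutputTokens: 512
```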
PaLM 2 (Bison) Models
Note: PaLM (Bison) models are scheduled for deprecation in April 2025. Migrating to Gemini models is recommended.
- `vertex:chat-bison[@001|@002]` - Chat model
- `vertex:chat-bison-32k[@001|@002]` - Extended context chat
- `vertex:codechat-bison[@001|@002]` - Code-specialized chat
- `vertex:codechat-bison-32k[@001|@002]` - Extended context code chat
- `vertex:text-bison[@001|@002]` - Text completion
- `vertex:text-unicorn[@001]` - Specialized text model
- `vertex:code-bison[@001|@002]` - Code completion
- `vertex:code-bison-32k[@001|@002]` - Extended context code completion
Embedding Models
- `vertex:textembedding-gecko@001` - Text embeddings (3,072 tokens, 768d)
- `vertex:textembedding-gecko@002` - Text embeddings (2,048 tokens, 768d)
- `vertex:textembedding-gecko@003` - Text embeddings (2,048 tokens, 768d)
- `vertex:text-embedding-004` - Latest text embeddings (2,048 tokens, ≤768d)
- `vertex:text-embedding-005` - Latest text embeddings (2,048 tokens, ≤768d)
- `vertex:textembedding-gecko-multilingual@001` - Multilingual embeddings (2,048 tokens, 768d)
- `vertex:text-multilingual-embedding-002` - Latest multilingual embeddings (2,048 tokens, ≤768d)
- `vertex:multimodalembedding` - Multimodal embeddings for text, image, and video
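Embedding models are typically used for similarity-based assertions rather than generation. A minimal sketch, assuming the `similar` assertion type and the `vertex:embedding:` prefix shown later on this page:

```yaml
defaultTest:
  options:
    provider:
      embedding: vertex:embedding:text-embedding-004

tests:
  - vars:
      text: 'The quick brown fox'
    assert:
      - type: similar
        value: 'A fast auburn fox'
        threshold: 0.8
```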
Model Capabilities​
Gemini 2.0 Pro Specifications​
- Max input tokens: 2,097,152
- Max output tokens: 8,192
- Training data: Up to June 2024
- Supports: Text, code, images, audio, video, PDF inputs
- Features: System instructions, JSON support, grounding with Google Search
Language Support​
Gemini models support a wide range of languages including:
- Core languages: Arabic, Bengali, Chinese (simplified/traditional), English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Thai, Turkish, Vietnamese
- Gemini 1.5 adds support for 50+ additional languages including regional and less common languages
If you're using Google AI Studio directly, see the `google` provider documentation instead.
Setup and Authentication​
1. Install Dependencies​
Install Google's official auth client:
```sh
npm install google-auth-library
```
2. Enable API Access

- Enable the Vertex AI API in your Google Cloud project
- For Claude models, request access through the Vertex AI Model Garden by:
  - Navigating to "Model Garden"
  - Searching for "Claude"
  - Clicking "Enable" on the models you want to use
- Set your project in the gcloud CLI:

```sh
gcloud config set project PROJECT_ID
```
3. Authentication Methods
Choose one of these authentication methods:
Option 1: Application Default Credentials (Recommended)
This is the most secure and flexible approach for development and production:

```sh
# First, authenticate with Google Cloud
gcloud auth login

# Then, set up application default credentials
gcloud auth application-default login

# Set your project ID
export VERTEX_PROJECT_ID="your-project-id"
```
Option 2: Service Account (Production)​
For production environments or CI/CD pipelines:
- Create a service account in your Google Cloud project
- Download the credentials JSON file
- Set the environment variables:

```sh
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
export VERTEX_PROJECT_ID="your-project-id"
```
Option 3: Direct API Key (Quick Testing)​
For quick testing, you can use a temporary access token:
```sh
# Get a temporary access token
export VERTEX_API_KEY=$(gcloud auth print-access-token)
export VERTEX_PROJECT_ID="your-project-id"
```
Note: Access tokens expire after 1 hour. For long-running evaluations, use Application Default Credentials or Service Account authentication.
4. Environment Variables​
Promptfoo automatically loads environment variables from your shell or a `.env` file. Create a `.env` file in your project root:

```sh
# .env
VERTEX_PROJECT_ID=your-project-id
VERTEX_REGION=us-central1

# Optional: For direct API key authentication
VERTEX_API_KEY=your-access-token
```

Remember to add `.env` to your `.gitignore` file to avoid accidentally committing sensitive information.
Configuration​
Environment Variables​
The following environment variables can be used to configure the Vertex AI provider:
| Variable | Description | Default | Required |
| --- | --- | --- | --- |
| `VERTEX_PROJECT_ID` | Your Google Cloud project ID | None | Yes |
| `VERTEX_REGION` | The region for Vertex AI resources | `us-central1` | No |
| `VERTEX_API_KEY` | Direct API token (from `gcloud auth print-access-token`) | None | No* |
| `VERTEX_PUBLISHER` | Model publisher | `google` | No |
| `VERTEX_API_HOST` | Override API host (e.g., for a proxy) | Auto-generated | No |
| `VERTEX_API_VERSION` | API version | `v1` | No |
| `GOOGLE_APPLICATION_CREDENTIALS` | Path to service account credentials | None | No* |
*At least one authentication method is required (ADC, service account, or API key)
Region Selection​
Different models are available in different regions. Common regions include:
- `us-central1` - Default; most models available
- `us-east4` - Additional capacity
- `us-east5` - Claude models available
- `europe-west1` - EU region; Claude models available
- `europe-west4` - EU region
- `asia-southeast1` - Asia region; Claude models available
Example configuration with specific region:
```yaml
providers:
  - id: vertex:claude-3-5-sonnet-v2@20241022
    config:
      region: us-east5 # Claude models require specific regions
      projectId: my-project-id
```
Quick Start​
1. Basic Setup​
After completing authentication, create a simple evaluation:
```yaml
# promptfooconfig.yaml
providers:
  - vertex:gemini-2.5-flash

prompts:
  - 'Analyze the sentiment of this text: {{text}}'

tests:
  - vars:
      text: "I love using Vertex AI, it's incredibly powerful!"
    assert:
      - type: contains
        value: 'positive'
  - vars:
      text: "The service is down and I can't access my models."
    assert:
      - type: contains
        value: 'negative'
```

Run the eval:

```sh
promptfoo eval
```
2. Multi-Model Comparison​
Compare different models available on Vertex AI:
```yaml
providers:
  # Google models
  - id: vertex:gemini-2.5-pro
    config:
      region: us-central1
  # Claude models (require specific region)
  - id: vertex:claude-3-5-sonnet-v2@20241022
    config:
      region: us-east5
  # Llama models
  - id: vertex:llama-3.3-70b-instruct-maas
    config:
      region: us-central1

prompts:
  - 'Write a Python function to {{task}}'

tests:
  - vars:
      task: 'calculate fibonacci numbers'
    assert:
      - type: javascript
        value: output.includes('def') && output.includes('fibonacci')
      - type: llm-rubric
        value: 'The code should be efficient and well-commented'
```
3. Using with CI/CD​
For automated testing in CI/CD pipelines:
```yaml
# .github/workflows/llm-test.yml
name: LLM Testing
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: google-github-actions/auth@v1
        with:
          credentials_json: ${{ secrets.GCP_CREDENTIALS }}
      - name: Run promptfoo tests
        run: |
          npx promptfoo@latest eval
        env:
          VERTEX_PROJECT_ID: ${{ vars.GCP_PROJECT_ID }}
          VERTEX_REGION: us-central1
```
4. Advanced Configuration Example​
```yaml
providers:
  - id: vertex:gemini-2.5-pro
    config:
      projectId: ${VERTEX_PROJECT_ID}
      region: ${VERTEX_REGION:-us-central1}
      generationConfig:
        temperature: 0.2
        maxOutputTokens: 2048
        topP: 0.95
      safetySettings:
        - category: HARM_CATEGORY_DANGEROUS_CONTENT
          threshold: BLOCK_ONLY_HIGH
      systemInstruction: |
        You are a helpful coding assistant.
        Always provide clean, efficient, and well-documented code.
        Follow best practices for the given programming language.
```
Provider Configuration​
Configure model behavior using the following options:
```yaml
providers:
  # For Gemini models
  - id: vertex:gemini-2.5-pro
    config:
      generationConfig:
        temperature: 0
        maxOutputTokens: 1024
        topP: 0.8
        topK: 40
  # For Llama models
  - id: vertex:llama-3.3-70b-instruct-maas
    config:
      generationConfig:
        temperature: 0.7
        maxOutputTokens: 1024
      extra_body:
        google:
          model_safety_settings:
            enabled: true
            llama_guard_settings: {}
  # For Claude models
  - id: vertex:claude-3-5-sonnet-v2@20241022
    config:
      anthropic_version: 'vertex-2023-10-16'
      max_tokens: 1024
```
Safety Settings​
Control AI safety filters:
```yaml
- id: vertex:gemini-pro
  config:
    safetySettings:
      - category: HARM_CATEGORY_HARASSMENT
        threshold: BLOCK_ONLY_HIGH
      - category: HARM_CATEGORY_VIOLENCE
        threshold: BLOCK_MEDIUM_AND_ABOVE
```
See Google's SafetySetting API documentation for details.
Model-Specific Features​
Llama Model Features
- Support for text and vision tasks (Llama 3.2 and all Llama 4 models)
- Built-in safety with Llama Guard (enabled by default)
- Available in the `us-central1` region
- Quota limits vary by model version
- Requires a specific endpoint format for API calls
- Only supports unary (non-streaming) responses in promptfoo
Llama Model Considerations
- **Regional Availability**: Llama models are available only in the `us-central1` region
- **Guard Integration**: All Llama models use Llama Guard for content safety by default
- **Specific Endpoint**: Uses a different API endpoint than other Vertex models
- **Model Status**: Most models are in Preview, with Llama 3.1 405B being Generally Available (GA)
- **Vision Support**: Llama 3.2 90B and all Llama 4 models support image input
Claude Model Features​
- Support for text, code, and analysis tasks
- Tool use (function calling) capabilities
- Available in multiple regions (us-east5, europe-west1, asia-southeast1)
- Quota limits vary by model version (20-245 QPM)
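Tool use with Claude follows Anthropic's tool schema (`name`, `description`, `input_schema`). A hypothetical sketch, assuming tool definitions are passed through the provider config unchanged:

```yaml
providers:
  - id: vertex:claude-3-5-sonnet-v2@20241022
    config:
      region: us-east5
      anthropic_version: 'vertex-2023-10-16'
      max_tokens: 1024
      # Anthropic-style tool definition (assumed pass-through)
      tools:
        - name: get_weather
          description: Get weather information for a city
          input_schema:
            type: object
            properties:
              location:
                type: string
                description: City name
            required: [location]
```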
Advanced Usage​
Default Grading Provider​
When Google credentials are configured (and no OpenAI/Anthropic keys are present), Vertex AI becomes the default provider for:
- Model grading
- Suggestions
- Dataset generation
Override grading providers using `defaultTest`:

```yaml
defaultTest:
  options:
    provider:
      # For llm-rubric and factuality assertions
      text: vertex:gemini-1.5-pro-002
      # For similarity comparisons
      embedding: vertex:embedding:text-embedding-004
```
Configuration Reference​
| Option | Description | Default |
| --- | --- | --- |
| `apiKey` | GCloud API token | None |
| `apiHost` | API host override | `{region}-aiplatform.googleapis.com` |
| `apiVersion` | API version | `v1` |
| `projectId` | GCloud project ID | None |
| `region` | GCloud region | `us-central1` |
| `publisher` | Model publisher | `google` |
| `context` | Model context | None |
| `examples` | Few-shot examples | None |
| `safetySettings` | Content filtering | None |
| `generationConfig.temperature` | Randomness control | None |
| `generationConfig.maxOutputTokens` | Max tokens to generate | None |
| `generationConfig.topP` | Nucleus sampling | None |
| `generationConfig.topK` | Sampling diversity | None |
| `generationConfig.stopSequences` | Generation stop triggers | `[]` |
| `toolConfig` | Tool/function calling config | None |
| `systemInstruction` | System prompt (supports `{{var}}` and `file://`) | None |
Not all models support all parameters. See Google's documentation for model-specific details.
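Several of these options do not appear in the examples above. A hypothetical sketch combining them (the proxy host `vertex-proxy.internal.example.com` is a placeholder, not a real endpoint):

```yaml
providers:
  - id: vertex:gemini-2.5-pro
    config:
      projectId: my-project-id
      region: europe-west4
      apiHost: vertex-proxy.internal.example.com # placeholder proxy in front of Vertex
      apiVersion: v1
      generationConfig:
        temperature: 0
        maxOutputTokens: 512
        stopSequences: ['###']
```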
Troubleshooting​
Authentication Errors​
If you see an error like:

```
API call error: Error: {"error":"invalid_grant","error_description":"reauth related error (invalid_rapt)","error_uri":"https://support.google.com/a/answer/9368756","error_subtype":"invalid_rapt"}
```

Re-authenticate using:

```sh
gcloud auth application-default login
```
Claude Model Access Errors​
If you encounter errors like:

```
API call error: Error: Project is not allowed to use Publisher Model `projects/.../publishers/anthropic/models/claude-*`
```

or

```
API call error: Error: Publisher Model is not servable in region us-central1
```

You need to:

- Enable access to Claude models:
  - Visit the Vertex AI Model Garden
  - Search for "Claude"
  - Click "Enable" on the specific Claude models you want to use
- Use a supported region. Claude models are only available in:
  - `us-east5`
  - `europe-west1`

Example configuration with the correct region:

```yaml
providers:
  - id: vertex:claude-3-5-sonnet-v2@20241022
    config:
      region: us-east5 # or europe-west1
      anthropic_version: 'vertex-2023-10-16'
      max_tokens: 1024
```
Model Features and Capabilities​
Function Calling and Tools​
Gemini and Claude models support function calling and tool use. Configure tools in your provider:

```yaml
providers:
  - id: vertex:gemini-2.5-pro
    config:
      toolConfig:
        functionCallingConfig:
          mode: 'AUTO' # or "ANY", "NONE"
          allowedFunctionNames: ['get_weather', 'search_places']
      tools:
        - functionDeclarations:
            - name: 'get_weather'
              description: 'Get weather information'
              parameters:
                type: 'OBJECT'
                properties:
                  location:
                    type: 'STRING'
                    description: 'City name'
                required: ['location']
```

Tools can also be loaded from external files:

```yaml
providers:
  - id: vertex:gemini-2.5-pro
    config:
      tools: 'file://tools.json' # Supports variable substitution
```
For practical examples of function calling with Vertex AI models, see the google-vertex-tools example which demonstrates both basic tool declarations and callback execution.
System Instructions​
Configure system-level instructions for the model:

```yaml
providers:
  - id: vertex:gemini-2.5-pro
    config:
      # Direct text
      systemInstruction: 'You are a helpful assistant'
      # Or load from file instead:
      # systemInstruction: file://system-instruction.txt
```
System instructions support Nunjucks templating and can be loaded from external files for better organization and reusability.
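Since `systemInstruction` supports Nunjucks templating, test variables can be interpolated into it. A minimal sketch, assuming a `language` variable defined in the test:

```yaml
providers:
  - id: vertex:gemini-2.5-pro
    config:
      # {{language}} is filled from test vars via Nunjucks
      systemInstruction: 'You are a helpful assistant. Respond only in {{language}}.'

tests:
  - vars:
      language: French
```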
Generation Configuration​
Fine-tune model behavior with these parameters:
```yaml
providers:
  - id: vertex:gemini-2.5-pro
    config:
      generationConfig:
        temperature: 0.7 # Controls randomness (0.0 to 1.0)
        maxOutputTokens: 1024 # Limit response length
        topP: 0.8 # Nucleus sampling
        topK: 40 # Top-k sampling
        stopSequences: ["\n"] # Stop generation at specific sequences
```
Context and Examples​
Provide context and few-shot examples:
```yaml
providers:
  - id: vertex:gemini-2.5-pro
    config:
      context: 'You are an expert in machine learning'
      examples:
        - input: 'What is regression?'
          output: 'Regression is a statistical method...'
```
Safety Settings​
Configure content filtering with granular control:
```yaml
providers:
  - id: vertex:gemini-2.5-pro
    config:
      safetySettings:
        - category: 'HARM_CATEGORY_HARASSMENT'
          threshold: 'BLOCK_ONLY_HIGH'
        - category: 'HARM_CATEGORY_HATE_SPEECH'
          threshold: 'BLOCK_MEDIUM_AND_ABOVE'
        - category: 'HARM_CATEGORY_SEXUALLY_EXPLICIT'
          threshold: 'BLOCK_LOW_AND_ABOVE'
```
Thinking Configuration​
For models that support thinking capabilities (like Gemini 2.5 Flash), you can configure the thinking budget:
```yaml
providers:
  - id: vertex:gemini-2.5-flash-preview-04-17
    config:
      generationConfig:
        temperature: 0.7
        maxOutputTokens: 2048
        thinkingConfig:
          thinkingBudget: 1024 # Controls tokens allocated for the thinking process
```
The thinking configuration allows the model to show its reasoning process before providing the final answer. This is particularly useful for:
- Complex problem solving
- Mathematical reasoning
- Step-by-step analysis
- Decision making tasks
When using thinking configuration:
- The `thinkingBudget` must be at least 1024 tokens
- The budget counts toward your total token usage
- The model will show its reasoning process in the response
Search Grounding​
Search grounding allows Gemini models to access the internet for up-to-date information, enhancing responses about recent events and real-time data.
Basic Usage​
Use the object format to enable Search grounding:

```yaml
providers:
  - id: vertex:gemini-2.5-pro
    config:
      tools:
        - googleSearch: {}
```
Combining with Other Features​
You can combine Search grounding with thinking capabilities for better reasoning:
```yaml
providers:
  - id: vertex:gemini-2.5-flash-preview-04-17
    config:
      generationConfig:
        thinkingConfig:
          thinkingBudget: 1024
      tools:
        - googleSearch: {}
```
Use Cases​
Search grounding is particularly valuable for:
- Current events and news
- Recent developments
- Stock prices and market data
- Sports results
- Technical documentation updates
Working with Response Metadata​
When using Search grounding, the API response includes additional metadata:
- `groundingMetadata` - Contains information about the search results used
- `groundingChunks` - Web sources that informed the response
- `webSearchQueries` - Queries used to retrieve information
Requirements and Limitations​
- Important: Per Google's requirements, applications using Search grounding must display Google Search Suggestions included in the API response metadata
- Search results may vary by region and time
- Results may be subject to Google Search rate limits
- Search will only be performed when the model determines it's necessary
For more details, see the Google documentation on Grounding with Google Search.
See Also​
- Google AI Studio Provider - For direct Google AI Studio integration
- Vertex AI Examples - Browse working examples for Vertex AI
- Google Cloud Documentation - Official Vertex AI documentation
- Model Garden - Access and enable additional models