Replicate

Replicate is an API for machine learning models. It currently hosts models like Llama v2, Gemma, and Mistral/Mixtral.

To run a model, specify the Replicate model name and version in the provider ID, in the form replicate:<owner>/<model>:<version>. For example:

replicate:replicate/llama70b-v2-chat:e951f18578850b652510200860fc4ea62b3b16fac280f83ff32282f87bbd2e48

Examples

Here's an example of using Llama on Replicate. For Llama, the version hash and everything under config are optional:

    providers:
      - id: replicate:meta/llama-2-7b-chat
        config:
          temperature: 0.01
          max_length: 1024
          prompt:
            prefix: '[INST] '
            suffix: ' [/INST]'

Here's an example of using Gemma on Replicate. Note that unlike Llama, it does not have a default version, so we specify the model version:

    providers:
      - id: replicate:google-deepmind/gemma-7b-it:2790a695e5dcae15506138cc4718d1106d0d475e6dca4b1d43f42414647993d5
        config:
          temperature: 0.01
          max_new_tokens: 1024
          prompt:
            prefix: "<start_of_turn>user\n"
            suffix: "<end_of_turn>\n<start_of_turn>model"

Configuration

The Replicate provider supports the following configuration options for customizing model behavior:

| Parameter | Description |
| --- | --- |
| temperature | Controls randomness in the generation process. |
| max_length | Specifies the maximum length of the generated text. |
| max_new_tokens | Limits the number of new tokens to generate. |
| top_p | Nucleus sampling: a float between 0 and 1. |
| top_k | Top-k sampling: number of highest-probability tokens to keep. |
| repetition_penalty | Penalizes repetition of words in the generated text. |
| system_prompt | Sets a system-level prompt for all requests. |
| stop_sequences | Specifies stopping sequences that halt the generation. |
| seed | Sets a seed for reproducible results. |

Not every model supports every completion parameter. Be sure to review the API provided by the model beforehand.
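
As a rough sketch of how several of these options combine, the fragment below sets sampling parameters on the Llama provider from the earlier example. The values are illustrative assumptions, not recommendations, and whether this particular model accepts each parameter is also an assumption that should be checked against the model's API on Replicate:

```yaml
providers:
  - id: replicate:meta/llama-2-7b-chat
    config:
      # Illustrative values only (assumptions, not tuned settings).
      temperature: 0.7
      top_p: 0.9
      top_k: 50
      max_length: 512
      # A fixed seed is intended to make repeated runs reproducible.
      seed: 42
      system_prompt: 'You are a concise assistant.'
```

If a model rejects or ignores one of these keys, removing it from config is the simplest fix, since every entry under config is optional in the Llama example above.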

These parameters are supported for all models:

| Parameter | Description |
| --- | --- |
| apiKey | The API key for authentication with Replicate. |
| prompt.prefix | String added before each prompt. Useful for instruction/chat formatting. |
| prompt.suffix | String added after each prompt. Useful for instruction/chat formatting. |
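
A minimal sketch of these model-independent options in one provider entry, assuming apiKey sits under config alongside prompt (as the dotted names prompt.prefix and prompt.suffix suggest). The key shown is a hypothetical placeholder, not a real token:

```yaml
providers:
  - id: replicate:meta/llama-2-7b-chat
    config:
      # Hypothetical placeholder; substitute your own Replicate API key.
      apiKey: 'your-replicate-api-key'
      prompt:
        # Wraps each prompt in Llama 2 chat instruction markers.
        prefix: '[INST] '
        suffix: ' [/INST]'
```

To keep the key out of the config file, the REPLICATE_API_TOKEN environment variable listed below can likely be used instead of apiKey.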

Supported environment variables:

  • REPLICATE_API_TOKEN - Your Replicate API key.
  • REPLICATE_API_KEY - An alternative to REPLICATE_API_TOKEN for your API key.
  • REPLICATE_MAX_LENGTH - Specifies the maximum length of the generated text.
  • REPLICATE_TEMPERATURE - Controls randomness in the generation process.
  • REPLICATE_REPETITION_PENALTY - Penalizes repetition of words in the generated text.
  • REPLICATE_TOP_P - Controls nucleus sampling: a float between 0 and 1.
  • REPLICATE_TOP_K - Controls top-k sampling: the number of highest-probability vocabulary tokens to keep.
  • REPLICATE_SEED - Sets a seed for reproducible results.
  • REPLICATE_STOP_SEQUENCES - Specifies stopping sequences that halt the generation.
  • REPLICATE_SYSTEM_PROMPT - Sets a system-level prompt for all requests.