Guides

📄️ Testing LLM chains

Prompt chaining is a common pattern used to perform more complex reasoning with LLMs. It's used by libraries like LangChain, and OpenAI has released built-in support via OpenAI functions.

📄️ Evaluating factuality

How to evaluate the factual accuracy of LLM outputs against reference information using promptfoo's factuality assertion

Retrieval-augmented generation is a method for enriching LLM prompts with relevant data. Typically, the user prompt will be converting into an embedding and matching documents are fetched from a vector store. Then, the LLM is called with the matching documents as part of the prompt.

📄️ HLE Benchmark

Run evaluations against Humanity's Last Exam using promptfoo - the most challenging AI benchmark with expert-crafted questions across 100+ subjects.

📄️ OpenAI vs Azure benchmark

Whether you use GPT through the OpenAI or Azure APIs, the results are pretty similar. But there are some key differences:

📄️ Red teaming a Chatbase Chatbot

Chatbase is a platform for building custom AI chatbots that can be embedded into websites for customer support, lead generation, and user engagement. These chatbots use RAG (Retrieval-Augmented Generation) to access your organization's knowledge base and maintain conversations with users.

📄️ Choosing the best GPT model

This guide will walk you through how to compare OpenAI's GPT-4o and GPT-4o-mini, top contenders for the most powerful and effective GPT models. This testing framework will give you the chance to test the models' reasoning capabilities, cost, and latency.

📄️ Claude 3.7 vs GPT-4.1

Learn how to benchmark Claude 3.7 against GPT-4.1 using your own data with promptfoo. Discover which model performs best for your specific use case.

📄️ Cohere Command-R benchmarks

While public benchmarks provide a general sense of capability, the only way to truly understand which model will perform best for your specific application is to run your own custom evaluation.

📄️ Llama vs GPT benchmark

This guide describes how to compare three models - Llama 3.1 405B, GPT 4o, and GPT 4o-mini - using the promptfoo CLI.

📄️ DBRX benchmarks

There are many generic benchmarks that measure LLMs like DBRX, Mixtral, and others in a similar performance class. But public benchmarks are often gamed and don't always reflect real use cases.

📄️ Deepseek benchmark

Deepseek is a new Mixture-of-Experts (MoE) model that's all the rage due to its impressive performance, especially in code tasks. Its MoE architecture has 671B total parameters, though only 37B are activated for each token. This allows for efficient inference while maintaining powerful capabilities.

📄️ Evaluating LLM safety with HarmBench

Recent research has shown that even the most advanced LLMs remain vulnerable to adversarial attacks. Recent reports from security researchers have documented threat actors exploiting these vulnerabilities to generate malware variants and evade detection systems, highlighting the importance of robust safety testing for any LLM-powered application.

📄️ Red teaming a CrewAI Agent

CrewAI is a cutting-edge multi-agent platform designed to help teams streamline complex workflows by connecting multiple automated agents. Whether you’re building recruiting bots, research agents, or task automation pipelines, CrewAI gives you a flexible way to run and manage them on any cloud or local setup.

📄️ Evaluating JSON outputs

Getting an LLM to output valid JSON can be a difficult task. There are a few failure modes:

📄️ Choosing the right temperature for your LLM

The temperature` setting in language models is like a dial that adjusts how predictable or surprising the responses from the model will be, helping application developers fine-tune the AI's creativity to suit different tasks.

📄️ Evaluating OpenAI Assistants

OpenAI recently released an Assistants API that offers simplified handling for message state and tool usage. It also enables code interpreter and knowledge retrieval features, abstracting away some of the dirty work for implementing RAG architecture.

📄️ Evaluating Replicate Lifeboat

Replicate put together a "Lifeboat" OpenAI proxy that allows you to swap to their hosted Llama2-70b instances. They are generously providing this API for free for a week.

📄️ Gemini vs GPT

When comparing Gemini with GPT, you'll find plenty of eval and opinions online. Model capabilities set a ceiling on what you're able to accomplish, but in my experience most LLM apps are highly dependent on their prompting and use case.

📄️ Gemma vs Llama

Comparing Google's Gemma and Meta's Llama involves more than just looking at their specs and reading about generic benchmarks. The true measure of their usefulness comes down to how they perform on the specific tasks you need them for, in the context of your specific application.

📄️ Gemma vs Mistral/Mixtral

When comparing the performance of LLMs, it's best not to rely on generic benchmarks. This guide shows you how to set up a comprehensive benchmark that compares Gemma vs Mistral vs Mixtral.

📄️ GPT 3.5 vs GPT 4

This guide will walk you through how to compare OpenAI's GPT-3.5 and GPT-4 using promptfoo. This testing framework will give you the chance to test the models' reasoning capabilities, cost, and latency.

📄️ GPT-4o vs GPT-4o-mini

OpenAI released gpt-4o-mini, a highly cost-efficient small model designed to expand the range of applications built with AI by making intelligence more affordable. GPT-4o mini surpasses GPT-3.5 Turbo in performance and affordability, and while it is more cost-effective than GPT-4o, it maintains strong capabilities in both textual intelligence and multimodal reasoning.

📄️ GPT-4.1 vs GPT-4o MMLU

Compare GPT-4.1 and GPT-4o performance on MMLU academic reasoning tasks using promptfoo with step-by-step setup and research-backed optimization techniques.

📄️ gpt-4.1 vs o1

Learn how to benchmark OpenAI o1 and o1-mini. Discover which model performs best for your specific use case.

📄️ Using LangChain PromptTemplate with Promptfoo

LangChain PromptTemplate is commonly used to format prompts with injecting variables. Promptfoo allows you to evaluate and test your prompts systematically. Combining the two can streamline your workflow, enabling you to test the prompts that use LangChain PromptTemplate in application code directly within Promptfoo.

📄️ Uncensored Llama2 benchmark

Most LLMs go through fine-tuning that prevents them from answering questions like "How do you make Tylenol", "Who would win in a fist fight...", and "Write a recipe for dangerously spicy mayo."

📄️ How to red team LLM applications

Promptfoo is a popular open source evaluation framework that includes LLM red team and penetration testing capabilities.

📄️ Magistral AIME2024 Benchmark

Reproduce Mistral's Magistral 73.6% AIME2024 mathematical reasoning benchmark using promptfoo with a simple evaluation setup comparing Magistral Medium vs Small.

📄️ Mistral vs Llama

When Mistral was was released, it was the "best 7B model to date" based on a number of evals. Mixtral, a mixture-of-experts model based on Mistral, was recently announced with even more impressive eval performance.

📄️ Mixtral vs GPT

In this guide, we'll walk through the steps to compare three large language models (LLMs): Mixtral, GPT-4o-mini, and GPT-4o. We will use promptfoo, a command-line interface (CLI) tool, to run evaluations and compare the performance of these models based on a set of prompts and test cases.

📄️ Multi-Modal Red Teaming

Learn how to use promptfoo to test the robustness of multi-modal LLMs against adversarial inputs involving text, images, and audio.

📄️ Phi vs Llama

When choosing between LLMs like Phi 3 and Llama 3.1, it's important to benchmark them on your specific use cases rather than relying solely on public benchmarks. When models are in the same ballpark, the specific application makes a big difference.

📄️ Preventing hallucinations

LLMs have great potential, but they are prone to generating incorrect or misleading information, a phenomenon known as hallucination. Factuality and LLM "grounding" are key concerns for developers building LLM applications.

📄️ Qwen vs Llama vs GPT

As a product developer using LLMs, you are likely focused on a specific use case. Generic benchmarks are easily gamed and often not applicable to specific product needs. The best way to improve quality in your LLM app is to construct your own benchmark.