📄️ Testing LLM chains
Prompt chaining is a common pattern used to perform more complex reasoning with LLMs. It's used by libraries like LangChain, and OpenAI has released built-in support via OpenAI functions.
📄️ Evaluating factuality
promptfoo implements OpenAI's evaluation methodology for factuality, using the factuality assertion type.
📄️ Evaluating RAG pipelines
Retrieval-augmented generation is a method for enriching LLM prompts with relevant data. Typically, the user prompt is converted into an embedding, and matching documents are fetched from a vector store. Then, the LLM is called with the matching documents as part of the prompt.
📄️ OpenAI vs Azure benchmark
Whether you use GPT through the OpenAI or Azure APIs, the results are pretty similar. But there are some key differences:
📄️ Redteaming a Chatbase Chatbot
Introduction
📄️ Choosing the best GPT model
This guide will walk you through how to compare OpenAI's GPT-4o and GPT-4o-mini, top contenders for the most powerful and effective GPT models. This testing framework will give you the chance to test the models' reasoning capabilities, cost, and latency.
📄️ Claude 3.5 vs GPT-4o
Learn how to benchmark Claude 3.5 against GPT-4o using your own data with promptfoo. Discover which model performs best for your specific use case.
📄️ Cohere Command-R benchmarks
While public benchmarks provide a general sense of capability, the only way to truly understand which model will perform best for your specific application is to run your own custom evaluation.
📄️ Llama vs GPT benchmark
This guide describes how to compare three models - Llama 3.1 405B, GPT-4o, and GPT-4o-mini - using the promptfoo CLI.
📄️ DBRX benchmarks
There are many generic benchmarks that measure LLMs like DBRX, Mixtral, and others in a similar performance class. But public benchmarks are often gamed and don't always reflect real use cases.
📄️ Deepseek benchmark
Deepseek is a new Mixture-of-Experts (MoE) model that's all the rage due to its impressive performance, especially in code tasks. Its MoE architecture has 671B total parameters, though only 37B are activated for each token. This allows for efficient inference while maintaining powerful capabilities.
📄️ Evaluating JSON outputs
Getting an LLM to output valid JSON can be a difficult task. There are a few failure modes:
📄️ Choosing the right temperature for your LLM
The temperature setting in language models is like a dial that adjusts how predictable or surprising the responses from the model will be, helping application developers fine-tune the AI's creativity to suit different tasks.
📄️ Evaluating OpenAI Assistants
OpenAI recently released an Assistants API that offers simplified handling for message state and tool usage. It also enables code interpreter and knowledge retrieval features, abstracting away some of the dirty work for implementing RAG architecture.
📄️ Evaluating Replicate Lifeboat
Replicate put together a "Lifeboat" OpenAI proxy that allows you to swap to their hosted Llama2-70b instances. They are generously providing this API for free for a week.
📄️ Gemini vs GPT
When comparing Gemini with GPT, you'll find plenty of evals and opinions online. Model capabilities set a ceiling on what you're able to accomplish, but in my experience most LLM apps are highly dependent on their prompting and use case.
📄️ Gemma vs Llama
Comparing Google's Gemma and Meta's Llama involves more than just looking at their specs and reading about generic benchmarks. The true measure of their usefulness comes down to how they perform on the specific tasks you need them for, in the context of your specific application.
📄️ Gemma vs Mistral/Mixtral
When comparing the performance of LLMs, it's best not to rely on generic benchmarks. This guide shows you how to set up a comprehensive benchmark that compares Gemma vs Mistral vs Mixtral.
📄️ GPT 3.5 vs GPT 4
This guide will walk you through how to compare OpenAI's GPT-3.5 and GPT-4 using promptfoo. This testing framework will give you the chance to test the models' reasoning capabilities, cost, and latency.
📄️ GPT-4o vs GPT-4o-mini
OpenAI released gpt-4o-mini, a highly cost-efficient small model designed to expand the range of applications built with AI by making intelligence more affordable. GPT-4o mini surpasses GPT-3.5 Turbo in performance and affordability, and while it is more cost-effective than GPT-4o, it maintains strong capabilities in both textual intelligence and multimodal reasoning.
📄️ gpt-4o vs o1
Learn how to benchmark OpenAI o1 and o1-mini. Discover which model performs best for your specific use case.
📄️ Using LangChain PromptTemplate with Promptfoo
LangChain PromptTemplate is commonly used to format prompts by injecting variables. Promptfoo allows you to evaluate and test your prompts systematically. Combining the two can streamline your workflow, enabling you to test the prompts that use LangChain PromptTemplate in application code directly within Promptfoo.
📄️ Uncensored Llama2 benchmark
Most LLMs go through fine-tuning that prevents them from answering questions like "How do you make Tylenol", "Who would win in a fist fight...", and "Write a recipe for dangerously spicy mayo."
📄️ How to red team LLM applications
Promptfoo is a popular open source evaluation framework that includes LLM red team and penetration testing capabilities.
📄️ Mistral vs Llama
When Mistral was released, it was the "best 7B model to date" based on a number of evals. Mixtral, a mixture-of-experts model based on Mistral, was recently announced with even more impressive eval performance.
📄️ Mixtral vs GPT
In this guide, we'll walk through the steps to compare three large language models (LLMs): Mixtral, GPT-4o-mini, and GPT-4o. We will use promptfoo, a command-line interface (CLI) tool, to run evaluations and compare the performance of these models based on a set of prompts and test cases.
📄️ Phi vs Llama
When choosing between LLMs like Phi 3 and Llama 3.1, it's important to benchmark them on your specific use cases rather than relying solely on public benchmarks. When models are in the same ballpark, the specific application makes a big difference.
📄️ Preventing hallucinations
LLMs have great potential, but they are prone to generating incorrect or misleading information, a phenomenon known as hallucination. Factuality and LLM "grounding" are key concerns for developers building LLM applications.
📄️ Qwen vs Llama vs GPT
As a product developer using LLMs, you are likely focused on a specific use case. Generic benchmarks are easily gamed and often not applicable to specific product needs. The best way to improve quality in your LLM app is to construct your own benchmark.
📄️ Sandboxed Evaluations of LLM-Generated Code
You're using LLMs to generate code snippets, functions, or even entire programs. Blindly trusting and executing this generated code in your production environments - or even in development environments - can be a severe security risk.
📄️ Evaluating LLM text-to-SQL performance
Promptfoo is a command-line tool that allows you to test and validate text-to-SQL conversions.