📄️ Evaluating factuality
promptfoo implements OpenAI's evaluation methodology for factuality, using the `factuality` assertion type.
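As a rough sketch, a minimal promptfooconfig.yaml using this assertion might look like the following (the prompt, provider, and reference answer are illustrative):

```yaml
prompts:
  - 'What is the capital of {{state}}?'

providers:
  - openai:gpt-4

tests:
  - vars:
      state: California
    assert:
      # Grades the output against the reference answer using
      # OpenAI's factuality eval prompt
      - type: factuality
        value: The capital of California is Sacramento
```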
📄️ Evaluating RAG pipelines
Retrieval-augmented generation is a method for enriching LLM prompts with relevant data. Typically, the user prompt is converted into an embedding, and matching documents are fetched from a vector store. Then, the LLM is called with the matching documents as part of the prompt.
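A sketch of what a RAG eval config could look like, assuming the retrieved documents are passed in as a `context` variable and graded with the `context-faithfulness` assertion:

```yaml
prompts:
  - |
    Answer the question using only the context below.

    Context: {{context}}

    Question: {{query}}

providers:
  - openai:gpt-3.5-turbo

tests:
  - vars:
      query: What is the refund policy?
      # In a live pipeline, this would be fetched from your vector store
      context: Refunds are available within 30 days of purchase.
    assert:
      # Checks that the answer is supported by the retrieved context
      - type: context-faithfulness
        threshold: 0.8
```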
📄️ OpenAI vs Azure benchmark
Whether you use GPT through the OpenAI or Azure APIs, the results are largely similar, but there are some key differences.
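To see the two APIs side by side, you can list both providers in a single config; a sketch (the Azure deployment and host names are placeholders):

```yaml
providers:
  - openai:gpt-3.5-turbo
  # Replace with your own Azure OpenAI deployment and resource host
  - id: azureopenai:chat:my-gpt-35-turbo-deployment
    config:
      apiHost: my-resource.openai.azure.com
```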
📄️ Llama 2 vs GPT benchmark
This guide describes how to compare three models - Llama 2 70B, GPT-3.5, and GPT-4 - using the promptfoo CLI.
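The heart of such a comparison is the provider list; a sketch (the Replicate model path is illustrative and may require a specific version hash):

```yaml
providers:
  - replicate:meta/llama-2-70b-chat
  - openai:gpt-3.5-turbo
  - openai:gpt-4
```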
📄️ Evaluating OpenAI Assistants
OpenAI recently released an Assistants API that offers simplified handling for message state and tool usage. It also enables code interpreter and knowledge retrieval features, abstracting away some of the dirty work of implementing a RAG architecture.
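promptfoo can target an assistant by its ID; a minimal sketch (the assistant ID is a placeholder):

```yaml
providers:
  # Replace with the ID of an assistant created in your OpenAI account
  - openai:assistant:asst_abc123
```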
📄️ Evaluating Replicate Lifeboat
Replicate put together a "Lifeboat" OpenAI proxy that allows you to swap in their hosted Llama 2 70B instances. They are generously providing this API for free for a week.
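Because Lifeboat speaks the OpenAI API, you can point promptfoo's OpenAI provider at the proxy; a sketch, assuming the proxy URL below and a Replicate token in place of your OpenAI key:

```yaml
providers:
  - id: openai:chat:meta/llama-2-70b-chat
    config:
      # Route OpenAI-style requests through Replicate's proxy
      apiBaseUrl: https://openai-proxy.replicate.com/v1
```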
📄️ Uncensored Llama2 benchmark
Most LLMs go through fine-tuning that prevents them from answering prompts like "How do you make Tylenol?", "Who would win in a fist fight...?", and "Write a recipe for dangerously spicy mayo."
📄️ Mistral vs Llama2
Mistral was recently launched as the "best 7B model to date". This claim is based on a number of evals performed by its researchers.
📄️ Preventing hallucinations
LLMs have great potential, but they are prone to generating incorrect or misleading information, a phenomenon known as hallucination. Factuality and LLM "grounding" are key concerns for developers building LLM applications.
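Beyond the `factuality` assertion shown above, a grounding check can also be expressed as a model-graded rubric; a sketch (the criterion text is illustrative):

```yaml
assert:
  # Uses OpenAI's closed-QA eval prompt to grade the output
  # against a custom criterion
  - type: model-graded-closedqa
    value: Does not invent details that are absent from the provided source document
```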