index

sidebar_position: 3

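Comparison Evaluators

Comparison evaluators in LangChain measure the outputs of two different chains or LLMs against each other. They are useful for comparative analysis, such as A/B testing two language models or comparing different versions of the same model. They can also be used to generate preference scores for AI-assisted reinforcement learning.

These evaluators inherit from the PairwiseStringEvaluator class, which provides a comparison interface for two strings: typically the outputs of two different prompts or models, or of two versions of the same model. In essence, a comparison evaluator evaluates a pair of strings and returns a dictionary containing the evaluation score and other relevant details.

To create a custom comparison evaluator, inherit from the PairwiseStringEvaluator class and override the _evaluate_string_pairs method. If you need asynchronous evaluation, also override the _aevaluate_string_pairs method.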

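Here is an overview of the key methods and properties of a comparison evaluator:

evaluate_string_pairs: Evaluates a pair of output strings. Custom evaluators supply this behavior by overriding the underlying _evaluate_string_pairs method.
aevaluate_string_pairs: Asynchronously evaluates a pair of output strings. For asynchronous evaluation, override the underlying _aevaluate_string_pairs method.
requires_input: This property indicates whether the evaluator requires an input string.
requires_reference: This property indicates whether the evaluator requires a reference label.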
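The interface above can be sketched as follows. To keep the example runnable without LangChain installed, it defines a small stand-in base class, `PairwiseEvaluatorSketch`, that only mirrors the methods and properties listed here; it is not LangChain's actual class, and in real use you would subclass PairwiseStringEvaluator instead. The `ShorterAnswerEvaluator`, its "shorter is better" rule, and the convention that a score of 1 means the first string is preferred are all assumptions made for illustration.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class PairwiseEvaluatorSketch(ABC):
    """Stand-in mirroring the interface described above; NOT LangChain's class."""

    @property
    def requires_input(self) -> bool:
        # Whether an input string must be supplied (assumed default: no).
        return False

    @property
    def requires_reference(self) -> bool:
        # Whether a reference label must be supplied (assumed default: no).
        return False

    @abstractmethod
    def _evaluate_string_pairs(
        self,
        *,
        prediction: str,
        prediction_b: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> Dict[str, Any]:
        """Subclasses implement the actual comparison here."""

    async def _aevaluate_string_pairs(self, **kwargs: Any) -> Dict[str, Any]:
        # Default async path simply reuses the synchronous logic.
        return self._evaluate_string_pairs(**kwargs)

    def evaluate_string_pairs(self, **kwargs: Any) -> Dict[str, Any]:
        # Public entry point; delegates to the overridable method.
        return self._evaluate_string_pairs(**kwargs)

    async def aevaluate_string_pairs(self, **kwargs: Any) -> Dict[str, Any]:
        return await self._aevaluate_string_pairs(**kwargs)


class ShorterAnswerEvaluator(PairwiseEvaluatorSketch):
    """Toy evaluator (hypothetical): prefers whichever output is shorter."""

    def _evaluate_string_pairs(
        self,
        *,
        prediction: str,
        prediction_b: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> Dict[str, Any]:
        len_a, len_b = len(prediction), len(prediction_b)
        if len_a == len_b:
            score = 0.5  # tie: no preference
        else:
            score = 1.0 if len_a < len_b else 0.0  # 1.0 means the first string wins
        return {"score": score, "length_a": len_a, "length_b": len_b}


evaluator = ShorterAnswerEvaluator()

# Synchronous evaluation of one pair of outputs.
result = evaluator.evaluate_string_pairs(
    prediction="Paris",
    prediction_b="The capital of France is Paris.",
)
print(result)

# Asynchronous evaluation goes through the async counterpart.
async_result = asyncio.run(
    evaluator.aevaluate_string_pairs(prediction="aaaa", prediction_b="bb")
)
print(async_result)
```

Splitting the public `evaluate_string_pairs` from the overridable `_evaluate_string_pairs` lets the base class own shared behavior while subclasses supply only the comparison logic.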

LangSmith Support

The run_on_dataset evaluation method is designed to evaluate only a single model at a time, and therefore does not support these evaluators.
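Since run_on_dataset handles one model at a time, one possible workaround for an A/B comparison is to collect each model's outputs yourself and loop over the pairs. The sketch below is a plain-Python illustration under stated assumptions: `prefer_shorter` is a hypothetical toy comparison standing in for a real comparison evaluator, `win_rate` is a helper invented for this example, and a score of 1 is assumed to mean the first model's output is preferred.

```python
from typing import Callable, Dict, List


def prefer_shorter(prediction: str, prediction_b: str) -> Dict[str, float]:
    """Hypothetical toy comparison: the shorter output is preferred."""
    if len(prediction) == len(prediction_b):
        return {"score": 0.5}  # tie
    return {"score": 1.0 if len(prediction) < len(prediction_b) else 0.0}


def win_rate(
    outputs_a: List[str],
    outputs_b: List[str],
    compare: Callable[[str, str], Dict[str, float]],
) -> float:
    """Average preference score for model A over paired outputs."""
    if len(outputs_a) != len(outputs_b):
        raise ValueError("Both models must produce one output per example.")
    if not outputs_a:
        raise ValueError("At least one pair of outputs is required.")
    scores = [compare(a, b)["score"] for a, b in zip(outputs_a, outputs_b)]
    return sum(scores) / len(scores)


# Outputs from two hypothetical models on the same three examples.
model_a_outputs = ["4", "Paris is the capital of France.", "ok"]
model_b_outputs = ["The answer is 4.", "Paris", "ok"]

rate = win_rate(model_a_outputs, model_b_outputs, prefer_shorter)
print(f"Model A preference score: {rate:.2f}")
```

The averaged score can serve as a simple preference signal between the two models; a real comparison would replace `prefer_shorter` with an evaluator suited to the task.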

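Detailed information about creating custom evaluators and the available built-in comparison evaluators is provided in the following sections.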

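📄️ String Evaluators

A string evaluator is a LangChain component that assesses a language model's performance by comparing its generated output (the prediction) against a reference string or an input. This comparison is a key step in evaluating language models, providing a measure of the accuracy or quality of the generated text.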

📄️ Agent Benchmarking: Search + Calculator

Here we go over how to benchmark performance of an agent on tasks where it has access to a calculator and a search tool.

📄️ Agent VectorDB Question Answering Benchmarking

Here we go over how to benchmark performance on a question answering task using an agent to route between multiple vector databases.

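📄️ Benchmarking Template

This is an example notebook that can be used to create a benchmarking notebook for a task of your choice. Evaluation is very hard, so we greatly welcome any contributions that make it easier for people to experiment.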

📄️ Data Augmented Question Answering

This notebook uses some generic prompts/language models to evaluate a question answering system that uses other sources of data besides what is in the model. For example, this can be used to evaluate a question answering system over your proprietary data.

📄️ Generic Agent Evaluation

Good evaluation is key for quickly iterating on your agent's prompts and tools. Here we provide an example of how to use the TrajectoryEvalChain to evaluate your agent.

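📄️ Using Hugging Face Datasets

This example shows how to use Hugging Face datasets to evaluate models. Specifically, we show how to load examples from the Hugging Face datasets package for evaluating a model.

📄️ LLM Math

Evaluating chains that know how to do math.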

📄️ Evaluating an OpenAPI Chain

This notebook goes over ways to semantically evaluate an OpenAPI Chain, which calls an endpoint defined by the OpenAPI specification using purely natural language.

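📄️ Question Answering Benchmarking: Paul Graham Essay

Here we go over how to benchmark performance on a question answering task over a Paul Graham essay.

📄️ Question Answering Benchmarking: State of the Union Address

Here we go over how to benchmark performance on a question answering task over a State of the Union address.

📄️ QA Generation

This notebook shows how to use the QAGenerationChain to generate question-answer pairs over a specific document.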

📄️ Question Answering

This notebook covers how to evaluate generic question answering problems. This is a situation where you have an example containing a question and its corresponding ground truth answer, and you want to measure how well the language model does at answering those questions.

📄️ SQL Question Answering Benchmarking: Chinook

Here we go over how to benchmark performance on a question answering task over a SQL database.
