This article is translated and adapted from: Vector stores and retrievers
https://python.langchain.com/v0.2/docs/tutorials/retrievers/

1. Overview

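This tutorial will familiarize you with LangChain's vector store and retriever abstractions.
These abstractions are designed to support retrieval of data, from (vector) databases and other sources, for integration with LLM workflows.
They are important for applications that fetch data to be reasoned over as part of model inference, as in retrieval-augmented generation, or RAG (see the RAG tutorial).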

Concepts

This guide focuses on retrieval of text data. We will cover the following concepts:

Documents;
Vector stores;
Retrievers.

For project setup, see Part 2 of the previous article.

2. Documents

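LangChain implements a Document abstraction, which is intended to represent a unit of text and its associated metadata. It has two attributes:

page_content: a string representing the content;
metadata: a dict containing arbitrary metadata.

The metadata attribute can capture information about the source of the document, its relationship to other documents, and other details.
Note that an individual Document object often represents a chunk of a larger document.

Let's generate some sample documents: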

from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Goldfish are popular pets for beginners, requiring relatively simple care.",
        metadata={"source": "fish-pets-doc"},
    ),
    Document(
        page_content="Parrots are intelligent birds capable of mimicking human speech.",
        metadata={"source": "bird-pets-doc"},
    ),
    Document(
        page_content="Rabbits are social animals that need plenty of space to hop around.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

API reference:

Document

Here we have generated five documents, whose metadata indicates three distinct "sources".

3. Vector stores

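Vector search is a common way to store and search over unstructured data (such as unstructured text).
The idea is to store numeric vectors that are associated with the text.
Given a query, we can embed it as a vector of the same dimension and use a vector similarity metric to identify related data in the store.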
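
To make the idea concrete, here is a minimal pure-Python sketch of similarity search over stored vectors. It is an illustration only, not how any particular vector store is implemented: the three-dimensional "embeddings" and the names `store` and `search` are hypothetical, and cosine similarity is just one common choice of metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings; real embedding models
# produce vectors with hundreds or thousands of dimensions.
store = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.7, 0.3, 0.1],
    "goldfish": [0.1, 0.9, 0.2],
}


def search(query_vector, k=2):
    """Return the k stored keys most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda name: cosine_similarity(query_vector, store[name]),
        reverse=True,
    )
    return ranked[:k]


print(search([1.0, 0.0, 0.0]))  # ['cats', 'dogs']
```

The query vector must have the same dimension as the stored vectors, which is why the same embedding model is used for both documents and queries.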

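LangChain VectorStore objects contain methods for adding text and Document objects to the store, and for querying them using various similarity metrics.
They are often initialized with an embedding model, which determines how text data is translated into numeric vectors.

LangChain includes a suite of integrations with different vector store technologies.
Some vector stores are hosted by a provider (e.g., various cloud providers) and require specific credentials to use; some (such as Postgres) run in separate infrastructure that can be run locally or via a third party; others can run in memory for lightweight workloads.
Here we will demonstrate usage of LangChain VectorStores using Chroma, which includes an in-memory implementation.

To instantiate a vector store, we usually need to provide an embedding model that specifies how text should be converted into a numeric vector.

Here we will use OpenAI embeddings.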

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(
    documents,
    embedding=OpenAIEmbeddings(),
)

API reference:

OpenAIEmbeddings

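Calling .from_documents here adds the documents to the vector store.
VectorStore implements methods for adding documents that can also be called after the object is instantiated.
Most implementations will allow you to connect to an existing vector store, e.g., by providing a client, index name, or other information.
See the documentation for a specific integration for more detail.

Once we have instantiated a VectorStore that contains documents, we can query it. VectorStore includes methods for querying:

synchronously and asynchronously;
by string query and by vector;
with and without returning similarity scores;
by similarity and by maximum marginal relevance (to balance similarity to the query with diversity in the retrieved results).

The methods will generally include a list of Document objects in their outputs.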
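
The trade-off behind maximum marginal relevance can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not LangChain's or Chroma's implementation: the two-dimensional vectors are made up, and the parameter name `lambda_mult` is chosen here for the relevance/diversity weight.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy.

    lambda_mult=1.0 ranks by relevance only; lower values penalize
    candidates that resemble documents already selected.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical vectors: documents 0 and 1 are near-duplicates, document 2 differs.
docs = [[1.0, 0.0], [0.99, 0.14], [0.6, 0.8]]
query = [1.0, 0.2]

print(mmr(query, docs, k=2, lambda_mult=1.0))  # [1, 0]
print(mmr(query, docs, k=2, lambda_mult=0.5))  # [1, 2]
```

With relevance alone, both near-duplicates are returned; once redundancy is penalized, the second slot goes to the less similar but more diverse document.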

Examples

Return documents based on similarity to a string query:

vectorstore.similarity_search("cat")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

Async query:

await vectorstore.asimilarity_search("cat")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

Return scores:

# Note that providers implement different scores; Chroma here
# returns a distance metric that should vary inversely with
# similarity.

vectorstore.similarity_search_with_score("cat")

[(Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
  0.3751849830150604),
 (Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
  0.48316916823387146),
 (Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
  0.49601367115974426),
 (Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'}),
  0.4972994923591614)]

Return documents based on similarity to an embedded query:

embedding = OpenAIEmbeddings().embed_query("cat")

vectorstore.similarity_search_by_vector(embedding)

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

Learn more:

API reference
How-to guide
Integration-specific docs

4. Retrievers

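LangChain VectorStore objects do not subclass Runnable, and so cannot immediately be integrated into LangChain Expression Language (LCEL) chains.

LangChain Retrievers are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous invoke and batch operations) and are designed to be incorporated into LCEL chains.

We can create a simple version of this ourselves, without subclassing Retriever.
If we choose which method we wish to use to retrieve documents, we can easily create a runnable.
Below we will build one around the similarity_search method: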

from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda

retriever = RunnableLambda(vectorstore.similarity_search).bind(k=1)  # select top result

retriever.batch(["cat", "shark"])

API reference:

Document
RunnableLambda

[[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})],
 [Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'})]]

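Vector stores implement an as_retriever method that generates a Retriever, specifically a VectorStoreRetriever.
These retrievers include specific search_type and search_kwargs attributes that identify which methods of the underlying vector store to call and how to parameterize them.
For instance, we can replicate the above with the following: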

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(["cat", "shark"])

[[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})],
 [Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'})]]

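VectorStoreRetriever supports the search types "similarity" (default), "mmr" (maximum marginal relevance, described above), and "similarity_score_threshold".
We can use the latter to threshold the documents output by the retriever by similarity score.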
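
Conceptually, thresholding just drops results whose score falls below a cutoff. The sketch below shows that step in plain Python; the scores are hypothetical relevance values in [0, 1] (higher meaning more relevant), and `filter_by_score` is an illustrative helper rather than a LangChain function. With a real VectorStoreRetriever, the cutoff is supplied as `score_threshold` inside `search_kwargs`.

```python
def filter_by_score(scored_docs, score_threshold):
    """Keep (text, score) pairs whose relevance score meets the threshold."""
    return [(text, score) for text, score in scored_docs if score >= score_threshold]


# Hypothetical relevance scores, invented for illustration only.
scored = [
    ("Cats are independent pets that often enjoy their own space.", 0.82),
    ("Dogs are great companions, known for their loyalty and friendliness.", 0.61),
    ("Parrots are intelligent birds capable of mimicking human speech.", 0.35),
]

kept = filter_by_score(scored, score_threshold=0.5)
print(len(kept))  # 2
```

Unlike a fixed k, a threshold can return fewer results, or none at all, when nothing in the store is sufficiently relevant to the query.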

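Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications, which combine a given question with retrieved context into a prompt for an LLM.
Below we show a minimal example.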

pip install -qU langchain-openai

import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

message = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages([("human", message)])

rag_chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | llm

API reference:

ChatPromptTemplate
RunnablePassthrough

response = rag_chain.invoke("tell me about cats")

print(response.content)

Cats are independent pets that often enjoy their own space.

5. Learn more

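Retrieval strategies can be rich and complex. For example:

We can infer hard rules and filters from a query (e.g., "using documents published after 2020");
We can return documents that are linked to the retrieved context in some way (e.g., via some document taxonomy);
We can generate multiple embeddings for each unit of context;
We can ensemble results from multiple retrievers;
We can assign weights to documents, e.g., to weigh recent documents higher.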
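
As one illustration of combining results from multiple retrievers, the sketch below merges two ranked lists with reciprocal rank fusion. This technique is swapped in here as an example and is not taken from the tutorial; the document identifiers, the two result lists, and the constant `c=60` are all assumptions for demonstration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, c=60):
    """Merge ranked lists of document ids into a single ranking.

    Each list contributes 1 / (c + rank) to a document's score, so documents
    ranked highly by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical outputs from two different retrievers for the same query.
vector_hits = ["cats", "dogs", "rabbits"]
keyword_hits = ["dogs", "parrots", "cats"]

print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# ['dogs', 'cats', 'parrots', 'rabbits']
```

Because only ranks are used, this approach does not require the retrievers' scores to be on comparable scales.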

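The retrievers section of the how-to guides covers these and other built-in retrieval strategies.
It is also straightforward to extend the BaseRetriever class to implement custom retrievers.
See the how-to guide: https://python.langchain.com/v0.2/docs/how_to/custom_retriever/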

2024-05-22 (Wed)