Lesson 4, Part 3 - Data Ingestion and Splitters - RAG Techniques: Hands-On Intelligent Assistant Development - Cupoy
Data Ingestion - Document Loaders
https://python.langchain.com/v0.2/docs/integrations/document_loaders/

This LangChain page is another one we need to be familiar with; the more often you browse it, the better you will remember exactly which file types LangChain can handle out of the box.

## Text Loader

```python
from langchain_community.document_loaders import TextLoader

loader = TextLoader('speech.txt')
text_documents = loader.load()
text_documents
```

## Reading a PDF File

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('attention.pdf')
docs = loader.load()
type(docs[0])
```

## Web Based Loader

```python
import bs4
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-title", "post-content", "post-header")
        )
    ),
)
loader.load()
```

## Arxiv

```python
from langchain_community.document_loaders import ArxivLoader

docs = ArxivLoader(query="1706.03762", load_max_docs=2).load()
len(docs)
```

## Wikipedia

```python
from langchain_community.document_loaders import WikipediaLoader

docs = WikipediaLoader(query="Generative AI", load_max_docs=2).load()
len(docs)
print(docs)
```

1. Text splitting - Recursive Character Text Splitter (RecursiveCharacterTextSplitter)

This splitter is the recommended tool for generic text. It is parameterized by a list of characters and tries them in order, splitting on each until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. The effect is to keep whole paragraphs (then sentences, then words) together as long as possible, since these are usually the most semantically related pieces of text.

- How the text is split: by the list of characters.
- How chunk size is measured: by number of characters.

---

Think of this splitter as a text "peeler": it works step by step from large to small, from paragraphs to sentences, and only "peels" down to individual words as a last resort. When you are processing text, it is the Swiss Army knife at hand, flexible enough for every situation. Simple, practical, quietly clever - an indispensable little helper!

How do we recursively split text by characters?

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
final_documents = text_splitter.split_documents(docs)
print(final_documents[0])
print(final_documents[1])
```

## Text Loader

```python
from langchain_community.document_loaders import TextLoader

loader = TextLoader('speech.txt')
docs = loader.load()
type(docs[0])  # fixed: the original read type(text[0]), but `text` is not defined at this point
```

2. How to split text by character - CharacterTextSplitter

This is the simplest method. It splits on a specified character sequence, "\n\n" by default, and measures chunk length by number of characters.

1. How the text is split: by a single separator character.
2. How chunk size is measured: by number of characters.

---

This is like a pair of "splitting scissors": pick one symbol and the text is obediently cut into chunks exactly where you asked, with size measured purely in characters. Simple and clear - the "quick blade" of the text-splitting world!

```python
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

loader = TextLoader('speech.txt')
docs = loader.load()

text_splitter = CharacterTextSplitter(separator="\n\n", chunk_size=100, chunk_overlap=20)
text_splitter.split_documents(docs)

speech = ""
with open("speech.txt") as f:
    speech = f.read()

text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=20)
text = text_splitter.create_documents([speech])
print(text[0])
print(text[1])
```

3. How to split text by HTML headers (HTMLHeaderTextSplitter)?

HTMLHeaderTextSplitter is a "structure-aware" splitter: it splits text at the HTML element level and adds metadata for each header relevant to a given chunk. It can return chunks element by element, or merge elements that share the same metadata, with two goals: (a) keep related text grouped together semantically, and (b) preserve the rich context encoded in the document structure. It can also be combined with other text splitters as one stage of a splitting pipeline.

---

It is the "text craftsman" of HTML pages: it not only splits the text but also intelligently "labels" each passage according to the page structure. It keeps related content together while tracking the context of every page element - like giving each passage in your text its own name tag, so everything is orderly and has a story!

```python
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits
```

The same splitter can consume a page directly from a URL:

```python
url = "https://plato.stanford.edu/entries/goedel/"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)
html_header_splits
```

4. How to split JSON data

This JSON splitter controls chunk size by traversing the JSON data depth-first and producing smaller JSON chunks. It tries to keep nested JSON objects intact, but will split them when needed to keep chunk sizes between the minimum and maximum bounds.

If a value is not nested JSON but a very large string, that string will not be split. If you need a hard limit on chunk size, combine this splitter with a recursive text splitter. There is also an optional preprocessing step that first converts lists into JSON (dict) form before splitting.

- How the text is split: by JSON values.
- How chunk size is measured: by number of characters.

---

This tool is the "deep-sea diver" of JSON: it explores the data in depth and "surfaces" with compact chunks. Its goal is to keep nested structure unbroken, but when the size does not fit it "dismantles" decisively - and it never cuts into a huge string unless you specifically ask. A true "strategist" of the JSON-splitting world!
```python
import json
import requests

json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()
json_data
```

```python
from langchain_text_splitters import RecursiveJsonSplitter

json_splitter = RecursiveJsonSplitter(max_chunk_size=300)
json_chunks = json_splitter.split_json(json_data)

for chunk in json_chunks[:3]:
    print(chunk)

# The splitter can also output documents
docs = json_splitter.create_documents(texts=[json_data])
for doc in docs[:3]:
    print(doc)

texts = json_splitter.split_text(json_data)
print(texts[0])
print(texts[1])
```

5. Advanced supplement - but an important technique.

```python
!pip install llama-index
!pip install openai
```

LlamaIndex is a data framework for applications built on large language models (LLMs). It supports ingesting many types of external data sources, building retrieval-augmented generation (RAG) systems, and abstracting integrations with other LLMs down to a few lines of code. It also provides a variety of techniques for getting more accurate output from an LLM.

By default it uses OpenAI models: gpt-3.5-turbo for text generation and text-embedding-ada-002 for retrieval and embeddings. To use these models, you must create a free account on the OpenAI platform and obtain an OpenAI API key; see the OpenAI API documentation for how to get one.

```python
import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
```

Now that our packages are installed, we will build a RAG system that answers questions based on the Wikipedia pages of Emma Stone, Ryan Gosling, and La La Land. First we need the wikipedia library to extract the Wikipedia pages:

```python
!pip install wikipedia
```

Then we can easily download the three Wikipedia pages into a `documents` list (for example with LlamaIndex's WikipediaReader). Once the data is extracted, we split the documents into chunks of size 256 with no overlap. Later, these chunks are converted into numerical vectors by an embedding model and indexed in a vector store.

```python
# NOTE: imports added here for completeness; the exact paths depend on your
# llama-index version (this is the legacy API built around ServiceContext)
from llama_index.core import ServiceContext, VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding, OpenAIEmbeddingModelType

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

# Initialize the gpt3.5 model
gpt3 = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct", api_key=OPENAI_API_KEY)

# Initialize the embedding model
embed_model = OpenAIEmbedding(model=OpenAIEmbeddingModelType.TEXT_EMBED_ADA_002, api_key=OPENAI_API_KEY)

# Transform chunks into numerical vectors using the embedding model
service_context_gpt3 = ServiceContext.from_defaults(llm=gpt3, chunk_size=256, chunk_overlap=0, embed_model=embed_model)

index = VectorStoreIndex.from_documents(documents, service_context=service_context_gpt3)
retriever = index.as_retriever(similarity_top_k=3)
```

To reduce the risk of hallucination, we use the PromptTemplate module to make sure the LLM's answers are based only on the provided context.

```python
from llama_index.core.prompts import PromptTemplate

# Build a prompt template to only provide answers based on the loaded documents
template = (
    "We have provided context information below. \n"
    "---------------------\n"
    "{context_str}"
    "\n---------------------\n"
    "Given this information, please answer the question: {query_str}\n"
    "Don't give an answer unless it is supported by the context above.\n"
)
qa_template = PromptTemplate(template)
```

Now that the RAG system is set up, we can test it with questions about the retrieved documents. Let's try it!

Question 1: "What is the plot of the film that led Emma Stone to win her first Academy Award?"

The first query is challenging because it requires the model to combine different pieces of information:

- the name of the film that earned Emma Stone her first Academy Award (in this case, La La Land);
- the plot of that particular film.

```python
# Create a prompt for the model
question = "What is the plot of the film that led Emma Stone to win her first Academy Award?"

# Retrieve the context from the model
contexts = retriever.retrieve(question)
context_list = [n.get_content() for n in contexts]
prompt = qa_template.format(context_str="\n\n".join(context_list), query_str=question)

# Generate the response
response = gpt3.complete(prompt)
print(str(response))
```

Output:

The plot of the film that made Emma Stone win her first Academy Award is not explicitly mentioned in the provided context.

Question 2: "Compare the families of Emma Stone and Ryan Gosling"

The second query is even more challenging than the previous one, because it requires selecting relevant chunks about the families of both actors.

```python
# Create a prompt for the model
question = "Compare the families of Emma Stone and Ryan Gosling"

# Retrieve the context from the model
contexts = retriever.retrieve(question)
context_list = [n.get_content() for n in contexts]
prompt = qa_template.format(context_str="\n\n".join(context_list), query_str=question)

# Generate the response
response = gpt3.complete(prompt)
print(str(response))
```

We receive the following output:

Based on the context provided, it is not possible to compare the families of Emma Stone and Ryan Gosling as the information focuses on their professional collaboration and experiences while working on the film "La La Land." There is no mention of their personal family backgrounds or relationships in the context provided.

As you can see, we received unsatisfactory answers in both cases. In the following sections, let's explore ways to improve the performance of this RAG system!
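One of the knobs the next section tunes is the chunk size and chunk overlap. As a rough character-level sketch of what those two parameters mean (LlamaIndex's actual splitter is token- and sentence-aware, so real chunk boundaries differ), overlapping windows work like this:

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Character-window splitting sketch: each window starts
    chunk_size - chunk_overlap characters after the previous one, so
    adjacent windows share chunk_overlap characters.
    Assumes 0 <= chunk_overlap < chunk_size."""
    stride = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), stride)]

# A 1000-character toy "document" made of repeating digits
doc = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text(doc, chunk_size=512, chunk_overlap=50)

print(len(chunks))                        # → 3 (windows start at 0, 462, 924)
print(chunks[0][-50:] == chunks[1][:50])  # → True: adjacent windows overlap
```

The overlap is what keeps a sentence straddling a chunk boundary from being lost to retrieval: it appears whole in at least one of the two neighboring chunks.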
Improving RAG performance by updating the chunk size

We can start by customizing the chunk size and chunk overlap. As noted above, documents are split into chunks with a given overlap. By default, LlamaIndex uses a chunk size of 1024 and a chunk overlap of 20; on top of these hyperparameters, the system retrieves the top 2 chunks by default.

For example, we can fix the chunk size at 512 with a chunk overlap of 50, and increase the number of top chunks retrieved:

```python
# modify default values of chunk size and chunk overlap
service_context_gpt3 = ServiceContext.from_defaults(llm=gpt3, chunk_size=512, chunk_overlap=50, embed_model=embed_model)

# build index
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context_gpt3
)

# returns the engine for the index
query_engine = index.as_query_engine(similarity_top_k=4)
```

Question 1: "What is the plot of the film that led Emma Stone to win her first Academy Award?"

```python
# generate the response
response = query_engine.query("What is the plot of the film that led Emma Stone to win her first Academy Award?")
print(response)
```

Output:

The film that made Emma Stone win her first Academy Award is a romantic musical called La La Land.

Compared with the previous answer, this is slightly better: it correctly identifies La La Land as the film that earned Emma Stone her first Academy Award, but it still cannot describe the film's plot.

Question 2: "Compare the families of Emma Stone and Ryan Gosling"

```python
# generate the response
response = query_engine.query("Compare the families of Emma Stone and Ryan Gosling")
print(response)
```

Output:

Emma Stone has expressed her close relationship with her family and mentioned being blessed with great family and people around her. She has also shared about her mother's battle with breast cancer and their celebration by getting matching tattoos. On the other hand, there is no specific information provided about Ryan Gosling's family or his personal relationships in the context.
Again, the output of the RAG pipeline improved, but it still does not capture the information about Ryan Gosling's family.

Improving RAG performance with re-ranking

As datasets grow in size and complexity, selecting the relevant information for a tailored answer to a complex query becomes critical. A family of techniques called re-ranking lets you decide which chunks of text are important: they reorder and filter the documents, ranking the most relevant ones first.

There are two main approaches to re-ranking:

- Use a re-ranking model as an alternative to the embedding model. These models take the query and the context as input and return a similarity score instead of embeddings.
- Use an LLM to capture the semantic information in the documents more effectively.

Before applying these re-ranking methods, let's evaluate the top three chunks the baseline RAG system returns for the second query. This is the output before re-ranking; each chunk is listed with its Node ID and similarity score:

```
Node ID: 9b3817fe-3a3f-4417-83d2-2e2996c8b468
Similarity: 0.8415899563985404
Text: Emily Jean "Emma" Stone (born November 6, 1988) is an American actress and producer. She is the recipient of various accolades, including two Academy Awards, two British Academy Film Awards, and two Golden Globe Awards. In 2017, she was the world's highest-paid actress and named by Time magazine as one of the 100 most influential people in the world. Born and raised in Scottsdale, Arizona, Stone began acting as a child in a theater production of The Wind in the Willows in 2000. As a teenager,...
----------------------------------------------------
Node ID: 1bef0308-8b0f-4f7e-9cd6-92ce5acf811f
Similarity: 0.831147173341674
Text: Coincidentally, Gosling turned down the Beast role in Beauty and the Beast in favor of La La Land. Chazelle subsequently decided to make his characters somewhat older, with experience in struggling to make their dreams, rather than younger newcomers just arriving in Los Angeles. Emma Stone plays Mia, an aspiring actress in Los Angeles. Stone has loved musicals since she saw Les Misérables when she was eight years old. She said "bursting into song has always been a real dream of mine", and her ...
----------------------------------------------------
Node ID: 576ae445-b12e-4d20-99b7-5e5a91ee7d74
Similarity: 0.8289486590392277
Text: Stone was named the best-dressed woman of 2012 by Vogue and was included on similar listings by Glamour in 2013 and 2015, and People in 2014. == Personal life == Stone moved from Los Angeles to Greenwich Village, New York, in 2009. In 2016, she moved back to Los Angeles. Despite significant media attention, she refuses to publicly discuss her personal life. Concerned with living a normal life, Stone has said she dislikes receiving paparazzi attention outside her home. She has expressed her ...
```

Re-ranking with FlagEmbeddingReranker

To retrieve the relevant chunks, we can use an open-source re-ranking model from Hugging Face called bge-reranker-base.

Just like the OpenAI API, using Hugging Face requires a user access token. You can create one by following the Hugging Face documentation.

```python
# NOTE: `userdata` is Google Colab's secrets helper; outside Colab,
# load the token from your own environment instead
from google.colab import userdata
import os

HF_TOKEN = userdata.get('HF_TOKEN')
os.environ['HF_TOKEN'] = HF_TOKEN
```

Before going further, we also need to install the libraries required for the re-ranking model:

```python
%pip install llama-index-postprocessor-flag-embedding-reranker
!pip install git+https://github.com/FlagOpen/FlagEmbedding.git
```

Finally, we use the bge-reranker-base model to return the most relevant chunks:

```python
# Import packages
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker
from llama_index.core.schema import QueryBundle
# NOTE: import added for the display helper used below; the path may vary by version
from llama_index.core.response.notebook_utils import display_source_node

# The query and the baseline nodes to be re-ranked (defined here for
# completeness; the original notebook produced them in an earlier cell)
query = "Compare the families of Emma Stone and Ryan Gosling"
nodes = retriever.retrieve(query)

# Re-rank chunks based on the bge-reranker-base model
reranker = FlagEmbeddingReranker(
    top_n=3,
    model="BAAI/bge-reranker-base",
)

# Return the updated chunks
query_bundle = QueryBundle(query_str=query)
ranked_nodes = reranker._postprocess_nodes(nodes, query_bundle=query_bundle)

for ranked_node in ranked_nodes:
    print('----------------------------------------------------')
    display_source_node(ranked_node, source_length=500)
```

Here are the results after re-ranking:

```
Node ID: 9b3817fe-3a3f-4417-83d2-2e2996c8b468
Similarity: 3.0143558979034424
Text: Emily Jean "Emma" Stone (born November 6, 1988) is an American actress and producer. She is the recipient of various accolades, including two Academy Awards, two British Academy Film Awards, and two Golden Globe Awards. In 2017, she was the world's highest-paid actress and named by Time magazine as one of the 100 most influential people in the world. Born and raised in Scottsdale, Arizona, Stone began acting as a child in a theater production of The Wind in the Willows in 2000. As a teenager,...
----------------------------------------------------
Node ID: 576ae445-b12e-4d20-99b7-5e5a91ee7d74
Similarity: 2.2117154598236084
Text: Stone was named the best-dressed woman of 2012 by Vogue and was included on similar listings by Glamour in 2013 and 2015, and People in 2014. == Personal life == Stone moved from Los Angeles to Greenwich Village, New York, in 2009. In 2016, she moved back to Los Angeles. Despite significant media attention, she refuses to publicly discuss her personal life. Concerned with living a normal life, Stone has said she dislikes receiving paparazzi attention outside her home. She has expressed her ...
----------------------------------------------------
Node ID: 1bef0308-8b0f-4f7e-9cd6-92ce5acf811f
Similarity: 1.6185210943222046
Text: Coincidentally, Gosling turned down the Beast role in Beauty and the Beast in favor of La La Land. Chazelle subsequently decided to make his characters somewhat older, with experience in struggling to make their dreams, rather than younger newcomers just arriving in Los Angeles. Emma Stone plays Mia, an aspiring actress in Los Angeles. Stone has loved musicals since she saw Les Misérables when she was eight years old. She said "bursting into song has always been a real dream of mine", and her ...
```

The output clearly shows that the node with ID 1bef0308-8b0f-4f7e-9cd6-92ce5acf811f moved from second position to third. It is also worth noting that the similarity scores are now much more spread out.

Now that we have applied re-ranking, let's evaluate what the RAG system's response to the original query looks like:

```python
# Initialize the query engine with re-ranking
query_engine = index.as_query_engine(
    similarity_top_k=3,
    node_postprocessors=[reranker]
)

# Print the response from the model
response = query_engine.query("Compare the families of Emma Stone and Ryan Gosling")
print(response)
```

Here is the response after applying the re-ranking model:

Both Emma Stone and Ryan Gosling have close relationships with their families. Stone has expressed her gratitude for having a great family and people around her who keep her grounded. Gosling, on the other hand, has drawn from his own experiences as an aspiring artist, indicating a connection to his personal background.

This is a significant improvement over the previous response, but it is still incomplete. The next step would be to evaluate how an LLM-based re-ranking approach helps improve RAG performance - but that is beyond the scope of this lesson. The main point was to show that the design and tuning of chunks can improve retrieval performance, and that there are many more techniques besides; if we get the chance, we will cover them in a later class - those are the more advanced RAG tricks...