Integrating Llama 2 with Hugging Face and LangChain - Generative AI from No-Code to Low-Code: A Four-Part Series on Development and Applications

1. At the start of the course, we will read Meta's impressive Llama 2 paper together: [2307.09288] Llama 2: Open Foundation and Fine-Tuned Chat Models

Abstract: In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.

Related paper page on huggingface.co: LLaMA: Open and Efficient Foundation Language Models

LangChain - Llama2Chat

This part of the notes shows how to use the Llama2Chat wrapper to augment a Llama-2 LLM so that it supports the Llama-2 chat prompt format. Several LLM implementations in LangChain can serve as an interface to Llama-2 chat models, including HuggingFaceTextGenInference, LlamaCpp, and GPT4All, to name a few.

Llama2Chat is a generic wrapper that implements BaseChatModel, so it can be used as a chat model in applications. It converts a list of chat messages into the required chat prompt format and forwards the formatted prompt as a str to the wrapped LLM.

    from langchain.chains import LLMChain
    from langchain.memory import ConversationBufferMemory
    from langchain_experimental.chat_models import Llama2Chat

For the chat application example below, we use the following chat prompt_template:

    from langchain.prompts.chat import (
        ChatPromptTemplate,
        HumanMessagePromptTemplate,
        MessagesPlaceholder,
    )
    from langchain.schema import SystemMessage

    template_messages = [
        SystemMessage(content="You are a helpful assistant."),
        # Filled in from the conversation memory on every call.
        MessagesPlaceholder(variable_name="chat_history"),
        HumanMessagePromptTemplate.from_template("{text}"),
    ]
    prompt_template = ChatPromptTemplate.from_messages(template_messages)

Chatting with Llama-2 via the HuggingFaceTextGenInference LLM

The HuggingFaceTextGenInference LLM encapsulates access to a text-generation-inference server. In the following example, the inference server serves the meta-llama/Llama-2-13b-chat-hf model. This setup works, for example, on a machine with 4 RTX 3080ti cards; adjust the --num-shard value to the number of available GPUs. The HF_API_TOKEN environment variable holds the Hugging Face API token.

Note: The server is started with a docker run command on a system that supports Docker. The command launches a container that runs a Hugging Face text generation service, using the `ghcr.io/huggingface/text-generation-inference:0.9` Docker image configured with the `meta-llama/Llama-2-13b-chat-hf` model. It is meant to be executed on a Linux system with Docker installed and NVIDIA GPU support configured. The steps are:

- Install Docker: make sure Docker is installed on your system. The installation guide is on the official Docker website.
- Install NVIDIA Docker support: if your system has an NVIDIA GPU, install `nvidia-docker` so that Docker containers can access GPU resources. See the official NVIDIA documentation for installation.
- Set the environment variable: replace `${HF_API_TOKEN}` in the command with your Hugging Face API token, which you can generate on the Hugging Face website.
- Run the command: execute it in a terminal. It starts a Docker container whose service listens on port 8080 and assigns the GPU resources to the model.

Before running the command, make sure your system meets all the prerequisites: a properly installed and configured Docker environment and, if you use a GPU, correctly installed NVIDIA drivers and `nvidia-docker`. Also make sure the `~/.cache/huggingface/hub` directory exists on your system, or change that path as needed, so that the container can mount and access the data volume.

    !pip3 install text-generation

Create a HuggingFaceTextGenInference instance that connects to the local inference server and wrap it in Llama2Chat.

    from langchain_community.llms import HuggingFaceTextGenInference

    llm = HuggingFaceTextGenInference(
        inference_server_url="http://127.0.0.1:8080/",
        max_new_tokens=512,
        top_k=50,
        temperature=0.1,
        repetition_penalty=1.03,
    )
    model = Llama2Chat(llm=llm)

You can then use the chat model together with prompt_template and conversation memory in an LLMChain.

    # memory_key must match the MessagesPlaceholder variable name above.
    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
    chain = LLMChain(llm=model, prompt=prompt_template, memory=memory)

    print(
        chain.run(
            text="What can I see in Vienna? Propose a few locations. Names only, no details."
        )
    )

Output:

    Sure, I'd be happy to help! Here are a few popular locations to consider visiting in Vienna:

    1. Schönbrunn Palace
    2. St. Stephen's Cathedral
    3. Hofburg Palace
    4. Belvedere Palace
    5. Prater Park
    6. Vienna State Opera
    7. Albertina Museum
    8. Museum of Natural History
    9. Kunsthistorisches Museum
    10. Ringstrasse

A follow-up question relies on the conversation memory:

    print(chain.run(text="Tell me more about #2."))
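To make the conversion step described for Llama2Chat more concrete, here is a minimal plain-Python sketch of how a list of chat messages could be rendered into a single Llama-2 chat prompt string. It follows the `<s>[INST] <<SYS>> ... <</SYS>> ... [/INST]` template quoted later in this article. The function name `format_llama2_prompt` and the `(role, content)` tuple representation are hypothetical choices for illustration, and this is not the wrapper's actual implementation.

```python
from typing import List, Tuple

# Markers of the Llama-2 chat prompt format.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def format_llama2_prompt(messages: List[Tuple[str, str]]) -> str:
    """Render (role, content) pairs as one Llama-2 chat prompt string.

    Roles are "system", "human" and "ai". A leading system message is
    optional; after it, human and ai messages are expected to alternate.
    """
    system = ""
    if messages and messages[0][0] == "system":
        system = B_SYS + messages[0][1] + E_SYS
        messages = messages[1:]

    prompt = ""
    for i, (role, content) in enumerate(messages):
        if role == "human":
            # Only the first user turn carries the system block.
            prefix = system if i == 0 else ""
            prompt += f"<s>{B_INST} {prefix}{content} {E_INST}"
        elif role == "ai":
            # A model answer closes the turn opened by the previous [INST].
            prompt += f" {content} </s>"
        else:
            raise ValueError(f"unexpected role: {role}")
    return prompt


if __name__ == "__main__":
    print(
        format_llama2_prompt(
            [
                ("system", "You are a helpful assistant."),
                ("human", "What can I see in Vienna?"),
                ("ai", "Schönbrunn Palace, St. Stephen's Cathedral."),
                ("human", "Tell me more about #2."),
            ]
        )
    )
```

The point of the sketch is that the wrapped LLM only ever sees one string: earlier turns are replayed inside it, which is why the memory object matters for follow-up questions.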
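2. Integrating Llama 2 with Hugging Face and LangChain: using Hugging Face (HF) 🤗

Open your Google Colab notebook and make sure the runtime type is switched to any available GPU runtime. Install the following dependencies and provide your Hugging Face access token:

    !pip install -q transformers accelerate langchain
    !huggingface-cli login

- transformers: Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models for 📝 natural language processing, 🖼️ computer vision, 🗣️ audio, and more.
- accelerate: Accelerate is a library that lets the same PyTorch code run on any distributed configuration by adding just a few lines of code.
- langchain: LangChain is a framework for developing applications powered by language models. It supports applications that are data-aware (connecting a language model to other data sources) and agentic (allowing a language model to interact with its environment).
- huggingface-cli login: the huggingface-cli tool provides several commands for interacting with the Hugging Face Hub from the command line. One of them is login, which lets users authenticate themselves on the Hub with their credentials.

3. Import the dependencies and specify the tokenizer and pipeline

Computers cannot understand text directly, so we use a tokenizer to convert text into numbers that the computer can process. A pipeline is an object that abstracts away the library's complex code and offers a simple API.

    from transformers import AutoTokenizer
    import transformers
    import torch
    import accelerate

    model = "meta-llama/Llama-2-7b-chat-hf"

    tokenizer = AutoTokenizer.from_pretrained(model)
    pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
        max_length=1000,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
    )

- trust_remote_code (bool, optional, defaults to False): whether to allow custom code defined on the Hub in its own modeling, configuration, tokenization, or even pipeline files. Set this to True only for repositories you trust and whose code you have read, because it executes code from the Hub on your local machine.
- device_map (str or Dict[str, Union[int, str, torch.device]], optional): sent directly as model_kwargs (just a simpler shortcut). When the accelerate library is present, device_map="auto" computes the most optimized device_map automatically. See "Working with large models" on huggingface.co.
- do_sample: if set to True, this parameter enables sampling-based decoding strategies such as multinomial sampling, beam-search multinomial sampling, Top-K sampling, and Top-p sampling. All of these strategies select the next token from the probability distribution over the entire vocabulary, with various strategy-specific adjustments.
- top_k (int, optional): for text generation, the number of highest-probability vocabulary tokens kept as candidates at each sampling step (Top-K sampling). With top_k=10, each next token is sampled from the 10 most likely tokens.
- num_return_sequences: the number of sequence candidates to return for each input. This option only applies to decoding strategies that support multiple candidate sequences, such as variations of beam search and sampling. Decoding strategies such as greedy search and contrastive search return a single output sequence.

4. Run the model 🔥

    sequences = pipeline(
        'Hi! I like cooking. Can you suggest some recipes?\n')
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")

Output:

    Result: Hi! I like cooking. Can you suggest some recipes?
    I'm glad you're interested in cooking! There are so many delicious recipes out there, but I'll give you a few suggestions to get you started:
    1. Chicken Parmesan: Breaded and fried chicken topped with marinara sauce and melted mozzarella cheese. Serve with pasta or a green salad...

5. Using LangChain 🦜🔗

Define the dependencies:

    from langchain.llms import HuggingFacePipeline
    from langchain.prompts import PromptTemplate  # needed for the prompt defined below
    from transformers import AutoTokenizer
    from langchain.chains import ConversationChain
    import transformers
    import torch
    import warnings
    warnings.filterwarnings('ignore')

Define the tokenizer, pipeline, and LLM:

    model = "meta-llama/Llama-2-7b-chat-hf"

    tokenizer = AutoTokenizer.from_pretrained(model)
    pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
        max_length=1000,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
    )

    # Wrap the transformers pipeline so LangChain can use it as an LLM.
    llm = HuggingFacePipeline(pipeline=pipeline, model_kwargs={'temperature': 0.7})

Define the prompt

One of the biggest advantages of open-access models is that you have full control over the system prompt in chat applications. The prompt template should match the template used during model training. For Llama-2 chat, the template looks like this:

    <s>[INST] <<SYS>>
    {{ system_prompt }}
    <</SYS>>

    {{ user_message }} [/INST]

We define the prompt template as follows. The system block is closed with <</SYS>>, and {history} and {input} use single braces so that PromptTemplate substitutes them:

    prompt_template = """<s>[INST] <<SYS>>
    You are a helpful AI Assistant
    <</SYS>>

    ### Previous Conversation:
    '''
    {history}
    '''

    {input} [/INST]
    """
    prompt = PromptTemplate(template=prompt_template, input_variables=['input', 'history'])

Define the chain

A chain is an entity that combines pieces such as prompts, memory, and different LLMs to produce the desired output. LangChain's ConversationChain is a chain for human-assistant conversations that loads its context from memory. Note: by default the entire conversation is kept in memory, so the buffer can grow very large, unlike ConversationBufferWindowMemory, which stores only the last K interactions.

    chain = ConversationChain(llm=llm, prompt=prompt)

Run the chain 🔥

    chain.run("What is the capital Of India?")

Output: "Hello! I'm happy to answer your question. The capital of India is New Delhi. Do you have any other questions?"

To keep the context buffer from growing very large, we can use ConversationBufferWindowMemory. Only the code that defines the chain needs to change. Then run the chain again!

    from langchain.memory import ConversationBufferWindowMemory

    # Keep only the last 5 interactions in the prompt's {history}.
    memory = ConversationBufferWindowMemory(k=5)
    chain = ConversationChain(
        llm=llm,
        prompt=prompt,
        memory=memory
    )

See! Now you can chat with Llama-2. This is only the beginning, though: the potential applications of open-source large language models are practically unlimited. Need these models for a specific use case? No problem, you can even fine-tune them for your particular needs.

Still skeptical about the performance of the Llama-2 models? Here are some baselines. Although Llama 2 scores almost the same as GPT-3.5 on the MMLU and GSM8K benchmarks, on the HumanEval (coding) benchmark it ranks far behind GPT-3.5 (29.9% vs. 48.1%), not to mention GPT-4, which performs more than twice as well as Llama 2 (67%). We are waiting for Llama 3 (it is coming soon, but there is no paper for me to read yet, QAQ).

GitHub - kevin801221/LLM_easyApplication: use llm model which is llama2 and gpt to make some easy applications. (github.com)

Next, I have prepared code for several apps that run locally. We will write them together in class, and the code will be made public after class:

- llama2 -- Email Generator App
- llama2 -- Invoice Extraction Bot
- llama2 -- Customer+Care+Call+Summary+Alert
- Code Review Analyst App
- models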
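Returning to the memory discussion in section 5: the difference between keeping the whole conversation and keeping only the last K interactions can be sketched in a few lines of plain Python. The `WindowMemory` class and its method names below are hypothetical and exist only for illustration. This is not how LangChain implements ConversationBufferWindowMemory, just the same idea under simplified assumptions.

```python
from collections import deque
from typing import Deque, Tuple


class WindowMemory:
    """Keeps only the last k (human, ai) exchanges of a conversation."""

    def __init__(self, k: int = 5) -> None:
        # deque(maxlen=k) silently drops the oldest exchange once k is exceeded.
        self.turns: Deque[Tuple[str, str]] = deque(maxlen=k)

    def save(self, human: str, ai: str) -> None:
        """Record one completed exchange."""
        self.turns.append((human, ai))

    def history(self) -> str:
        """Render the retained exchanges as text for a {history} slot."""
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


if __name__ == "__main__":
    memory = WindowMemory(k=2)
    memory.save("Hi", "Hello!")
    memory.save("What is the capital of India?", "New Delhi.")
    memory.save("And of Austria?", "Vienna.")
    # Only the two most recent exchanges remain; the greeting was dropped.
    print(memory.history())
```

Because the rendered history is pasted into every prompt, capping it at k exchanges keeps the prompt length bounded, at the cost of the model forgetting anything older than the window.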