


For only $250, Hugging Face's technical director teaches you how to fine-tune Llama 3 step by step
May 06, 2024
We are already familiar with open source large language models such as Llama 3 from Meta, the Mistral and Mixtral models from Mistral AI, and Jamba from AI21 Labs, which have become competitors to OpenAI.
In most cases, users need to fine-tune these open source models based on their own data to fully unleash the model's potential.
Fine-tuning a smaller large language model such as Mistral with Q-LoRA on a single GPU is not difficult, but efficiently fine-tuning a large model such as Llama 3 70B or Mixtral has remained a challenge until now.
Therefore, Philipp Schmid, technical director at Hugging Face, explains how to fine-tune Llama 3 using PyTorch FSDP and Q-LoRA, with the help of Hugging Face's TRL, Transformers, PEFT, and Datasets libraries. In addition to FSDP, the author also uses Flash Attention v2 through PyTorch SDPA, available since the PyTorch 2.2 update.
The main steps for fine-tuning are as follows:
- Set up the development environment
- Create and load the data set
- Fine-tune the large language model with PyTorch FSDP, Q-LoRA, and SDPA
- Test the model and perform inference
Please note: the experiments in this article were created and verified on NVIDIA H100 and NVIDIA A10G GPUs. The configuration files and code are optimized for 4x A10G GPUs, each with 24GB of memory. If the user has different computing power, the configuration file (yaml file) mentioned in step 3 needs to be modified accordingly.
Background on FSDP and Q-LoRA
Drawing on the collaborative project between Answer.AI, Q-LoRA creator Tim Dettmers, and Hugging Face, the author summarizes the technical support provided by Q-LoRA and PyTorch FSDP (Fully Sharded Data Parallelism).
The combination of FSDP and Q-LoRA allows users to fine-tune Llama 2 70B or Mixtral 8x7B on 2 consumer-grade GPUs (24GB). For details, please refer to the article below. Hugging Face's PEFT library plays a vital role in this.
Article address: https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html
PyTorch FSDP is a data/model parallelism technique that splits a model across GPUs, reducing memory requirements and enabling larger models to be trained more efficiently. Q-LoRA is a fine-tuning method that leverages quantization and low-rank adapters to efficiently reduce computational requirements and memory footprint.
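To make the combination concrete, here is a minimal sketch (not the article's script) of how a 4-bit quantized base model and LoRA adapters are typically wired together with bitsandbytes and peft; the model id and LoRA hyperparameters are illustrative placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization: base weights are stored in 4 bit, compute runs in bfloat16.
# bnb_4bit_quant_storage lets FSDP shard the quantized weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",   # illustrative model id
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

# LoRA: only these small low-rank matrices are trained, the 4-bit base stays frozen.
lora_config = LoraConfig(
    r=16,                            # illustrative rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of all parameters
```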
Set up the development environment
The first step is to install the Hugging Face libraries and PyTorch, including trl, transformers, and datasets. trl is a new library built on top of transformers and datasets that makes fine-tuning, RLHF, and alignment of open source large language models easier.
# Install Pytorch for FSDP and FA/SDPA
%pip install "torch==2.2.2" tensorboard

# Install Hugging Face libraries
%pip install --upgrade "transformers==4.40.0" "datasets==2.18.0" "accelerate==0.29.3" "evaluate==0.4.1" "bitsandbytes==0.43.1" "huggingface_hub==0.22.2" "trl==0.8.6" "peft==0.10.0"
Next, log in to Hugging Face to get access to the Llama 3 70B model.
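Meta-Llama-3-70B is a gated repository, so the access token has to be configured before the model can be downloaded. A minimal sketch using huggingface_hub (the token string is a placeholder):

```python
from huggingface_hub import login

# Use an access token from an account that has been granted access to
# the gated meta-llama/Meta-Llama-3-70B repository.
login(token="hf_...", add_to_git_credential=True)  # placeholder token
```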
Creating and loading the dataset
After the environment is set up, we can start creating and preparing the dataset. The fine-tuning dataset should contain samples of the tasks the user wants to solve. Read "How to fine-tune LLMs with Hugging Face in 2024" to learn more about creating the dataset.
Article address: https://www.philschmid.de/fine-tune-llms-in-2024-with-trl#3-create-and-prepare-the-dataset
The author used the HuggingFaceH4/no_robots dataset, a high-quality dataset of 10,000 instructions and samples with careful human annotation. This data can be used for supervised fine-tuning (SFT) to make language models follow human instructions better. The no_robots dataset is modeled after the human-instruction dataset described in OpenAI's InstructGPT paper and consists primarily of single-turn instructions.
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
The 10,000 samples of the no_robots dataset are divided into 9,500 training samples and 500 test samples, some of which do not contain a system message. The author used the datasets library to load the dataset, added the missing system messages, and saved everything into separate json files. The sample code looks like this:
from datasets import load_dataset

# Convert dataset to OAI messages
system_message = """You are Llama, an AI assistant created by Philipp to be helpful and honest. Your knowledge spans a wide range of topics, allowing you to engage in substantive conversations and provide analysis on complex subjects."""

def create_conversation(sample):
    if sample["messages"][0]["role"] == "system":
        return sample
    else:
        sample["messages"] = [{"role": "system", "content": system_message}] + sample["messages"]
        return sample

# Load dataset from the hub
dataset = load_dataset("HuggingFaceH4/no_robots")

# Add system message to each conversation
columns_to_remove = list(dataset["train"].features)
columns_to_remove.remove("messages")
dataset = dataset.map(create_conversation, remove_columns=columns_to_remove, batched=False)

# Filter out conversations which are corrupted with wrong turns, keep which have even number of turns after adding system message
dataset["train"] = dataset["train"].filter(lambda x: len(x["messages"][1:]) % 2 == 0)
dataset["test"] = dataset["test"].filter(lambda x: len(x["messages"][1:]) % 2 == 0)

# save datasets to disk
dataset["train"].to_json("train_dataset.json", orient="records", force_ascii=False)
dataset["test"].to_json("test_dataset.json", orient="records", force_ascii=False)
Fine-tuning the LLM with PyTorch FSDP, Q-LoRA, and SDPA
Next, the large language model is fine-tuned with PyTorch FSDP, Q-LoRA, and SDPA. Since the model is trained in a distributed setup, training needs to be launched with torchrun and a Python script.
The author wrote the run_fsdp_qlora.py script, which loads the dataset from disk, initializes the model and tokenizer, and starts training. The script uses the SFTTrainer from the trl library to fine-tune the model.
SFTTrainer makes supervised fine-tuning of open source large language models much easier to get started with. Specifically, it supports:
- Formatted datasets, including formatted multi-turn conversations and instructions (used)
- Training only on the completion, ignoring prompt-only parts (not used)
- Packing the dataset for more efficient training (used)
- Parameter-efficient fine-tuning techniques, including Q-LoRA (used)
- Initializing the model and tokenizer for conversational fine-tuning (not used, see below)
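The article does not reproduce run_fsdp_qlora.py itself; the snippet below is only a rough, simplified sketch of how an SFTTrainer run with packing and a PEFT config is typically assembled in trl 0.8.x. The LoRA hyperparameters, the plain text-formatting helper, and the TrainingArguments values are illustrative stand-ins; the real script assigns a custom chat template and reads its arguments from the yaml config shown further below.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

model_id = "meta-llama/Meta-Llama-3-70B"   # in the real script this comes from the yaml config

# Datasets written to disk in the previous step
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")
test_dataset = load_dataset("json", data_files="test_dataset.json", split="train")

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Render each conversation into a single "text" string. The real script instead assigns a
# custom Anthropic/Vicuna-style chat template to the tokenizer and uses apply_chat_template
# (see the note and sketch below); this plain helper just keeps the sketch self-contained.
def to_text(example):
    return {"text": "".join(f"\n\n{m['role'].capitalize()}: {m['content']}" for m in example["messages"])}

train_dataset = train_dataset.map(to_text, remove_columns=["messages"])
test_dataset = test_dataset.map(to_text, remove_columns=["messages"])

# 4-bit base model for Q-LoRA, with SDPA attention
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
)

peft_config = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.05,   # illustrative hyperparameters
    target_modules="all-linear", task_type="CAUSAL_LM",
)

training_args = TrainingArguments(output_dir="./llama-3-70b-hf-no-robot")  # built from the yaml in the real script

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=3072,     # matches max_seq_len in the config
    packing=True,            # pack short samples into full-length sequences
)
trainer.train()
trainer.save_model()         # with a peft_config, this saves only the adapter weights
```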
Note: the author uses a chat template similar to Anthropic/Vicuna, with "User" and "Assistant" roles rather than Llama 3's own format. This is because the special tokens in the base Llama 3 model (<|begin_of_text|> and <|reserved_special_token_XX|>) have not been trained.
This means that if you want to use these tokens in the template, they would also need to be trained and the embedding layer and lm_head updated, which creates additional memory requirements. Users with more compute can modify the LLAMA_3_CHAT_TEMPLATE variable in the run_fsdp_qlora.py script.
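The article does not print the template itself; the following is only a sketch of what an Anthropic/Vicuna-style template might look like as a Jinja string, using plain "Human:"/"Assistant:" markers instead of Llama 3's untrained special tokens. The exact LLAMA_3_CHAT_TEMPLATE in the script may differ.

```python
from transformers import AutoTokenizer

# Sketch of an Anthropic/Vicuna-style template: system text first, then Human/Assistant
# turns terminated by the EOS token, with an open Assistant turn when generating.
LLAMA_3_CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{% if message['role'] == 'system' %}{{ message['content'] }}"
    "{% elif message['role'] == 'user' %}{{ '\n\nHuman: ' + message['content'] + eos_token }}"
    "{% elif message['role'] == 'assistant' %}{{ '\n\nAssistant: ' + message['content'] + eos_token }}"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '\n\nAssistant: ' }}{% endif %}"
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")  # illustrative
tokenizer.pad_token = tokenizer.eos_token
tokenizer.chat_template = LLAMA_3_CHAT_TEMPLATE

# Render one conversation into a single prompt string
print(tokenizer.apply_chat_template(
    [{"role": "system", "content": "You are Llama, a helpful assistant."},
     {"role": "user", "content": "How long was the Revolutionary War?"}],
    tokenize=False,
    add_generation_prompt=True,
))
```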
For configuration, the author uses the new TrlParser, which allows hyperparameters to be provided in a yaml file or overridden by passing arguments explicitly on the CLI, for example --num_epochs 10. Below is the configuration file for fine-tuning Llama 3 70B on 4x A10G GPUs (4x 24GB).
%%writefile llama_3_70b_fsdp_qlora.yaml
# script parameters
model_id: "meta-llama/Meta-Llama-3-70b"  # Hugging Face model id
dataset_path: "."                        # path to dataset
max_seq_len: 3072 # 2048                 # max sequence length for model and packing of the dataset
# training parameters
output_dir: "./llama-3-70b-hf-no-robot"  # Temporary output directory for model checkpoints
report_to: "tensorboard"                 # report metrics to tensorboard
learning_rate: 0.0002                    # learning rate 2e-4
lr_scheduler_type: "constant"            # learning rate scheduler
num_train_epochs: 3                      # number of training epochs
per_device_train_batch_size: 1           # batch size per device during training
per_device_eval_batch_size: 1            # batch size for evaluation
gradient_accumulation_steps: 2           # number of steps before performing a backward/update pass
optim: adamw_torch                       # use torch adamw optimizer
logging_steps: 10                        # log every 10 steps
save_strategy: epoch                     # save checkpoint every epoch
evaluation_strategy: epoch               # evaluate every epoch
max_grad_norm: 0.3                       # max gradient norm
warmup_ratio: 0.03                       # warmup ratio
bf16: true                               # use bfloat16 precision
tf32: true                               # use tf32 precision
gradient_checkpointing: true             # use gradient checkpointing to save memory
# FSDP parameters: https://huggingface.co/docs/transformers/main/en/fsdp
fsdp: "full_shard auto_wrap offload"     # remove offload if enough GPU memory
fsdp_config:
  backward_prefetch: "backward_pre"
  forward_prefetch: "false"
  use_orig_params: "false"
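For context on how a yaml file like this is consumed inside the training script, here is a minimal sketch of the TrlParser pattern; the ScriptArguments dataclass and the import path are assumptions based on trl 0.8.x, not a copy of the article's script.

```python
from dataclasses import dataclass, field
from transformers import TrainingArguments
from trl.commands.cli_utils import TrlParser  # TrlParser location in trl 0.8.x

@dataclass
class ScriptArguments:
    # Custom yaml keys that TrainingArguments does not know about
    model_id: str = field(default=None, metadata={"help": "Hugging Face model id"})
    dataset_path: str = field(default=".", metadata={"help": "path to the dataset"})
    max_seq_len: int = field(default=3072, metadata={"help": "max sequence length"})

if __name__ == "__main__":
    parser = TrlParser((ScriptArguments, TrainingArguments))
    # Reads --config llama_3_70b_fsdp_qlora.yaml; explicit CLI flags
    # (e.g. --num_train_epochs 10) override the values from the yaml.
    script_args, training_args = parser.parse_args_and_config()
```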
Note: at the end of training, GPU memory usage increases slightly (by about 10%) due to the overhead of saving the model. Make sure there is enough GPU memory left to save the model.
To launch training, the author uses torchrun, which keeps the example flexible and easy to adapt to environments such as Amazon SageMaker or Google Cloud Vertex AI.
對于 torchrun 和 FSDP,作者需要對環(huán)境變量 ACCELERATE_USE_FSDP 和 FSDP_CPU_RAM_EFFICIENT_LOADING 進(jìn)行設(shè)置,來告訴 transformers/accelerate 使用 FSDP 并以節(jié)省內(nèi)存的方式加載模型。
Note: to disable CPU offloading, the fsdp setting needs to be changed, as shown below. This only works with GPUs that have more than 40GB of memory.
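Concretely, that means dropping offload from the fsdp line in the yaml above, for example:

```yaml
fsdp: "full_shard auto_wrap"   # no "offload": parameters stay on the GPUs
```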
Training is launched with the following command:
!ACCELERATE_USE_FSDP=1 FSDP_CPU_RAM_EFFICIENT_LOADING=1 torchrun --nproc_per_node=4 ./scripts/run_fsdp_qlora.py --config llama_3_70b_fsdp_qlora.yaml
預(yù)期內(nèi)存使用情況:
- Full fine-tuning with FSDP requires about 16 GPUs with 80GB of memory each
- FSDP + LoRA requires about 8 GPUs with 80GB of memory each
- FSDP + Q-LoRA requires about 2 GPUs with 40GB of memory each
- FSDP + Q-LoRA + CPU offloading requires 4 GPUs with 24GB of memory each, using about 22GB per GPU and 127GB of CPU RAM, at a sequence length of 3072 and a batch size of 1
On a g5.12xlarge instance, using the dataset of 10,000 samples, training Llama 3 70B with Flash Attention for 3 epochs takes 45 hours in total. At $5.67 per hour, the total cost is $255.15. That sounds expensive, but it lets you fine-tune Llama 3 70B on comparatively small GPU resources.
If training is scaled up to 4x H100 GPUs, the training time shrinks to roughly 1.25 hours. Assuming one H100 costs $5-10 per hour, the total cost would be between $25 and $50.
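For reference, the arithmetic behind these figures (the hourly prices are the article's assumptions):

```python
# g5.12xlarge (4x A10G): 45 hours at $5.67 per hour
print(45 * 5.67)        # 255.15 -> ~$255 in total

# 4x H100 for ~1.25 hours, assuming $5-10 per H100 per hour
print(4 * 1.25 * 5)     # 25.0  -> lower bound, ~$25
print(4 * 1.25 * 10)    # 50.0  -> upper bound, ~$50
```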
There is a trade-off between accessibility and performance: with access to more and better compute, training time and cost go down, but even with modest resources Llama 3 70B can be fine-tuned. With 4x A10G GPUs the model has to be offloaded to the CPU, which lowers the overall FLOPS, so cost and performance will differ.
Note: during evaluation and testing, the author noticed that about 40 max steps (packing 80 samples into sequences of length 3k) were enough to get initial results. Training for 40 steps takes about 1 hour and costs roughly $5.
Optional step: merge the LoRA adapter into the original model
When using Q-LoRA, only the adapters are trained; the full model is not modified. This means that when the model is saved during training, only the adapter weights are stored, not the full model.
If users want to save the full model, so that it is easier to use with text generation inference servers, they can merge the adapter weights into the model weights with the merge_and_unload method and then save the model with save_pretrained. This produces a standard model that can be used for inference.
Note: this step requires more than 192GB of CPU memory.
#### COMMENT IN TO MERGE PEFT AND BASE MODEL ####
# import torch
# from peft import AutoPeftModelForCausalLM

# # Load PEFT model on CPU
# model = AutoPeftModelForCausalLM.from_pretrained(
#     args.output_dir,
#     torch_dtype=torch.float16,
#     low_cpu_mem_usage=True,
# )
# # Merge LoRA and base model and save
# merged_model = model.merge_and_unload()
# merged_model.save_pretrained(args.output_dir, safe_serialization=True, max_shard_size="2GB")
Testing the model and running inference
訓(xùn)練完成后,我們要對模型進(jìn)行評估和測試。作者從原始數(shù)據(jù)集中加載不同的樣本,并手動評估模型。評估生成式人工智能模型并非易事,因?yàn)橐粋€輸入可能有多個正確的輸出。閱讀《評估 LLMs 和 RAG,一個使用 Langchain 和 Hugging Face 的實(shí)用案例》可以了解到關(guān)于評估生成模型的相關(guān)內(nèi)容。
Article address: https://www.philschmid.de/evaluate-llm
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

peft_model_id = "./llama-3-70b-hf-no-robot"

# Load Model with PEFT adapter
model = AutoPeftModelForCausalLM.from_pretrained(
    peft_model_id,
    torch_dtype=torch.float16,
    quantization_config={"load_in_4bit": True},
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
Next, load the test dataset and try to generate a response for an instruction.
from datasets import load_dataset
from random import randint

# Load our test dataset
eval_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
rand_idx = randint(0, len(eval_dataset) - 1)  # randint is inclusive on both ends
messages = eval_dataset[rand_idx]["messages"][:2]

# Test on sample
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]

print(f"**Query:**\n{eval_dataset[rand_idx]['messages'][1]['content']}\n")
print(f"**Original Answer:**\n{eval_dataset[rand_idx]['messages'][2]['content']}\n")
print(f"**Generated Answer:**\n{tokenizer.decode(response, skip_special_tokens=True)}")

# **Query:**
# How long was the Revolutionary War?
# **Original Answer:**
# The American Revolutionary War lasted just over seven years. The war started on April 19, 1775, and ended on September 3, 1783.
# **Generated Answer:**
# The Revolutionary War, also known as the American Revolution, was an 18th-century war fought between the Kingdom of Great Britain and the Thirteen Colonies. The war lasted from 1775 to 1783.
That covers the main workflow. Now it is your turn to start from step one and try it yourself.