LLaMA: an open-source large language model. Model sizes: 7B (7 billion parameters), 13B, 33B, etc.
Zhihu: https://zhuanlan.zhihu.com/p/638035946
GitHub: https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E6%89%8B%E5%8A%A8%E6%A8%A1%E5%9E%8B%E5%90%88%E5%B9%B6%E4%B8%8E%E8%BD%AC%E6%8D%A2
bilibili: https://www.bilibili.com/video/BV1xz4y18749/?spm_id_from=333.788&vd_source=0f73a86e66095f29087c390106c3e5ad

1. Deployment

1.1 Environment setup

conda env: python=3.10
Install PyTorch, pinned to version 1.13.1 with a matching CUDA version: conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
Install the dependencies: pip install transformers==4.28.1 sentencepiece==0.1.97 google protobuf deepspeed -i https://pypi.tuna.tsinghua.edu.cn/simple
Install peft, HuggingFace's fine-tuning library, from source (mind the pinned commit; if you get the error fatal: … is not a repository, check the server's network connection): pip install git+https://github.com/huggingface/peft.git@13e53fc

Finally, make sure these versions are installed:

pip install torch==1.13.1
pip install transformers==4.28.1
pip install sentencepiece==0.1.97
pip install peft==0.3.0
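
A quick way to confirm the pins above is to compare the installed versions against them. The following is a minimal sketch using only the standard library; the helper name version_mismatches is made up for illustration, and a peft build installed from a git commit may report a development version string that does not equal 0.3.0 exactly.

```python
from importlib import metadata

# Version pins taken from the list above.
EXPECTED = {
    "torch": "1.13.1",
    "transformers": "4.28.1",
    "sentencepiece": "0.1.97",
    "peft": "0.3.0",
}


def version_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, found)}; found is None when not installed."""
    problems = {}
    for name, want in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Compare the public version only, so "1.13.1+cu116" still matches "1.13.1".
        if found is None or found.split("+")[0] != want:
            problems[name] = (want, found)
    return problems


if __name__ == "__main__":
    for name, (want, found) in version_mismatches(EXPECTED).items():
        print(f"{name}: expected {want}, found {found}")
```

If nothing is printed, every package matches its pin.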

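1.2 Downloading the original model files

Place the files under the project directory llama_models as follows: the model weights stay in the folder for the model version you use (e.g. 7B), while tokenizer_checklist.chk and tokenizer.model, which ship inside that folder, are moved out to the llama_models root.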
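
Before converting, it can help to check that the files are where the conversion expects them. The sketch below assumes the layout described above (two tokenizer files at the root, one sub-folder per model size); the function name missing_entries is hypothetical, and the check does not look at the individual weight files inside the model folder.

```python
from pathlib import Path

# Assumed layout, following the description above:
#   llama_models/
#     tokenizer.model
#     tokenizer_checklist.chk
#     7B/                      <- weight files of the chosen model size
ROOT_FILES = ("tokenizer.model", "tokenizer_checklist.chk")


def missing_entries(root, model_size="7B"):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    missing = [str(root / name) for name in ROOT_FILES if not (root / name).is_file()]
    if not (root / model_size).is_dir():
        missing.append(str(root / model_size))
    return missing
```

Calling missing_entries("/home/winkids/mbp_python/llama_models") should return an empty list once everything is in place.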

1.3 Converting the original model to HuggingFace format

Download the transformers source: https://github.com/huggingface/transformers
Run:

python path-to-transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py     \
--input_dir /home/winkids/mbp_python/llama_models     \
--model_size 7B     \
--output_dir /home/winkids/mbp_python/llama_models/7B_hf

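path-to-transformers: where the transformers source was downloaded
input_dir: the llama_models root path
model_size: 7B, 13B, etc.
output_dir: a new folder created under the llama_models root. I use the 7B model, so I created a folder 7B_hf to hold the converted model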
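
If you prefer to drive the conversion from Python, for example to repeat it for several model sizes, the same command can be assembled as an argument list. This is only a sketch: build_convert_command is a hypothetical helper, and path-to-transformers is the same placeholder used in the command above.

```python
import sys
from pathlib import Path


def build_convert_command(transformers_dir, input_dir, model_size, output_dir):
    """Assemble the argument list for convert_llama_weights_to_hf.py."""
    script = (Path(transformers_dir) / "src" / "transformers" / "models"
              / "llama" / "convert_llama_weights_to_hf.py")
    return [
        sys.executable, str(script),
        "--input_dir", str(input_dir),
        "--model_size", model_size,
        "--output_dir", str(output_dir),
    ]


if __name__ == "__main__":
    root = "/home/winkids/mbp_python/llama_models"
    cmd = build_convert_command("path-to-transformers", root, "7B", root + "/7B_hf")
    # Print the command for inspection; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(" ".join(cmd))
```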

On success it prints:

Fetching all parameters from the checkpoint at /home/winkids/mbp_python/llama_models/7B.
Loading the checkpoint in a Llama model.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:08<00:00,  3.91it/s]
Saving in the Transformers format.

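1.4 Merging the LoRA weights into full model weights (with vocabulary merging)

LoRA (Low-Rank Adaptation of Large Language Models) is a technique proposed by Microsoft researchers to make fine-tuning large language models affordable. Fine-tuning a model with billions of parameters, such as GPT-3, for a specific task or domain is very expensive. LoRA freezes the pretrained weights and injects trainable layers (rank-decomposition matrices) into each Transformer block. Because most of the weights no longer need gradients, the number of trainable parameters and the GPU memory requirement drop sharply. The researchers found that, by focusing on the Transformer attention blocks, LoRA matches the quality of full fine-tuning while being faster and cheaper to compute.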
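
To make the idea concrete: for a frozen weight W0 of shape d x k, LoRA trains two small matrices B (d x r) and A (r x k), and merging produces W0 + (alpha / r) * B @ A. The sketch below shows that arithmetic on toy matrices in plain Python. It is not the merge script's code, and the sizes used in the parameter count (a 4096 x 4096 matrix, rank 8) are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w0, b, a, alpha, r):
    """Return W0 + (alpha / r) * B @ A, the merged full weight."""
    scale = alpha / r
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


def lora_param_count(d, k, r):
    """Trainable parameters of a rank-r update to a d x k weight."""
    return r * (d + k)


# Toy example: a 2 x 2 frozen weight with a rank-1 update.
w0 = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]   # d x r
a = [[0.5, 0.5]]     # r x k
merged = merge_lora(w0, b, a, alpha=2, r=1)
print(merged)  # [[2.0, 1.0], [2.0, 3.0]]

# Illustrative sizes: full matrix vs. rank-8 update.
print(4096 * 4096, lora_param_count(4096, 4096, 8))  # 16777216 65536
```

With these illustrative sizes the low-rank update trains well under 1% of the parameters of the full matrix, which is where the memory savings come from.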

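Chinese-LLaMA-Alpaca trained a 20K-token Chinese vocabulary with sentencepiece on general Chinese corpora and merged it with the 32K-token vocabulary of the original LLaMA. After removing duplicate tokens, the final Chinese LLaMA vocabulary has 49953 tokens. This step extends the original LLaMA model (HF format) with the Chinese vocabulary, merges the LoRA weights, and produces the full model weights.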
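
The vocabulary merge can be pictured as appending to the base vocabulary only those tokens it does not already contain. The sketch below shows that on toy token lists. It is an illustration under that assumption, not the project's actual merging code, and the tokens are made up.

```python
def merge_vocab(base_tokens, extra_tokens):
    """Append tokens from extra_tokens that base_tokens lacks, keeping order."""
    seen = set(base_tokens)
    merged = list(base_tokens)
    for tok in extra_tokens:
        if tok not in seen:
            seen.add(tok)
            merged.append(tok)
    return merged


base = ["<unk>", "<s>", "</s>", "▁the", "中"]
extra = ["中", "你好", "模型"]
merged_vocab = merge_vocab(base, extra)
# Every base token keeps its id; only "你好" and "模型" are appended.
print(len(merged_vocab))  # 7
```

Because duplicates are skipped, the merged size is smaller than the sum of the two vocabularies, which is why 32K plus 20K ends up at 49953 rather than 52K.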

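First download the LoRA weight files; I downloaded chinese_llama_plus_lora_7b and chinese_alpaca_pro_lora_7b.
Also clone the repository: https://github.com/ymcui/Chinese-LLaMA-Alpaca
From the repository's notes: for a ChatGPT-like chat experience, use the Alpaca model rather than the LLaMA model.

To convert the original LLaMA model to HuggingFace format, put the original tokenizer.model in the directory given by --input_dir and the remaining files under ${input_dir}/${model_size}; after the conversion script runs, --output_dir holds the converted HF weights. The merge uses scripts from the repository cloned above. Assuming it was cloned to DOWNLOAD, run: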

# Merge a single LoRA: pass one LoRA path. Set --output_type to pth or huggingface.
python DOWNLOAD/scripts/merge_llama_with_chinese_lora.py \
    --base_model /home/winkids/mbp_python/llama_models/7B_hf \
    --lora_model /home/winkids/mbp_python/llama_models/chinese_llama_plus_lora_7b \
    --output_type pth \
    --output_dir /home/winkids/mbp_python/llama_models/7B_full_model

# Merge multiple LoRAs: pass two LoRA paths. Set --output_type to pth or huggingface.
# Important: the order of the two LoRA models matters and must not be swapped. LLaMA-Plus-LoRA comes first, then Alpaca-Plus/Pro-LoRA, with exactly one comma between the two paths.
python DOWNLOAD/scripts/merge_llama_with_chinese_lora.py \
    --base_model /home/winkids/mbp_python/llama_models/7B_hf \
    --lora_model /home/winkids/mbp_python/llama_models/chinese_llama_plus_lora_7b,/home/winkids/mbp_python/llama_models/chinese_alpaca_pro_lora_7b \
    --output_type pth \
    --output_dir /home/winkids/mbp_python/llama_models/7B_full_model

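--base_model: directory holding the HF-format LLaMA weights and config files (produced in section 1.3)
--lora_model: directory where the Chinese LLaMA/Alpaca LoRA archive was extracted
--output_type: output format, either pth (PyTorch .pth weights) or huggingface (HuggingFace .bin weights). Defaults to pth
--output_dir: directory where the full model weights are saved. Defaults to ./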
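
Since the order and the single-comma separator of --lora_model are easy to get wrong by hand, the value can be built programmatically. A minimal sketch follows; build_lora_model_arg is a hypothetical helper, not part of the repository.

```python
def build_lora_model_arg(llama_lora_dir, alpaca_lora_dir=None):
    """Return the value for --lora_model.

    The LLaMA-Plus LoRA comes first and the Alpaca-Plus/Pro LoRA second,
    joined by exactly one comma with no surrounding spaces.
    """
    paths = [str(llama_lora_dir).strip()]
    if alpaca_lora_dir is not None:
        paths.append(str(alpaca_lora_dir).strip())
    for p in paths:
        # A comma inside a path would be read as a separator by the merge script.
        if not p or "," in p:
            raise ValueError(f"unusable LoRA path: {p!r}")
    return ",".join(paths)


root = "/home/winkids/mbp_python/llama_models"
lora_arg = build_lora_model_arg(root + "/chinese_llama_plus_lora_7b",
                                root + "/chinese_alpaca_pro_lora_7b")
print(lora_arg)
```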

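Notes:

Prefer the newer merge script, which needs far less memory: replace the script in the commands above with scripts/merge_llama_with_chinese_lora_low_mem.py; the arguments are the same.
You can output either PyTorch weights (.pth files) or HuggingFace weights (.bin files). Convert to pth first, verify the SHA256 of the merged model, then convert to HF format if needed.
.pth files can be used for: quantization and deployment with llama.cpp
.bin files can be used for: inference with Transformers, building a UI with text-generation-webui

Run. Note that if you want to use the model through an API, only a huggingface-format merge produces the config.json file.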

python /home/winkids/mbp_python/llama_models/Chinese-LLaMA-Alpaca/scripts/merge_llama_with_chinese_lora_low_mem.py \
    --base_model /home/winkids/mbp_python/llama_models/7B_hf \
    --lora_model /home/winkids/mbp_python/llama_models/chinese_llama_plus_lora_7b,/home/winkids/mbp_python/llama_models/chinese_alpaca_pro_lora_7b \
    --output_type huggingface \
    --output_dir /home/winkids/mbp_python/llama_models/7B_full_model_hf

Output:
Base model: /home/winkids/mbp_python/llama_models/7B_hf
LoRA model(s) ['/home/winkids/mbp_python/llama_models/chinese_llama_plus_lora_7b', '/home/winkids/mbp_python/llama_models/chinese_alpaca_pro_lora_7b']:
Loading /home/winkids/mbp_python/llama_models/chinese_llama_plus_lora_7b
Loading /home/winkids/mbp_python/llama_models/chinese_alpaca_pro_lora_7b
Loading ckpt pytorch_model-00001-of-00002.bin
Merging...
Saving ckpt pytorch_model-00001-of-00002.bin to /home/winkids/mbp_python/llama_models/7B_full_model_hf in HF format...
Loading ckpt pytorch_model-00002-of-00002.bin
Merging...
Saving ckpt pytorch_model-00002-of-00002.bin to /home/winkids/mbp_python/llama_models/7B_full_model_hf in HF format...
Saving tokenizer
Saving config.json
Saving generation_config.json
Saving pytorch_model.bin.index.json
Done.
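
After a huggingface-format merge, a simple sanity check is to confirm that the files named in the log exist and that every shard referenced by the index file is present. The sketch below assumes the index JSON has a weight_map object mapping parameter names to shard file names; missing_merge_outputs is a hypothetical helper.

```python
import json
from pathlib import Path

# Files the log above reports as saved.
REQUIRED = ("config.json", "generation_config.json", "pytorch_model.bin.index.json")


def missing_merge_outputs(output_dir):
    """Return the names of expected files that are absent from output_dir."""
    out = Path(output_dir)
    missing = [name for name in REQUIRED if not (out / name).is_file()]
    index_path = out / "pytorch_model.bin.index.json"
    if index_path.is_file():
        index = json.loads(index_path.read_text(encoding="utf-8"))
        # Assumption: "weight_map" maps each parameter name to its shard file.
        shards = sorted(set(index.get("weight_map", {}).values()))
        missing += [s for s in shards if not (out / s).is_file()]
    return missing
```

An empty list from missing_merge_outputs("/home/winkids/mbp_python/llama_models/7B_full_model_hf") means the directory is at least structurally complete; it does not verify the weights themselves.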

1.5 SHA256 verification

# macOS
shasum -a 256 your-model-file

# Linux
sha256sum your-model-file

# Windows
certutil -hashfile your-model-file sha256

SHA256 values for the different model files (consolidated.*.pth):
https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/SHA256.md
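
The same check can be done in Python on any platform, which is handy when comparing several files against the published values. A minimal sketch using the standard hashlib module follows; the function names are made up, and the expected hash has to be copied from the SHA256.md page above.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's SHA256 with an expected hex string, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, matches("your-model-file", "<hash from SHA256.md>") returns True only when the file is bit-identical to the reference.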