How to Run the StarCoder Large Model Offline

December 20, 2023
- Large models


lomtom


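Because of the nature of our business scenario, we need to deploy a large model ourselves. The approach described in the previous article, How to Run the Hugging Face Large Model StarCoder, requires access to huggingface.co, which is unreachable from our network. To achieve a fully offline deployment, that is, one with no dependency on any external network at runtime, there are two options: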

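Using text-generation-inference

text-generation-inference is a lightweight inference service for Hugging Face large models. With it, you can deploy a model in a local environment and run inference offline, which removes the dependency on external networks.

Using transformers

The transformers library is Hugging Face's library of natural language processing models, including many large pretrained models. Once the model weights and configuration files have been downloaded and saved locally in advance, they can be loaded in an offline environment with no network access at runtime.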

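Which one to choose depends on your requirements and deployment scenario. text-generation-inference offers a simple, direct solution for offline inference, while the transformers library offers more flexibility and room for customization.

Note: this article assumes the model has already been downloaded locally. To download it, see the previous article, How to Run the Hugging Face Large Model StarCoder.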

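Using text-generation-inference

This is a Hugging Face project for deploying and serving large models. The previous article already used text-generation-inference for deployment, with the model downloaded to local disk. To make that deployment fully offline, simply add the parameter -e HF_HUB_OFFLINE=1 to the run command.

The complete command is: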

docker run --privileged=true -p 8080:80 --gpus all -v /root/huggingface/starCoder/data:/data -e HUGGING_FACE_HUB_TOKEN=YOUR-API-KEY  -e HF_HUB_OFFLINE=1  -d  ghcr.io/huggingface/text-generation-inference:1.3.3 --model-id bigcode/starcoder --max-total-tokens 8192

When the log prints the following, the model is running offline without accessing Hugging Face:

INFO text_generation_launcher: Args { model_id: "bigcode/starcoder", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 8192, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "693b5c1038c0", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
INFO download: text_generation_launcher: Starting download process.
INFO text_generation_launcher: Files are already present on the host. Skipping download.
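
Once the container is up, the service can be queried over HTTP on the mapped port 8080. The following is a minimal sketch of a Python client, not part of the original setup. It assumes the /generate route of text-generation-inference, which takes an inputs string and a parameters object. The names build_generate_request and generate are illustrative, and the parameter values are only examples.

```python
import json
import urllib.request

# Assumption: the container from the command above is reachable on localhost:8080.
TGI_URL = "http://localhost:8080/generate"


def build_generate_request(prompt: str, max_new_tokens: int = 256,
                           url: str = TGI_URL) -> urllib.request.Request:
    """Build a POST request for text-generation-inference's /generate route."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": 0.2, "top_p": 0.9},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the prompt to the running service and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt), timeout=60) as response:
        return json.loads(response.read())["generated_text"]
```

Calling generate("def fibonacci(n):") against a running container should return the completion as a string. Only the standard library is used, so the client also works on machines without internet access.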

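Using transformers

Transformers is a model library for natural language processing (NLP) developed by Hugging Face. It provides a collection of pretrained Transformer models that perform well across a range of NLP tasks. Transformers supports both PyTorch and TensorFlow, so it can be used with either deep learning framework.

TensorFlow is an open-source machine learning framework developed by Google for building and training machine learning models. It provides a rich set of tools and resources for deep learning, neural networks, and other machine learning tasks, and its performance and flexibility make it one of the preferred frameworks for training and deploying large models.

PyTorch is another deep learning framework, originally developed by Facebook and now hosted under the Linux Foundation, and it is widely used in both academia and industry. Its dynamic computation graphs make building and debugging models more flexible and intuitive, and it has an active community in research and development.

To run StarCoder offline, you can use transformers. Note that transformers depends on PyTorch or TensorFlow. You can install either one on the host and run the model there, or install it in a container. The steps below install PyTorch and transformers in a container:

Building the base image

Use tensorflow/tensorflow:2.15.0 as the base image and install PyTorch and transformers on top of it. The Dockerfile is: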

FROM tensorflow/tensorflow:2.15.0
RUN pip install torch torchvision torchaudio  -i https://pypi.tuna.tsinghua.edu.cn/simple
RUN pip install transformers -i https://pypi.tuna.tsinghua.edu.cn/simple

Building the image

Run docker build -t tensorflow/tensorflow:2.15.0-torch . to build the image.

Writing the transformers code

Edit the file with vi /root/huggingface/starCoder-1/main.py and add the following code:

from transformers import pipeline

# Hugging Face model checkpoint for code completion
CHECKPOINT = "bigcode/starcoder"
# Device configuration: 0 for GPU, -1 for CPU
DEVICE = 0

prompt = "<fim_prefix><filename>test2.js\nconst a = 'a'\n\nconsole.log(<fim_suffix>)<fim_middle>"

# Initialize code generation pipeline
code_generator = pipeline("text-generation", model=CHECKPOINT, do_sample=True, device=DEVICE)

# Generation parameters
generation_parameters = {
    "top_p": 0.9,
    "return_full_text": False,
    "num_return_sequences": 1,
    "max_new_tokens": 256,
    "temperature": 0.2,
    "repetition_penalty": 1.2
}

# Generate code completion
generated_output = code_generator(prompt, **generation_parameters)

# Print the generated output
print(generated_output)

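The first part of the code sets up the required variables.

CHECKPOINT is set to "bigcode/starcoder", the identifier of the pretrained model used for text generation.
DEVICE is set to 0, which runs the model on the first CUDA device. Setting it to -1 runs the model on the CPU.
prompt is the string used as input for text generation.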
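
The prompt in the example is a fill-in-the-middle prompt: the code before the gap follows <fim_prefix>, the code after the gap follows <fim_suffix>, and <fim_middle> marks where generation starts. As a small sketch, a helper can assemble that string from its parts. build_fim_prompt is a hypothetical name, and the layout simply mirrors the prompt used in main.py.

```python
def build_fim_prompt(prefix: str, suffix: str, filename: str = "") -> str:
    """Assemble a fill-in-the-middle prompt in the layout used by the example script."""
    # The optional <filename> tag sits right after <fim_prefix>, as in the example prompt.
    file_tag = f"<filename>{filename}\n" if filename else ""
    return f"<fim_prefix>{file_tag}{prefix}<fim_suffix>{suffix}<fim_middle>"


# Rebuilds the exact prompt string from main.py.
prompt = build_fim_prompt(
    prefix="const a = 'a'\n\nconsole.log(",
    suffix=")",
    filename="test2.js",
)
```

Building the prompt this way keeps the special tokens in one place, which helps when the prefix and suffix come from an editor buffer rather than a hard-coded string.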

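Next, the pipeline function from the transformers library creates a text-generation pipeline with the specified model and device. This pipeline generates text from the input prompt. The generation_parameters dictionary holds the parameters used for generation.

Finally, the code_generator pipeline is called with the prompt and the parameters, and the output is printed. The output is the text generated from the prompt under those parameters.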
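
The script relies on the model already being in the local cache. To make sure the libraries never attempt a network call, one option is to set the HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE environment variables before transformers is imported. The sketch below shows this under that assumption. enable_offline_mode and load_generator are illustrative names, not part of the original script.

```python
import os


def enable_offline_mode() -> None:
    """Tell the Hugging Face libraries to use only the local cache."""
    # Set these before transformers / huggingface_hub are imported,
    # since the libraries read them when they are first loaded.
    os.environ["HF_HUB_OFFLINE"] = "1"
    os.environ["TRANSFORMERS_OFFLINE"] = "1"


def load_generator(checkpoint: str = "bigcode/starcoder", device: int = 0):
    """Create the text-generation pipeline from locally cached files."""
    enable_offline_mode()
    from transformers import pipeline  # imported late so the variables above apply

    return pipeline("text-generation", model=checkpoint, do_sample=True, device=device)
```

The same effect can be had without touching the code by passing -e HF_HUB_OFFLINE=1 to docker run, as was done for text-generation-inference.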

Running the container

Run the following command to execute the code:

docker run --rm --gpus all -v /root/.cache/huggingface:/root/.cache/huggingface -v /root/huggingface/starCoder-1:/root/huggingface/starCoder-1 \
   --workdir /root/huggingface/starCoder-1 -it tensorflow/tensorflow:2.15.0-torch python main.py

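Two directories are mounted here. /root/.cache/huggingface is the directory where transformers caches and reads models by default. /root/huggingface/starCoder-1 holds the code; because the code runs inside the container, it has to be mounted into the container. The output is as follows: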

I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Loading checkpoint shards: 100%|████████████████████████████████████████████| 7/7 [01:46<00:00, 15.19s/it]
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
[{'generated_text': 'a'}]

As the output shows, StarCoder ran successfully and the completion is correct.
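
Mounting /root/.cache/huggingface only helps if the model files are really there. Before starting the container, it can be worth checking this on the host. The sketch below assumes the usual hub cache layout, <cache_root>/hub/models--<org>--<name>/snapshots/<revision>/; cached_model_dir and is_model_cached are hypothetical helper names.

```python
from pathlib import Path


def cached_model_dir(cache_root: str, repo_id: str) -> Path:
    """Return the directory where the hub cache is expected to keep a model."""
    # Assumed layout: <cache_root>/hub/models--<org>--<name>/snapshots/<revision>/
    return Path(cache_root) / "hub" / ("models--" + repo_id.replace("/", "--"))


def is_model_cached(cache_root: str, repo_id: str) -> bool:
    """True if at least one snapshot directory containing files exists for the model."""
    snapshots = cached_model_dir(cache_root, repo_id) / "snapshots"
    if not snapshots.is_dir():
        return False
    return any(any(rev.iterdir()) for rev in snapshots.iterdir() if rev.is_dir())
```

For the setup in this article, the call would be is_model_cached("/root/.cache/huggingface", "bigcode/starcoder"). The check only confirms that a snapshot exists, not that every weight shard is complete.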

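However, loading the model on every run makes responses very slow, which is impractical. To solve this, we can load the model into memory once and expose it through an API built with a web framework. Every request then goes through that API and gets a fast response.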
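
The idea of loading once and reusing the result can be sketched independently of any web framework. The example below wraps an expensive loader with functools.lru_cache so that it runs only on the first call. slow_loader is a stand-in for the real transformers.pipeline(...) call, and make_cached_loader is an illustrative name.

```python
from functools import lru_cache
from typing import Any, Callable, List


def make_cached_loader(load_model: Callable[[], Any]) -> Callable[[], Any]:
    """Wrap an expensive loader so that the model is built only on the first call."""

    @lru_cache(maxsize=1)
    def get_model() -> Any:
        return load_model()

    return get_model


# Stand-in loader for illustration; a real one would call transformers.pipeline(...).
load_calls: List[int] = []


def slow_loader() -> str:
    load_calls.append(1)
    return "model-object"


get_model = make_cached_loader(slow_loader)
get_model()
get_model()
print(len(load_calls))  # the loader ran once even though get_model was called twice
```

The FastAPI version later in this article reaches the same result more simply, by creating the pipeline at module import time so that it lives for the lifetime of the server process.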

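Writing the FastAPI code

FastAPI is a modern, high-performance web framework for building APIs. It requires Python 3.8+ and is designed to be simple and flexible. FastAPI is asynchronous, builds its core on Starlette, and uses Pydantic for data validation and serialization.

This makes FastAPI a good fit as the web framework in front of the model, though other frameworks such as Flask or Django would also work.

Combining FastAPI with the earlier transformers code, the final /root/huggingface/starCoder-1/main.py is: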

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

# Hugging Face model checkpoint for code completion
CHECKPOINT = "bigcode/starcoder"
# Device configuration: 0 for GPU, -1 for CPU
DEVICE = 0

# Initialize code generation pipeline
code_generator = pipeline("text-generation", model=CHECKPOINT, do_sample=True, device=DEVICE)


# Model parameters, customizable as needed
class CodeGenerationParameters(BaseModel):
    top_p: float
    return_full_text: bool
    num_return_sequences: int
    max_new_tokens: int
    temperature: float
    repetition_penalty: float


# Input data for code generation model
class CodeGenerationInput(BaseModel):
    code_input: str
    generation_parameters: CodeGenerationParameters


# FastAPI application instance
app = FastAPI()


@app.post("/")
async def generate_code(input_data: CodeGenerationInput):
    """
    Endpoint to generate code completion based on the provided input and generation parameters.

    Parameters:
    - input_data: CodeGenerationInput object containing code input and generation parameters.

    Returns:
    - Generated code completion.
    """
    # Extract input values
    input_code = input_data.code_input
    generation_params = input_data.generation_parameters

    # Prepare keyword arguments for code generation
    generation_kwargs = dict(
        top_p=generation_params.top_p,
        return_full_text=generation_params.return_full_text,
        num_return_sequences=generation_params.num_return_sequences,
        max_new_tokens=generation_params.max_new_tokens,
        temperature=generation_params.temperature,
        repetition_penalty=generation_params.repetition_penalty
    )

    # Generate code completion using the Hugging Face model
    generated_code = code_generator(input_code, **generation_kwargs)

    return generated_code

This defines an endpoint at /. Its input is CodeGenerationInput, which carries the input code and the generation parameters; its output is generated_code, the generated completion.
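
One thing to keep in mind is that the pipeline call is blocking, and a blocking call inside an async def handler holds up the event loop while it runs. A possible refinement, shown here only as a sketch with a stand-in function instead of the real model, is to hand the call to a worker thread with asyncio.to_thread (Python 3.9+). blocking_generate and generate_off_loop are hypothetical names.

```python
import asyncio
import time


def blocking_generate(prompt: str) -> str:
    """Stand-in for the pipeline call; the real call would occupy the GPU for a while."""
    time.sleep(0.1)
    return prompt.upper()


async def generate_off_loop(prompt: str) -> str:
    """Run the blocking call in a worker thread so the event loop stays responsive."""
    return await asyncio.to_thread(blocking_generate, prompt)


async def main() -> list:
    # Both calls are handed to worker threads; gather returns results in argument order.
    return await asyncio.gather(generate_off_loop("a"), generate_off_loop("b"))


results = asyncio.run(main())
```

Declaring the endpoint with a plain def instead of async def is a simpler alternative, since FastAPI then runs the handler in its own thread pool. Whether concurrent generations actually speed anything up depends on the GPU, so this mainly keeps the server responsive rather than making the model faster.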

Updating the base image

The code and its runtime need FastAPI and Uvicorn, so both must be installed in the base image. The updated Dockerfile is:

FROM tensorflow/tensorflow:2.15.0
RUN pip install torch torchvision torchaudio  -i https://pypi.tuna.tsinghua.edu.cn/simple
RUN pip install transformers -i https://pypi.tuna.tsinghua.edu.cn/simple
RUN pip install uvicorn fastapi -i https://pypi.tuna.tsinghua.edu.cn/simple

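Rebuilding the image

Run docker build -t tensorflow/tensorflow:2.15.0-torch . to rebuild the image.

Running the container

FastAPI applications are usually run and deployed with Uvicorn, a high-performance asynchronous server suited to handling many concurrent requests. To run a FastAPI web application, use the following command: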

uvicorn main:app --host 0.0.0.0 --port 80

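This command runs the app object in the module named main; that object is the FastAPI instance. The --host 0.0.0.0 argument allows access from any IP address, and --port 80 sets the port. The application is then reachable at that host and port.

The final docker command is therefore: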

docker run --gpus all -p 8080:80 -v /root/.cache/huggingface:/root/.cache/huggingface -v /root/huggingface/starCoder-1:/root/huggingface/starCoder-1 \
-d --workdir /root/huggingface/starCoder-1 -it tensorflow/tensorflow:2.15.0-torch uvicorn main:app --host 0.0.0.0 --port 80

When the log prints the following, the model has loaded and the API has started:

I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Loading checkpoint shards: 100%|████████████████████████████████████████████| 7/7 [01:44<00:00, 14.86s/it]
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:80 (Press CTRL+C to quit)
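
Loading the checkpoint shards takes well over a minute in the log above, so a client that starts right after the container needs to wait. Instead of watching the log, a script can poll the service until it answers. The sketch below assumes FastAPI's default /openapi.json route is left enabled; wait_until_ready is a hypothetical helper. It polls over HTTP rather than just opening a TCP connection, because a published Docker port may accept connections before the application inside is ready.

```python
import http.client
import time
import urllib.request

# Local polling should not be routed through any configured HTTP proxy.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_until_ready(url: str, timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Poll url until it answers with HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with _opener.open(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            pass  # server is not answering requests yet
        time.sleep(interval)
    return False
```

With the port mapping used above, the call would be wait_until_ready("http://localhost:8080/openapi.json").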

Testing the API

To verify the API, send a request to the container address or the local address. It responds normally and returns the model's result:

# Request (field names match the CodeGenerationInput model in main.py)
curl -i --request POST 'http://localhost:8080' --header 'Content-Type: application/json' --header 'Accept: */*' --header 'Connection: keep-alive' --data '{
    "code_input": "<fim_prefix><filename>test2.js\nconst a = '\''a'\''\n\nconsole.log(<fim_suffix>)<fim_middle>",
    "generation_parameters": {
        "top_p": 0.9,
        "return_full_text": false,
        "num_return_sequences": 1,
        "max_new_tokens": 256,
        "temperature": 0.2,
        "repetition_penalty": 1.2
    }
}'
# Response
HTTP/1.1 200 OK
server: uvicorn
content-length: 24
content-type: application/json

[{"generated_text":"a"}]
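
The same request can be sent from Python using only the standard library, which is convenient on hosts where extra packages cannot be installed. This is a sketch under the assumption that the service runs on localhost:8080 as configured above. The field names follow the CodeGenerationInput model defined in main.py, and build_completion_request and complete are illustrative names.

```python
import json
import urllib.request


def build_completion_request(code_input: str,
                             url: str = "http://localhost:8080/") -> urllib.request.Request:
    """Build a POST request whose JSON body matches the CodeGenerationInput model."""
    payload = {
        "code_input": code_input,
        "generation_parameters": {
            "top_p": 0.9,
            "return_full_text": False,
            "num_return_sequences": 1,
            "max_new_tokens": 256,
            "temperature": 0.2,
            "repetition_penalty": 1.2,
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(code_input: str) -> list:
    """Send the request and return the parsed JSON list produced by the pipeline."""
    with urllib.request.urlopen(build_completion_request(code_input), timeout=120) as response:
        return json.loads(response.read())
```

All six generation parameters are sent because the Pydantic model in main.py declares them without defaults, so omitting one would be rejected by validation.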

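Summary

This article covered two ways to run the StarCoder large model offline: text-generation-inference and transformers.

The former is a Hugging Face project for deploying and serving large models. The latter is Hugging Face's model library for natural language processing (NLP), which provides a collection of pretrained Transformer models that perform well across a range of NLP tasks.

In both cases, the model weights and configuration files are downloaded and saved locally in advance and then loaded in the offline environment, so no external network access is needed at runtime. Which approach to use depends on your business and deployment requirements.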

References:

Hugging Face Text generation strategies

Pipelines of transformers

PyTorch

Running a model from the Docker container without internet connectivity