Fluid: A Distributed Dataset Orchestration and Acceleration Engine

Author: lomtom

Introduction

In scenarios such as AI model training, offline big-data analytics, and real-time data processing, the access efficiency of large-scale datasets is a key bottleneck constraining task execution speed. In traditional storage schemes, datasets are often kept in remote PVCs or object storage; every access requires network transfer, which not only incurs high latency but is also vulnerable to fluctuations in network bandwidth.

As a cloud-native distributed dataset orchestration and acceleration engine, Fluid's core strength is its integration of Alluxio's distributed caching: datasets from remote sources can be cached in the local memory or disks of compute nodes, enabling "near-data" access. It supports multiple data sources such as PVC, object storage, and HDFS, significantly reducing access latency and improving task execution efficiency, while also providing flexible dataset management and scheduling capabilities.

Based on Fluid v1.0.8, this article walks through the complete deployment process in a Kubernetes environment, dataset configuration, and cache acceleration benchmarks. It also covers manually building ARM64 images (to work around the lack of official multi-architecture images), providing a directly applicable reference for dataset acceleration on multi-architecture clusters.

Environment Setup

Installing Fluid (with Helm)

  1. Add the Fluid Helm repository and update the index:
helm repo add fluid https://fluid-cloudnative.github.io/charts
helm repo update
  2. Set the image prefix and version parameters:
DefaultImagePrefix=swr.cn-east-3.myhuaweicloud.com/lomtom-common
DefaultVersion=v1.1.0-f82c77c4
  3. Run the install command, specifying the namespace, chart version, and image settings:
helm install fluid fluid/fluid -n fluid-system --create-namespace --version 1.0.8 \
--set imagePrefix=$DefaultImagePrefix \
--set crdUpgrade.imagePrefix=$DefaultImagePrefix \
--set dataset.controller.imagePrefix=$DefaultImagePrefix \
--set csi.registrar.imagePrefix=$DefaultImagePrefix \
--set csi.plugins.imagePrefix=$DefaultImagePrefix \
--set webhook.imagePrefix=$DefaultImagePrefix \
--set webhook.filePrefetcher.imagePrefix=$DefaultImagePrefix \
--set fluidapp.controller.imagePrefix=$DefaultImagePrefix \
--set runtime.alluxio.controller.imagePrefix=$DefaultImagePrefix \
--set runtime.alluxio.init.imagePrefix=$DefaultImagePrefix \
--set runtime.alluxio.runtime.imagePrefix=$DefaultImagePrefix \
--set runtime.alluxio.fuse.imagePrefix=$DefaultImagePrefix \
--set version=$DefaultVersion \
--set crdUpgrade.imageTag=$DefaultVersion \
--set dataset.controller.imageTag=$DefaultVersion \
--set csi.plugins.imageTag=$DefaultVersion \
--set webhook.imageTag=$DefaultVersion \
--set fluidapp.controller.imageTag=$DefaultVersion \
--set runtime.alluxio.controller.imageTag=$DefaultVersion \
--set runtime.alluxio.init.imageTag=v1.0.4 \
--set runtime.alluxio.runtime.imageTag=2.9.0-openeuler2403sp1 \
--set runtime.alluxio.fuse.imageTag=2.9.0-openeuler2403sp1

After installation, verify component status with kubectl get pods -n fluid-system and make sure all Pods are Running.

Building Images Manually (ARM64 Support)

Note: this step can be skipped (the images used in the installation above are already dual-architecture). Follow the steps below only if you want to build custom images.

Since Fluid and Alluxio officially ship AMD64-only images, ARM64 clusters (e.g. Kunpeng servers) require manually built images:

Building the Fluid Images

The Fluid images are built with Golang; a multi-platform build produces AMD64 and ARM64 images in one pass:

# Clone the Fluid source (v1.0.8)
git clone https://github.com/fluid-cloudnative/fluid.git -b v1.0.8

# Configure the image registry and build parameters
export IMG_REPO=swr.cn-east-3.myhuaweicloud.com/lomtom-common
export GO111MODULE=on
export DOCKER_PLATFORM=linux/amd64,linux/arm64

# Create a multi-platform build environment and push the images
docker buildx create --use
make docker-buildx-all-push

Building the Alluxio Images

Alluxio is Fluid's core cache runtime and requires both a base image (alluxio) and a dev image (alluxio-dev), built on openEuler 24.03 LTS SP1.

  1. Build the Alluxio base image:
# Clone the Alluxio source (v2.9.0)
git clone https://github.com/Alluxio/alluxio.git -b v2.9.0
cd alluxio/integration/docker/

# Modify the Dockerfile: switch the base image to openEuler and install fuse3-libs manually (full Dockerfile at the end of this article)
# 1. Change the base image from centos to openeuler/openeuler:24.03-lts-sp1
# 2. Add a fuse3-libs install step (the package is missing from the official openEuler repos)

# Build and push the multi-arch image
docker buildx build -t lomtom/alluxio:2.9.0-openeuler2403sp1 --platform linux/amd64,linux/arm64 . --push
  2. Build the Alluxio-dev image:
# Point the Dockerfile-dev base image at the Alluxio image built above (full Dockerfile at the end of this article)

# Build and push the multi-arch image
docker buildx build -t lomtom/alluxio-dev:2.9.0-openeuler2403sp1 \
  -f Dockerfile-dev --platform linux/amd64,linux/arm64 . --push

Preparing the Test Environment

Creating the PVC and Test Pod

The test uses CephFS storage (a CephFS cluster and a cephfs StorageClass must exist in advance); a PVC requests the storage, which is then mounted into a test Pod:

  1. Create the PVC (requesting 20GB of storage to hold the large test file):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fluid-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20G
  storageClassName: cephfs
  volumeMode: Filesystem
  2. Create the test Pod (mounting the PVC and running long-term, for the file upload and access tests that follow):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fluid
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: fluid
      app.kubernetes.io/name: fluid
      serverless.fluid.io/inject: "true"
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: fluid
        app.kubernetes.io/name: fluid
        serverless.fluid.io/inject: "true"
    spec:
      containers:
        - image: swr.cn-east-3.myhuaweicloud.com/lomtom-common/busybox:1.37.0
          name: fluid
          command:
            - sh
            - -c
            - |
              sleep 365d
          volumeMounts:
            - mountPath: /data
              name: storage
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: fluid-pvc

Verify the Pod status:

 kubectl get po
NAME                       READY   STATUS    RESTARTS   AGE
fluid-f8764d8d9-hsdnk      1/1     Running   0          85m
fluid1-587bfc8d65-zxc62    1/1     Running   0          87m

Expected output: fluid-f8764d8d9-hsdnk is Running; a second Pod (fluid1), created in the same way, serves as the control.
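
The manifest for the control Deployment (fluid1) is not shown in this article; the following is a plausible sketch, assuming the control differs only in name and in omitting the serverless.fluid.io/inject label, so that its Pod keeps reading the PVC directly without Fluid's sidecar injection:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fluid1   # control Deployment; hypothetical manifest
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: fluid1
      app.kubernetes.io/name: fluid1
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: fluid1
        app.kubernetes.io/name: fluid1
        # note: no serverless.fluid.io/inject label, so no cache acceleration
    spec:
      containers:
        - image: swr.cn-east-3.myhuaweicloud.com/lomtom-common/busybox:1.37.0
          name: fluid1
          command: ["sh", "-c", "sleep 365d"]
          volumeMounts:
            - mountPath: /data
              name: storage
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: fluid-pvc
```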

Uploading a Large Test File

Upload an AI model file (3.3GB) to the Pod's PVC mount directory, simulating a large-scale dataset from a real scenario:

# Copy a large local file into the Pod's /data directory (here a 3.3GB model file)
kubectl cp ./DeepSeek-R1-Distill-Qwen-1.5B/model.safetensors fluid-f8764d8d9-hsdnk:/data/

# Verify the upload
kubectl exec -it fluid-f8764d8d9-hsdnk -- ls /data/ -lh
total 3G     
-rw-r--r--    1 root     root        3.3G Nov 13 06:37 model.safetensors

Expected output: model.safetensors is about 3.3GB, confirming the upload succeeded.

Fluid Dataset Configuration and Cache Acceleration

Creating the Dataset and AlluxioRuntime

Fluid defines the data source with a Dataset and configures cache parameters with an AlluxioRuntime; the two must share the same name:

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: pv-demo-dataset
spec:
  mounts:
    - mountPoint: pvc://fluid-pvc
      name: data
      path: /
  accessModes:
    - ReadOnlyMany
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: pv-demo-dataset
spec:
  replicas: 1
  data:
    replicas: 1
  tieredstore:
    levels:
      - path: /dev/shm
        mediumtype: MEM
        quota: 20Gi
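
Once the Dataset is bound, Fluid exposes it as a PVC with the same name (pv-demo-dataset) in the Dataset's namespace, and a Pod can mount that PVC to read through the Alluxio cache. A minimal sketch (the Pod name, image, and mount path here are illustrative, not from the original setup):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dataset-reader   # illustrative name
spec:
  containers:
    - name: reader
      image: busybox:1.37.0
      command: ["sh", "-c", "ls -lh /data && sleep 3600"]
      volumeMounts:
        - mountPath: /data
          name: dataset
  volumes:
    - name: dataset
      persistentVolumeClaim:
        claimName: pv-demo-dataset   # PVC created by Fluid, named after the Dataset
```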

Triggering a Data Preload (Optional)

By default, Fluid caches lazily: data is cached only when first accessed. To avoid the warm-up cost on first access, a DataLoad resource can trigger a full preload:

apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: pv-demo-dataset
spec:
  dataset:
    name: pv-demo-dataset
    namespace: default

Verify the Dataset and AlluxioRuntime status and make sure the cache is ready:

# Check Dataset status (cache progress, capacity, etc.)
kubectl get datasets.data.fluid.io 
NAME              UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
pv-demo-dataset   3.31GiB          0.00B    20.00GiB         0.0%                Bound   10m

# Check AlluxioRuntime status (Master/Worker/FUSE components)
kubectl get alluxioruntimes.data.fluid.io 
NAME              MASTER PHASE   WORKER PHASE   FUSE PHASE   AGE
pv-demo-dataset   Ready          Ready          Ready        10m

# Check the Pods
kubectl get pod
NAME                       READY   STATUS    RESTARTS   AGE
fluid-f8764d8d9-hsdnk      1/1     Running   0          9m
fluid1-587bfc8d65-zxc62    1/1     Running   0          9m
pv-demo-dataset-master-0   2/2     Running   0          10m
pv-demo-dataset-worker-0   2/2     Running   0          10m

Expected output:

  • The Dataset phase is Bound, and UFS TOTAL SIZE matches the size of the Pod's /data directory;
  • The AlluxioRuntime Master, Worker, and FUSE phases are all Ready.

Performance Testing

Compare file copy speed in the uncached and cached states to verify Fluid's acceleration:

Uncached Test

Copy the large file in both Pods before any caching has happened and record the time (without a preload, the first access has no cache and relies on network transfer):

# Enter the test Pods
kubectl exec -it fluid-f8764d8d9-hsdnk -- /bin/sh
kubectl exec -it fluid1-587bfc8d65-zxc62 -- /bin/sh

# Copy the large file and time it
time cp /data/model.safetensors /root
real	0m 16.37s
user	0m 0.00s
sys		0m 2.12s

In theory, both Pods see similar first-access speed, limited by network bandwidth.
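
As a sanity check on the network-bound claim, the implied throughput of the uncached copy can be computed from the figures above (file size and timing are from this test; reading the result as the link's effective bandwidth is an assumption):

```python
# Implied throughput of the uncached copy: 3.3 GiB in 16.37 s
file_size_mib = 3.3 * 1024          # 3.3 GiB expressed in MiB
elapsed_s = 16.37                   # wall-clock time from `time cp`

throughput_mib_s = file_size_mib / elapsed_s
print(f"{throughput_mib_s:.0f} MiB/s")  # prints: 206 MiB/s, roughly a 1.7 Gbps effective link
```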

Cached Test

Recreate the test Pods (to eliminate the effect of the OS page cache), then measure the copy speed again (now served from Fluid's in-memory cache):

# Delete the old Deployment
kubectl delete deployment fluid
# Recreate the Pods from the earlier manifest
kubectl apply -f fluid-deployment.yaml

# Enter the recreated test Pod (cache-accelerated)
kubectl exec -it fluid-f8764d8d9-hktpt -- /bin/sh
# Enter the recreated control Pod (not attached to the cache, no acceleration)
kubectl exec -it fluid1-587bfc8d65-l2lsq -- /bin/sh

# In the cache-accelerated Pod
time cp /data/model.safetensors /root
real	0m 1.91s
user	0m 0.00s
sys		0m 1.90s

# Check the dataset
NAME              UFS TOTAL SIZE   CACHED    CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
pv-demo-dataset   3.31GiB          3.31GiB   20.00GiB         100.0%              Bound   34m

Expected results:

  1. In the cache-accelerated Pod (fluid-f8764d8d9-hktpt), copying the 3.3GB file drops from 16.37 seconds to 1.91 seconds with Fluid's cache, roughly an 8.6x improvement in access efficiency, a significant acceleration.
  2. The control Pod (fluid1-587bfc8d65-l2lsq), which bypasses the cache, remains slower than the accelerated one.
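
The quoted speedup follows directly from the two timings measured above:

```python
uncached_s = 16.37  # cp time before caching (network-bound)
cached_s = 1.91     # cp time after Fluid caching (memory-bound)

speedup = uncached_s / cached_s
print(f"{speedup:.1f}x")  # prints: 8.6x
```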


Dockerfile (alluxio)

#
# The Alluxio Open Foundation licenses this work under the Apache License, version 2.0
# (the "License"). You may not use this work except in compliance with the License, which is
# available at www.apache.org/licenses/LICENSE-2.0
#
# This software is distributed on an "AS IS" basis, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
# either express or implied, as more fully set forth in the License.
#
# See the NOTICE file distributed with this work for information regarding copyright ownership.
#

# ARG defined before the first FROM can be used in FROM lines
# Only 8 and 11 are supported.
ARG JAVA_VERSION=8

# Setup CSI
FROM golang:1.15.13-alpine AS csi-dev
ENV GO111MODULE=on
RUN mkdir -p /alluxio-csi
COPY ./csi /alluxio-csi
RUN cd /alluxio-csi && \
    CGO_ENABLED=0 go build -o /usr/local/bin/alluxio-csi

# We have to do an ADD to put the tarball into extractor, then do a COPY with chown into final
# ADD then chown in two steps will double the image size
#   See - https://stackoverflow.com/questions/30085621/why-does-chown-increase-size-of-docker-image
#       - https://github.com/moby/moby/issues/5505
#       - https://github.com/moby/moby/issues/6119
# ADD with chown doesn't chown the files inside tarball
#   See - https://github.com/moby/moby/issues/35525
FROM alpine:3.10.2 AS alluxio-extractor
# Note that downloads for *-SNAPSHOT tarballs are not available.
ARG ALLUXIO_TARBALL=http://downloads.alluxio.io/downloads/files/2.9.0/alluxio-2.9.0-bin.tar.gz
# (Alert):It's not recommended to set this Argument to true, unless you know exactly what you are doing
ARG ENABLE_DYNAMIC_USER=false

ADD ${ALLUXIO_TARBALL} /opt/
# Remote tarball needs to be untarred. Local tarball is untarred automatically.
# Use ln -s instead of mv to avoid issues with Centos (see https://github.com/moby/moby/issues/27358)
RUN cd /opt && \
    (if ls | grep -q ".tar.gz"; then tar -xzf *.tar.gz && rm *.tar.gz; fi) && \
    ln -s alluxio-* alluxio

RUN if [ ${ENABLE_DYNAMIC_USER} = "true" ] ; then \
       chmod -R 777 /opt/* ; \
    fi

# Configure Java
FROM openeuler/openeuler:24.03-lts-sp1 as build_java8
RUN \
    yum update -y && yum upgrade -y && \
    yum install -y java-1.8.0-openjdk-devel java-1.8.0-openjdk && \
    yum clean all
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk
# Disable JVM DNS cache in java8 (https://github.com/Alluxio/alluxio/pull/9452)
RUN echo "networkaddress.cache.ttl=0" >> /usr/lib/jvm/java-1.8.0-openjdk/jre/lib/security/java.security

FROM openeuler/openeuler:24.03-lts-sp1 as build_java11
RUN \
    yum update -y && yum upgrade -y && \
    yum install -y java-11-openjdk-devel java-11-openjdk && \
    yum clean all
ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk
# Disable JVM DNS cache in java11 (https://github.com/Alluxio/alluxio/pull/9452)
RUN echo "networkaddress.cache.ttl=0" >> /usr/lib/jvm/java-11-openjdk/conf/security/java.security

FROM build_java${JAVA_VERSION} AS final

WORKDIR /

# Install libfuse2 and libfuse3. Libfuse2 setup is modified from cheyang/fuse2:ubuntu1604-customize to be applied on centOS
RUN \
    yum install -y ca-certificates pkgconfig wget udev git && \
    yum install -y gcc gcc-c++ make cmake gettext-devel libtool autoconf && \
    git clone https://github.com/Alluxio/libfuse.git && \
    cd libfuse && \
    git checkout fuse_2_9_5_customize_multi_threads && \
    bash makeconf.sh && \
    ./configure && \
    make -j8 && \
    make install && \
    cd .. && \
    rm -rf libfuse
# https://developer.aliyun.com/packageSearch?word=fuse3-libs-3.6.1-4.1.al7
# Set build args based on the target architecture
ARG TARGETARCH
ARG TARGETVARIANT

# Choose the fuse3-libs RPM package based on the architecture
RUN if [ "$TARGETARCH" = "amd64" ]; then \
        wget -O /fuse3-libs.rpm https://mirrors.aliyun.com/alinux/2.1903/extras/x86_64/Packages/fuse3-libs-3.6.1-4.1.al7.x86_64.rpm; \
    elif [ "$TARGETARCH" = "arm64" ]; then \
        wget -O /fuse3-libs.rpm https://mirrors.aliyun.com/alinux/2.1903/extras/aarch64/Packages/fuse3-libs-3.6.1-4.1.al7.aarch64.rpm; \
    fi
RUN rpm -ivh --nodeps /fuse3-libs.rpm && \
        rm -f /fuse3-libs.rpm
RUN yum remove -y gcc gcc-c++ make cmake gettext-devel libtool autoconf wget git && \
    yum install -y fuse3 fuse3-devel && \
    yum clean all

# Configuration for the modified libfuse2
ENV MAX_IDLE_THREADS "64"

# /lib64 is for rocksdb native libraries, /usr/local/lib is for libfuse2 native libraries
ENV LD_LIBRARY_PATH "/lib64:/usr/local/lib:${LD_LIBRARY_PATH}"

ARG ALLUXIO_USERNAME=alluxio
ARG ALLUXIO_GROUP=alluxio
ARG ALLUXIO_UID=1000
ARG ALLUXIO_GID=1000

# For dev image to know the user
ENV ALLUXIO_DEV_UID=${ALLUXIO_UID}

ARG ENABLE_DYNAMIC_USER=true

# Add Tini for Alluxio helm charts (https://github.com/Alluxio/alluxio/pull/12233)
# - https://github.com/krallin/tini
ENV TINI_VERSION v0.18.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini-static /usr/local/bin/tini
RUN chmod +x /usr/local/bin/tini

# If Alluxio user, group, gid, and uid aren't root|0, create the alluxio user and set file permissions accordingly
RUN if [ ${ALLUXIO_USERNAME} != "root" ] \
    && [ ${ALLUXIO_GROUP} != "root" ] \
    && [ ${ALLUXIO_UID} -ne 0 ] \
    && [ ${ALLUXIO_GID} -ne 0 ]; then \
      groupadd --gid ${ALLUXIO_GID} ${ALLUXIO_GROUP} && \
      useradd --system -m --uid ${ALLUXIO_UID} --gid ${ALLUXIO_GROUP} ${ALLUXIO_USERNAME} && \
      usermod -a -G root ${ALLUXIO_USERNAME} && \
      mkdir -p /journal && \
      chown -R ${ALLUXIO_UID}:${ALLUXIO_GID} /journal && \
      chmod -R g=u /journal && \
      mkdir /mnt/alluxio-fuse && \
      chown -R ${ALLUXIO_UID}:${ALLUXIO_GID} /mnt/alluxio-fuse; \
    fi

# Docker 19.03+ required to expand variables in --chown argument
# https://github.com/moby/buildkit/pull/926#issuecomment-503943557
COPY --from=alluxio-extractor --chown=${ALLUXIO_USERNAME}:${ALLUXIO_GROUP} /opt /opt/
COPY --chown=${ALLUXIO_USERNAME}:${ALLUXIO_GROUP} conf /opt/alluxio/conf/
COPY --chown=${ALLUXIO_USERNAME}:${ALLUXIO_GROUP} entrypoint.sh /
COPY --from=csi-dev /usr/local/bin/alluxio-csi /usr/local/bin/

RUN if [ ${ENABLE_DYNAMIC_USER} = "true" ] ; then \
       chmod -R 777 /journal; \
       chmod -R 777 /mnt; \
       # Enable user_allow_other option for fuse in non-root mode
       echo "user_allow_other" >> /etc/fuse.conf; \
    fi

USER ${ALLUXIO_UID}

WORKDIR /opt/alluxio

ENV PATH="/opt/alluxio/bin:${PATH}"

ENTRYPOINT ["/entrypoint.sh"]

Dockerfile (alluxio-dev)

#
# The Alluxio Open Foundation licenses this work under the Apache License, version 2.0
# (the "License"). You may not use this work except in compliance with the License, which is
# available at www.apache.org/licenses/LICENSE-2.0
#
# This software is distributed on an "AS IS" basis, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
# either express or implied, as more fully set forth in the License.
#
# See the NOTICE file distributed with this work for information regarding copyright ownership.
#

FROM lomtom/alluxio:2.9.0-openeuler2403sp1

USER root

RUN \
    yum update -y && yum upgrade -y && \
    yum install -y java-11-openjdk-devel java-11-openjdk && \
    yum install -y ca-certificates pkgconfig wget udev git gcc gcc-c++ make cmake gettext-devel libtool autoconf unzip vim && \
    yum clean all

# Create a symlink for setting JAVA_HOME depending on java version. ENV cannot be set conditionally.
RUN ln -s /usr/lib/jvm/java-1.8.0-openjdk /usr/lib/jvm/java-8-openjdk
ARG JAVA_VERSION=8
ENV JAVA_HOME /usr/lib/jvm/java-${JAVA_VERSION}-openjdk

# Disable JVM DNS cache in java11 (https://github.com/Alluxio/alluxio/pull/9452)
RUN echo "networkaddress.cache.ttl=0" >> /usr/lib/jvm/java-11-openjdk/conf/security/java.security

# Install arthas(https://github.com/alibaba/arthas) for analyzing performance bottleneck
RUN wget -qO /tmp/arthas.zip "https://github.com/alibaba/arthas/releases/download/arthas-all-3.4.6/arthas-bin.zip" && \
    mkdir -p /opt/arthas && \
    unzip /tmp/arthas.zip -d /opt/arthas && \
    rm /tmp/arthas.zip

# Set build arg based on the target architecture
ARG TARGETARCH

# Install async-profiler(https://github.com/jvm-profiling-tools/async-profiler/releases/tag/v1.8.3)
RUN if [ "$TARGETARCH" = "amd64" ]; then \
        wget -qO /tmp/async-profiler-1.8.3-linux-x64.tar.gz "https://github.com/jvm-profiling-tools/async-profiler/releases/download/v1.8.3/async-profiler-1.8.3-linux-x64.tar.gz" && \
        tar -xvf /tmp/async-profiler-1.8.3-linux-x64.tar.gz -C /opt && \
        mv /opt/async-profiler-* /opt/async-profiler && \
        rm /tmp/async-profiler-1.8.3-linux-x64.tar.gz; \
     elif [ "$TARGETARCH" = "arm64" ]; then \
        wget -qO /tmp/async-profiler-1.8.3-linux-arm64.tar.gz "https://github.com/jvm-profiling-tools/async-profiler/releases/download/v1.8.3/async-profiler-1.8.3-linux-arm64.tar.gz" && \
        tar -xvf /tmp/async-profiler-1.8.3-linux-arm64.tar.gz -C /opt && \
        mv /opt/async-profiler-* /opt/async-profiler && \
        rm /tmp/async-profiler-1.8.3-linux-arm64.tar.gz; \
     fi
USER ${ALLUXIO_DEV_UID}

Title: Fluid: A Distributed Dataset Orchestration and Acceleration Engine

Author: lomtom

Link: https://lomtom.cn/ddwkjej7r12wp