diff --git a/README.md b/README.md index ee2852cc93dc095e6bc346210a90d5ec0c07170a..ee27af8bfb2c93ceb095f83b863f0f7289df402b 100644 --- a/README.md +++ b/README.md @@ -14,9 +14,9 @@ openYuanrong 由多语言函数运行时、函数系统和数据系统组成, openYuanrong 分为三个代码仓库:[yuanrong-runtime](https://gitee.com/openeuler/yuanrong-runtime) 对应多语言函数运行时;[yuanrong-functionsystem](https://gitee.com/openeuler/yuanrong-functionsystem) 对应函数系统;[yuanrong-datasystem](https://gitee.com/openeuler/yuanrong-datasystem) 对应数据系统,即当前代码仓。 -**数据系统**是 openYuanrong 的核心概念抽象,是一款专为 AI 训推场景设计的分布式异构缓存系统。支持 HBM/DDR/SSD 异构介质池化缓存及 NPU 间异步并发高效数据传输,用于分布式 KVCache 缓存、模型参数缓存、高性能 replaybuffer 等场景。 +**数据系统(openYuanrong datasystem)**是 openYuanrong 的核心概念抽象,是一款专为 AI 训推场景设计的分布式异构缓存系统。支持 HBM/DDR/SSD 异构介质池化缓存及 NPU 间异步并发高效数据传输,用于分布式 KVCache 缓存、模型参数缓存、高性能 replaybuffer 等场景。 -yuanrong-datasystem 的主要特性包括: +openYuanrong datasystem 的主要特性包括: - **高性能分布式多级缓存**:基于 DRAM/SSD 构建分布式多级缓存,应用实例通过共享内存免拷贝读写 DRAM 数据,并提供高性能 H2D(host to device)/D2H(device to host) 接口,实现 HBM 与 DRAM 之间快速 swap。 - **NPU 间高效数据传输**:将 NPU 的 HBM 抽象为异构对象,自动协调 NPU 间 HCCL 收发顺序,实现简单易用的卡间数据异步并发传输。并支持P2P传输负载均衡策略,充分利用卡间链路带宽。 @@ -27,30 +27,30 @@ yuanrong-datasystem 的主要特性包括: - **数据发布订阅**:支持数据订阅发布,解耦数据的生产者(发布者)和消费者(订阅者),实现数据的异步传输与共享。 - **高可靠高可用**:支持分布式元数据管理,实现系统水平线性扩展。支持元数据可靠性,支持动态资源伸缩自动迁移数据,实现系统高可用。 -### yuanrong-datasystem 适用场景 +### openYuanrong datasystem 适用场景 - **LLM 长序列推理 KVCache**:基于异构对象提供分布式多级缓存 (HBM/DRAM/SSD) 和高吞吐 D2D/H2D/D2H 访问能力,构建分布式 KV Cache,实现 Prefill 阶段的 KVCache 缓存以及 Prefill/Decode 实例间 KV Cache 快速传递,提升推理吞吐。 - **模型推理实例 M->N 快速弹性**:利用异构对象的卡间直通及 P2P 数据分发能力实现模型参数快速复制。 - **强化学习模型参数重排**:利用异构对象的卡间直通传输能力,快速将模型参数从训练侧同步到推理侧。 - **训练场景 CheckPoint 快速保存及加载**:基于 KV 接口快速写 Checkpoint,并支持将数据持久化到二级缓存保证数据可靠性。Checkpoint恢复时各节点将 Checkpoint 分片快速加载到异构对象中,利用异构对象的卡间直通传输及 P2P 数据分发能力,快速将 Checkpoint 传递到各节点 HBM。 -### yuanrong-datasystem 架构 +### openYuanrong datasystem 架构 ![](./docs/source_zh_cn/getting-started/image/logical_architecture.png) -yuanrong-datasystem 由三个部分组成: +openYuanrong datasystem 由三个部分组成: - 
**多语言SDK**:提供 Python/C++ 语言接口,封装 heterogeneous object 及 KV 接口,支撑业务实现数据快速读写。提供两种类型接口: - **heterogeneous object**:基于 NPU 卡的 HBM 内存抽象异构对象接口,实现昇腾 NPU 卡间数据高速直通传输。同时提供 H2D/D2H 高速迁移接口,实现数据快速在 DRAM/HBM 之间传输。 - **KV**:基于共享内存实现免拷贝的 KV 接口,实现高性能数据缓存,支持通过对接外部组件提供数据可靠性语义。 -- **worker**:yuanrong-datasystem 的核心组件,用于分配管理 DRAM/SSD 资源以及元数据,提供分布式多级缓存能力。 +- **worker**:openYuanrong datasystem 的核心组件,用于分配管理 DRAM/SSD 资源以及元数据,提供分布式多级缓存能力。 - **集群管理**:依赖 ETCD,实现节点发现/健康检测,支持故障恢复及在线扩缩容。 ![](./docs/source_zh_cn/getting-started/image/deployment.png) -yuanrong-datasystem 的部署视图如上图所示: +openYuanrong datasystem 的部署视图如上图所示: - 需部署 ETCD 用于集群管理。 - 每个节点需部署 worker 进程并注册到 ETCD。 @@ -59,68 +59,68 @@ yuanrong-datasystem 的部署视图如上图所示: 各组件间的数据传输协议如下: - SDK 与 worker 之间通过共享内存读写数据。 -- worker 和 worker 之间通过 TCP/RDMA 传输数据(当前版本仅支持 TCP,后续版本支持 RDMA )。 +- worker 和 worker 之间通过 TCP/RDMA 传输数据(当前版本仅支持 TCP,RDMA/UB 即将支持)。 - 异构对象 HBM 之间通过 HCCS/RoCE 卡间直通传输数据。 ## 入门 -### 安装 yuanrong-datasystem +### 安装 openYuanrong datasystem #### pip 方式安装 -- 安装 yuanrong-datasystem 完整发行版(包含Python SDK、C++ SDK以及命令行工具): +- 安装 openYuanrong datasystem 完整发行版(包含Python SDK、C++ SDK以及命令行工具): ```bash - pip install pip install https://openyuanrong.obs.cn-southwest-2.myhuaweicloud.com/openyuanrong_datasystem-0.5.0-cp39-cp39-manylinux_2_34_x86_64.whl + pip install https://openyuanrong.obs.cn-southwest-2.myhuaweicloud.com/openyuanrong_datasystem-0.5.0-cp39-cp39-manylinux_2_34_x86_64.whl ``` -- 仅安装 yuanrong-datasystem Python SDK(不包含C++ SDK以及命令行工具): +- 仅安装 openYuanrong datasystem Python SDK(不包含C++ SDK以及命令行工具): ```bash - pip install pip install https://openyuanrong.obs.cn-southwest-2.myhuaweicloud.com/openyuanrong_datasystem_sdk-0.5.0-cp39-cp39-manylinux_2_34_x86_64.whl + pip install https://openyuanrong.obs.cn-southwest-2.myhuaweicloud.com/openyuanrong_datasystem_sdk-0.5.0-cp39-cp39-manylinux_2_34_x86_64.whl ``` #### 源码编译方式安装 -使用源码编译方式安装 yuanrong-datasystem 可以参考文档:[源码编译安装 
yuanrong-datasystem](./docs/source_zh_cn/getting-started/install.md#源码编译方式安装yuanrong-datasystem版本) +使用源码编译方式安装 openYuanrong datasystem 可以参考文档:[源码编译安装 openYuanrong datasystem](./docs/source_zh_cn/getting-started/install.md#源码编译方式安装openyuanrong-datasystem版本) -### 部署 yuanrong-datasystem +### 部署 openYuanrong datasystem #### 进程部署 - 准备ETCD - yuanrong-datasystem 的集群管理依赖 ETCD,请先在后台启动单节点 ETCD(示例端口 2379): + openYuanrong datasystem 的集群管理依赖 ETCD,请先在后台启动单节点 ETCD(示例端口 2379): ```bash etcd --listen-client-urls http://0.0.0.0:2379 \ - --advertise-client-urls http://localhost:2379 + --advertise-client-urls http://localhost:2379 & ``` -- 一键启动集群 +- 一键部署 - 安装 yuanrong-datasystem 完整发行版后,即可通过随包自带的 dscli 命令行工具一键完成集群部署。启动一个监听端口号为 31501 的单机集群: + 安装 openYuanrong datasystem 完整发行版后,即可通过随包自带的 dscli 命令行工具一键完成集群部署。在当前节点启动一个监听端口号为 31501 的服务端进程: ```bash dscli start -w --worker_address "127.0.0.1:31501" --etcd_address "127.0.0.1:2379" ``` -- 集群卸载 +- 一键卸载 ```bash dscli stop --worker_address "127.0.0.1:31501" ``` -更多进程部署参数与部署方式请参考文档:[yuanrong-datasystem 进程部署](./docs/source_zh_cn/getting-started/deploy.md#yuanrong-datasystem进程部署) +更多进程部署参数与部署方式请参考文档:[openYuanrong datasystem 进程部署](./docs/source_zh_cn/getting-started/deploy.md#openyuanrong-datasystem进程部署) #### Kubernetes 部署 -yuanrong-datasystem 还提供了基于 Kubernetes 容器化部署方式,部署前请确保部署环境集群已就绪 Kubernetes、Helm 及可访问的 ETCD 集群。 +openYuanrong datasystem 还提供了基于 Kubernetes 的容器化部署方式,部署前请确保环境中已就绪 Kubernetes、Helm 及可访问的 ETCD 集群。 -- 获取 yuanrong-datasystem helm chart 包 +- 获取 openYuanrong datasystem helm chart 包 - 安装 yuanrong-datasystem 完整发行版后,即可通过随包自带的 dscli 命令行工具在当前路径下快速获取 helm chart 包: + 安装 openYuanrong datasystem 完整发行版后,即可通过随包自带的 dscli 命令行工具在当前路径下快速获取 helm chart 包: ``` dscli generate_helm_chart -o ./ ``` - 编辑集群部署配置 - yuanrong-datasystem 通过 ./datasystem/values.yaml 文件进行集群相关配置,其中必配项如下: + openYuanrong datasystem 通过 ./datasystem/values.yaml 文件进行集群相关配置,其中必配项如下: ```yaml global: @@ -130,7 +130,7 @@ yuanrong-datasystem 还提供了基于 Kubernetes 容器化部署方式,部署 imageRegistry: 
"swr.cn-south-1.myhuaweicloud.com/openeuler/" # 镜像名字和镜像tag images: - datasystem: "yr-datasystem:0.5.0-alpha" + datasystem: "openyuanrong-datasystem:0.5.0" etcd: # ETCD集群地址 @@ -139,41 +139,34 @@ yuanrong-datasystem 还提供了基于 Kubernetes 容器化部署方式,部署 - 集群部署 - Helm 会提交 DaemonSet,按节点依次拉起 yuanrong-datasystem 实例: + Helm 会提交 DaemonSet,按节点依次拉起 openYuanrong datasystem 实例: ```bash - helm install yr_datasystem ./datasystem + helm install openyuanrong-datasystem ./datasystem ``` - 集群卸载 ```bash - helm uninstall yr_datasystem + helm uninstall openyuanrong-datasystem ``` -更多 yuanrong-datasystem Kubernetes 高级参数配置请参考文档:[yuanrong-datasystem Kubernetes 部署](./docs/source_zh_cn/getting-started/deploy.md#yuanrong-datasystem-kubernetes部署) +更多 openYuanrong datasystem Kubernetes 高级参数配置请参考文档:[openYuanrong datasystem Kubernetes 部署](./docs/source_zh_cn/getting-started/deploy.md#openyuanrong-datasystem-kubernetes部署) ### 代码样例 - 异构对象 - 通过异构对象接口实现 HBM 数据零拷贝发布/订阅 + 通过异构对象接口,将任意二进制数据以键值对形式写入 HBM: ```python import acl - import random - from datasystem.ds_client import DsClient - - def random_str(slen=10): - seed = "1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!@#%^*()_+=-" - sa = [] - for _ in range(slen): - sa.append(random.choice(seed)) - return ''.join(sa) + import os + from datasystem import Blob, DsClient, DeviceBlobList - # hetero_dev_publish and hetero_dev_subscribe must be executed in different processes + # hetero_dev_mset and hetero_dev_mget must be executed in different processes # because they need to be bound to different NPUs. 
- def hetero_dev_publish(): + def hetero_dev_mset(): client = DsClient("127.0.0.1", 31501) client.init() @@ -183,7 +176,7 @@ yuanrong-datasystem 还提供了基于 Kubernetes 容器化部署方式,部署 key_list = [ 'key1', 'key2', 'key3' ] data_size = 1024 * 1024 - test_value = random_str(data_size) + test_value = "value" in_data_blob_list = [] for _ in key_list: @@ -195,11 +188,9 @@ yuanrong-datasystem 还提供了基于 Kubernetes 容器化部署方式,部署 tmp_batch_list.append(blob) blob_list = DeviceBlobList(device_idx, tmp_batch_list) in_data_blob_list.append(blob_list) - pub_futures = client.hetero().dev_publish(key_list, in_data_blob_list) - for future in pub_futures: - future.get() + client.hetero().dev_mset(key_list, in_data_blob_list) - def hetero_dev_subscribe(): + def hetero_dev_mget(): client = DsClient("127.0.0.1", 31501) client.init() @@ -218,14 +209,21 @@ yuanrong-datasystem 还提供了基于 Kubernetes 容器化部署方式,部署 tmp_batch_list.append(blob) blob_list = DeviceBlobList(device_idx, tmp_batch_list) out_data_blob_list.append(blob_list) - sub_futures = client.hetero().dev_subscribe(key_list, out_data_blob_list) - for future in sub_futures: - future.get() + client.hetero().dev_mget(key_list, out_data_blob_list, 60000) + client.hetero().dev_delete(key_list) + + pid = os.fork() + if pid == 0: + hetero_dev_mset() + os._exit(0) + else: + hetero_dev_mget() + os.wait() ``` - KV - 通过 KV 接口,将任意二进制数据以键值对形式写入全局分布式缓存: + 通过 KV 接口,将任意二进制数据以键值对形式写入 DDR: ```python from datasystem.ds_client import DsClient @@ -245,12 +243,9 @@ yuanrong-datasystem 还提供了基于 Kubernetes 容器化部署方式,部署 ## 文档 -有关 yuanrong-datasystem 安装指南、教程和 API 的更多详细信息,请参阅 [用户文档](docs) - -查看 [openYuanrong 文档](https://pages.openeuler.openatom.cn/openyuanrong/docs/zh-cn/latest/index.html) 了解如何使用 openYuanrong 开发分布式应用。 +有关 openYuanrong datasystem 安装指南、教程和 API 的更多详细信息,请参阅 [用户文档](docs)。 -- 安装:`pip install openyuanrong`,[更多安装信息](https://pages.openeuler.openatom.cn/openyuanrong/docs/zh-cn/latest/deploy/installation.html)。 -- 
[快速入门](https://pages.openeuler.openatom.cn/openyuanrong/docs/zh-cn/latest/getting_started.html) +有关 openYuanrong 更多详细信息请参阅 [openYuanrong 文档](https://pages.openeuler.openatom.cn/openyuanrong/docs/zh-cn/latest/index.html),了解如何使用 openYuanrong 开发分布式应用。 ## 贡献 diff --git a/README_CN.md b/README_CN.md index ee2852cc93dc095e6bc346210a90d5ec0c07170a..ee27af8bfb2c93ceb095f83b863f0f7289df402b 100644 --- a/README_CN.md +++ b/README_CN.md @@ -14,9 +14,9 @@ openYuanrong 由多语言函数运行时、函数系统和数据系统组成, openYuanrong 分为三个代码仓库:[yuanrong-runtime](https://gitee.com/openeuler/yuanrong-runtime) 对应多语言函数运行时;[yuanrong-functionsystem](https://gitee.com/openeuler/yuanrong-functionsystem) 对应函数系统;[yuanrong-datasystem](https://gitee.com/openeuler/yuanrong-datasystem) 对应数据系统,即当前代码仓。 -**数据系统**是 openYuanrong 的核心概念抽象,是一款专为 AI 训推场景设计的分布式异构缓存系统。支持 HBM/DDR/SSD 异构介质池化缓存及 NPU 间异步并发高效数据传输,用于分布式 KVCache 缓存、模型参数缓存、高性能 replaybuffer 等场景。 +**数据系统(openYuanrong datasystem)**是 openYuanrong 的核心概念抽象,是一款专为 AI 训推场景设计的分布式异构缓存系统。支持 HBM/DDR/SSD 异构介质池化缓存及 NPU 间异步并发高效数据传输,用于分布式 KVCache 缓存、模型参数缓存、高性能 replaybuffer 等场景。 -yuanrong-datasystem 的主要特性包括: +openYuanrong datasystem 的主要特性包括: - **高性能分布式多级缓存**:基于 DRAM/SSD 构建分布式多级缓存,应用实例通过共享内存免拷贝读写 DRAM 数据,并提供高性能 H2D(host to device)/D2H(device to host) 接口,实现 HBM 与 DRAM 之间快速 swap。 - **NPU 间高效数据传输**:将 NPU 的 HBM 抽象为异构对象,自动协调 NPU 间 HCCL 收发顺序,实现简单易用的卡间数据异步并发传输。并支持P2P传输负载均衡策略,充分利用卡间链路带宽。 @@ -27,30 +27,30 @@ yuanrong-datasystem 的主要特性包括: - **数据发布订阅**:支持数据订阅发布,解耦数据的生产者(发布者)和消费者(订阅者),实现数据的异步传输与共享。 - **高可靠高可用**:支持分布式元数据管理,实现系统水平线性扩展。支持元数据可靠性,支持动态资源伸缩自动迁移数据,实现系统高可用。 -### yuanrong-datasystem 适用场景 +### openYuanrong datasystem 适用场景 - **LLM 长序列推理 KVCache**:基于异构对象提供分布式多级缓存 (HBM/DRAM/SSD) 和高吞吐 D2D/H2D/D2H 访问能力,构建分布式 KV Cache,实现 Prefill 阶段的 KVCache 缓存以及 Prefill/Decode 实例间 KV Cache 快速传递,提升推理吞吐。 - **模型推理实例 M->N 快速弹性**:利用异构对象的卡间直通及 P2P 数据分发能力实现模型参数快速复制。 - **强化学习模型参数重排**:利用异构对象的卡间直通传输能力,快速将模型参数从训练侧同步到推理侧。 - **训练场景 CheckPoint 快速保存及加载**:基于 KV 接口快速写 Checkpoint,并支持将数据持久化到二级缓存保证数据可靠性。Checkpoint恢复时各节点将 Checkpoint 
分片快速加载到异构对象中,利用异构对象的卡间直通传输及 P2P 数据分发能力,快速将 Checkpoint 传递到各节点 HBM。 -### yuanrong-datasystem 架构 +### openYuanrong datasystem 架构 ![](./docs/source_zh_cn/getting-started/image/logical_architecture.png) -yuanrong-datasystem 由三个部分组成: +openYuanrong datasystem 由三个部分组成: - **多语言SDK**:提供 Python/C++ 语言接口,封装 heterogeneous object 及 KV 接口,支撑业务实现数据快速读写。提供两种类型接口: - **heterogeneous object**:基于 NPU 卡的 HBM 内存抽象异构对象接口,实现昇腾 NPU 卡间数据高速直通传输。同时提供 H2D/D2H 高速迁移接口,实现数据快速在 DRAM/HBM 之间传输。 - **KV**:基于共享内存实现免拷贝的 KV 接口,实现高性能数据缓存,支持通过对接外部组件提供数据可靠性语义。 -- **worker**:yuanrong-datasystem 的核心组件,用于分配管理 DRAM/SSD 资源以及元数据,提供分布式多级缓存能力。 +- **worker**:openYuanrong datasystem 的核心组件,用于分配管理 DRAM/SSD 资源以及元数据,提供分布式多级缓存能力。 - **集群管理**:依赖 ETCD,实现节点发现/健康检测,支持故障恢复及在线扩缩容。 ![](./docs/source_zh_cn/getting-started/image/deployment.png) -yuanrong-datasystem 的部署视图如上图所示: +openYuanrong datasystem 的部署视图如上图所示: - 需部署 ETCD 用于集群管理。 - 每个节点需部署 worker 进程并注册到 ETCD。 @@ -59,68 +59,68 @@ yuanrong-datasystem 的部署视图如上图所示: 各组件间的数据传输协议如下: - SDK 与 worker 之间通过共享内存读写数据。 -- worker 和 worker 之间通过 TCP/RDMA 传输数据(当前版本仅支持 TCP,后续版本支持 RDMA )。 +- worker 和 worker 之间通过 TCP/RDMA 传输数据(当前版本仅支持 TCP,RDMA/UB 即将支持)。 - 异构对象 HBM 之间通过 HCCS/RoCE 卡间直通传输数据。 ## 入门 -### 安装 yuanrong-datasystem +### 安装 openYuanrong datasystem #### pip 方式安装 -- 安装 yuanrong-datasystem 完整发行版(包含Python SDK、C++ SDK以及命令行工具): +- 安装 openYuanrong datasystem 完整发行版(包含Python SDK、C++ SDK以及命令行工具): ```bash - pip install pip install https://openyuanrong.obs.cn-southwest-2.myhuaweicloud.com/openyuanrong_datasystem-0.5.0-cp39-cp39-manylinux_2_34_x86_64.whl + pip install https://openyuanrong.obs.cn-southwest-2.myhuaweicloud.com/openyuanrong_datasystem-0.5.0-cp39-cp39-manylinux_2_34_x86_64.whl ``` -- 仅安装 yuanrong-datasystem Python SDK(不包含C++ SDK以及命令行工具): +- 仅安装 openYuanrong datasystem Python SDK(不包含C++ SDK以及命令行工具): ```bash - pip install pip install https://openyuanrong.obs.cn-southwest-2.myhuaweicloud.com/openyuanrong_datasystem_sdk-0.5.0-cp39-cp39-manylinux_2_34_x86_64.whl + pip install 
https://openyuanrong.obs.cn-southwest-2.myhuaweicloud.com/openyuanrong_datasystem_sdk-0.5.0-cp39-cp39-manylinux_2_34_x86_64.whl ``` #### 源码编译方式安装 -使用源码编译方式安装 yuanrong-datasystem 可以参考文档:[源码编译安装 yuanrong-datasystem](./docs/source_zh_cn/getting-started/install.md#源码编译方式安装yuanrong-datasystem版本) +使用源码编译方式安装 openYuanrong datasystem 可以参考文档:[源码编译安装 openYuanrong datasystem](./docs/source_zh_cn/getting-started/install.md#源码编译方式安装openyuanrong-datasystem版本) -### 部署 yuanrong-datasystem +### 部署 openYuanrong datasystem #### 进程部署 - 准备ETCD - yuanrong-datasystem 的集群管理依赖 ETCD,请先在后台启动单节点 ETCD(示例端口 2379): + openYuanrong datasystem 的集群管理依赖 ETCD,请先在后台启动单节点 ETCD(示例端口 2379): ```bash etcd --listen-client-urls http://0.0.0.0:2379 \ - --advertise-client-urls http://localhost:2379 + --advertise-client-urls http://localhost:2379 & ``` -- 一键启动集群 +- 一键部署 - 安装 yuanrong-datasystem 完整发行版后,即可通过随包自带的 dscli 命令行工具一键完成集群部署。启动一个监听端口号为 31501 的单机集群: + 安装 openYuanrong datasystem 完整发行版后,即可通过随包自带的 dscli 命令行工具一键完成集群部署。在当前节点启动一个监听端口号为 31501 的服务端进程: ```bash dscli start -w --worker_address "127.0.0.1:31501" --etcd_address "127.0.0.1:2379" ``` -- 集群卸载 +- 一键卸载 ```bash dscli stop --worker_address "127.0.0.1:31501" ``` -更多进程部署参数与部署方式请参考文档:[yuanrong-datasystem 进程部署](./docs/source_zh_cn/getting-started/deploy.md#yuanrong-datasystem进程部署) +更多进程部署参数与部署方式请参考文档:[openYuanrong datasystem 进程部署](./docs/source_zh_cn/getting-started/deploy.md#openyuanrong-datasystem进程部署) #### Kubernetes 部署 -yuanrong-datasystem 还提供了基于 Kubernetes 容器化部署方式,部署前请确保部署环境集群已就绪 Kubernetes、Helm 及可访问的 ETCD 集群。 +openYuanrong datasystem 还提供了基于 Kubernetes 的容器化部署方式,部署前请确保环境中已就绪 Kubernetes、Helm 及可访问的 ETCD 集群。 -- 获取 yuanrong-datasystem helm chart 包 +- 获取 openYuanrong datasystem helm chart 包 - 安装 yuanrong-datasystem 完整发行版后,即可通过随包自带的 dscli 命令行工具在当前路径下快速获取 helm chart 包: + 安装 openYuanrong datasystem 完整发行版后,即可通过随包自带的 dscli 命令行工具在当前路径下快速获取 helm chart 包: ``` dscli generate_helm_chart -o ./ ``` - 编辑集群部署配置 - yuanrong-datasystem 通过 ./datasystem/values.yaml 
文件进行集群相关配置,其中必配项如下: + openYuanrong datasystem 通过 ./datasystem/values.yaml 文件进行集群相关配置,其中必配项如下: ```yaml global: @@ -130,7 +130,7 @@ yuanrong-datasystem 还提供了基于 Kubernetes 容器化部署方式,部署 imageRegistry: "swr.cn-south-1.myhuaweicloud.com/openeuler/" # 镜像名字和镜像tag images: - datasystem: "yr-datasystem:0.5.0-alpha" + datasystem: "openyuanrong-datasystem:0.5.0" etcd: # ETCD集群地址 @@ -139,41 +139,34 @@ yuanrong-datasystem 还提供了基于 Kubernetes 容器化部署方式,部署 - 集群部署 - Helm 会提交 DaemonSet,按节点依次拉起 yuanrong-datasystem 实例: + Helm 会提交 DaemonSet,按节点依次拉起 openYuanrong datasystem 实例: ```bash - helm install yr_datasystem ./datasystem + helm install openyuanrong-datasystem ./datasystem ``` - 集群卸载 ```bash - helm uninstall yr_datasystem + helm uninstall openyuanrong-datasystem ``` -更多 yuanrong-datasystem Kubernetes 高级参数配置请参考文档:[yuanrong-datasystem Kubernetes 部署](./docs/source_zh_cn/getting-started/deploy.md#yuanrong-datasystem-kubernetes部署) +更多 openYuanrong datasystem Kubernetes 高级参数配置请参考文档:[openYuanrong datasystem Kubernetes 部署](./docs/source_zh_cn/getting-started/deploy.md#openyuanrong-datasystem-kubernetes部署) ### 代码样例 - 异构对象 - 通过异构对象接口实现 HBM 数据零拷贝发布/订阅 + 通过异构对象接口,将任意二进制数据以键值对形式写入 HBM: ```python import acl - import random - from datasystem.ds_client import DsClient - - def random_str(slen=10): - seed = "1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!@#%^*()_+=-" - sa = [] - for _ in range(slen): - sa.append(random.choice(seed)) - return ''.join(sa) + import os + from datasystem import Blob, DsClient, DeviceBlobList - # hetero_dev_publish and hetero_dev_subscribe must be executed in different processes + # hetero_dev_mset and hetero_dev_mget must be executed in different processes # because they need to be bound to different NPUs. 
- def hetero_dev_publish(): + def hetero_dev_mset(): client = DsClient("127.0.0.1", 31501) client.init() @@ -183,7 +176,7 @@ yuanrong-datasystem 还提供了基于 Kubernetes 容器化部署方式,部署 key_list = [ 'key1', 'key2', 'key3' ] data_size = 1024 * 1024 - test_value = random_str(data_size) + test_value = "value" in_data_blob_list = [] for _ in key_list: @@ -195,11 +188,9 @@ yuanrong-datasystem 还提供了基于 Kubernetes 容器化部署方式,部署 tmp_batch_list.append(blob) blob_list = DeviceBlobList(device_idx, tmp_batch_list) in_data_blob_list.append(blob_list) - pub_futures = client.hetero().dev_publish(key_list, in_data_blob_list) - for future in pub_futures: - future.get() + client.hetero().dev_mset(key_list, in_data_blob_list) - def hetero_dev_subscribe(): + def hetero_dev_mget(): client = DsClient("127.0.0.1", 31501) client.init() @@ -218,14 +209,21 @@ yuanrong-datasystem 还提供了基于 Kubernetes 容器化部署方式,部署 tmp_batch_list.append(blob) blob_list = DeviceBlobList(device_idx, tmp_batch_list) out_data_blob_list.append(blob_list) - sub_futures = client.hetero().dev_subscribe(key_list, out_data_blob_list) - for future in sub_futures: - future.get() + client.hetero().dev_mget(key_list, out_data_blob_list, 60000) + client.hetero().dev_delete(key_list) + + pid = os.fork() + if pid == 0: + hetero_dev_mset() + os._exit(0) + else: + hetero_dev_mget() + os.wait() ``` - KV - 通过 KV 接口,将任意二进制数据以键值对形式写入全局分布式缓存: + 通过 KV 接口,将任意二进制数据以键值对形式写入 DDR: ```python from datasystem.ds_client import DsClient @@ -245,12 +243,9 @@ yuanrong-datasystem 还提供了基于 Kubernetes 容器化部署方式,部署 ## 文档 -有关 yuanrong-datasystem 安装指南、教程和 API 的更多详细信息,请参阅 [用户文档](docs) - -查看 [openYuanrong 文档](https://pages.openeuler.openatom.cn/openyuanrong/docs/zh-cn/latest/index.html) 了解如何使用 openYuanrong 开发分布式应用。 +有关 openYuanrong datasystem 安装指南、教程和 API 的更多详细信息,请参阅 [用户文档](docs)。 -- 安装:`pip install openyuanrong`,[更多安装信息](https://pages.openeuler.openatom.cn/openyuanrong/docs/zh-cn/latest/deploy/installation.html)。 -- 
[快速入门](https://pages.openeuler.openatom.cn/openyuanrong/docs/zh-cn/latest/getting_started.html) +有关 openYuanrong 的更多详细信息请参阅 [openYuanrong 文档](https://pages.openeuler.openatom.cn/openyuanrong/docs/zh-cn/latest/index.html),了解如何使用 openYuanrong 开发分布式应用。 ## 贡献 diff --git a/build.sh b/build.sh index 65bf1a7df43f208523601cc2eb680a8bb387476a..a625115faeb20a4d6583692df031c39a599cd345 100755 --- a/build.sh +++ b/build.sh @@ -87,12 +87,16 @@ Options: -m The timeout period of testcases, the unit is second, default: 40. Environment: -1) DS_OPENSOURCE_DIR: Specifies a directory to cache the opensource compilation result. +1) DS_JEMALLOC_LG_PAGE: Explicitly sets the page size used by jemalloc: page size = 2^\${DS_JEMALLOC_LG_PAGE} bytes. + When this variable is omitted, jemalloc infers the system's page size at build time from the build environment + (e.g., via sysconf(_SC_PAGESIZE) or equivalent). Set DS_JEMALLOC_LG_PAGE only if the runtime system's page size + differs from the one detected at build time and you need to override the detected value. +2) DS_OPENSOURCE_DIR: Specifies a directory to cache the opensource compilation result. Cache the compilation result to speed up the compilation. Default: /tmp/{sha256(pwd)}/ -2) DS_VERSION: Customize a version number during compilation. -3) DS_PACKAGE: If specified, third-party libs for the path provided by this variable will be compiled, for +3) DS_VERSION: Customize a version number during compilation. +4) DS_PACKAGE: If specified, third-party libs for the path provided by this variable will be compiled, for version build only. -4) CTEST_OUTPUT_ON_FAILURE: Boolean environment variable that controls if the sdk output should be logged for +5) CTEST_OUTPUT_ON_FAILURE: Boolean environment variable that controls if the sdk output should be logged for failed tests. Set the value to 1, True, or ON to enable output on failure. 
Example: diff --git a/deploy/conf/cluster_config.json b/cli/deploy/conf/cluster_config.json similarity index 100% rename from deploy/conf/cluster_config.json rename to cli/deploy/conf/cluster_config.json diff --git a/deploy/conf/worker_config.json b/cli/deploy/conf/worker_config.json similarity index 100% rename from deploy/conf/worker_config.json rename to cli/deploy/conf/worker_config.json diff --git a/cmake/external_libs/jemalloc.cmake b/cmake/external_libs/jemalloc.cmake index 97cc370526d2c22947a83c66404c0974b368551b..9112aeeed12b73fc16b5bb5e24b4fe3fad66dca2 100644 --- a/cmake/external_libs/jemalloc.cmake +++ b/cmake/external_libs/jemalloc.cmake @@ -17,6 +17,11 @@ set(jemalloc_CONF_OPTIONS --disable-initial-exec-tls --with-jemalloc-prefix=datasystem_) +if (DEFINED ENV{DS_JEMALLOC_LG_PAGE}) + message(STATUS "jemalloc custom page size=2^$ENV{DS_JEMALLOC_LG_PAGE}") + list(APPEND jemalloc_CONF_OPTIONS --with-lg-page=$ENV{DS_JEMALLOC_LG_PAGE}) +endif() + set(jemalloc_C_FLAGS ${THIRDPARTY_SAFE_FLAGS}) set(jemalloc_LINK_FLAGS "-Wl,-z,now") @@ -47,6 +52,10 @@ set(JemallocShared_CONF_OPTIONS --disable-cxx --enable-stats) +if (DEFINED ENV{DS_JEMALLOC_LG_PAGE}) + list(APPEND JemallocShared_CONF_OPTIONS --with-lg-page=$ENV{DS_JEMALLOC_LG_PAGE}) +endif() + if (SUPPORT_JEPROF) message(STATUS "Support jemalloc memory profiling.") add_compile_definitions(SUPPORT_JEPROF) diff --git a/cmake/scripts/PackageDatasystem.cmake.in b/cmake/scripts/PackageDatasystem.cmake.in index 6cc77a1889d80fc0b815a9f0a21dfcd0ed842734..d70a5f8b8a4f694d00eb0ce6dc0666d1aaefc06b 100644 --- a/cmake/scripts/PackageDatasystem.cmake.in +++ b/cmake/scripts/PackageDatasystem.cmake.in @@ -23,6 +23,12 @@ foreach(FILE ${FILE_LIST}) file(APPEND "${DATASYSTEM_WHEEL_PATH}/sdk_lib_list" "${FILENAME}\n") endforeach() +find_program(CMAKE_STRIP NAMES strip) +file(GLOB SO_FILES "${DATASYSTEM_WHEEL_PATH}/lib/*.so*") +foreach(SO_FILE ${SO_FILES}) + execute_process(COMMAND ${CMAKE_STRIP} ${SO_FILE}) +endforeach() + # Run 
python setup.py bdist_whell to generate origin wheel file. execute_process(COMMAND ${Python3_EXECUTABLE} setup.py bdist_wheel WORKING_DIRECTORY ${DATASYSTEM_SETUP_PATH}) diff --git a/cmake/scripts/PackagePython.cmake.in b/cmake/scripts/PackagePython.cmake.in index 87ba161ec2d0776827bbe96e22e12bd994cc206d..a27d5d617ee428fa6e1dc9690722e26b6d6f21a6 100644 --- a/cmake/scripts/PackagePython.cmake.in +++ b/cmake/scripts/PackagePython.cmake.in @@ -15,6 +15,7 @@ message("cmake install prefix: ${CMAKE_INSTALL_PREFIX}") file(COPY ${PYTHON_LIBPATH}/ DESTINATION ${PYTHON_PACKAGE_LIBPATH}/lib REGEX ".*sym$" EXCLUDE) + file(RENAME ${PYTHON_PACKAGE_LIBPATH}/setup.py ${PYTHON_PACKAGE_PATH}/setup.py) # Run python setup.py bdist_whell to generate origin wheel file. execute_process(COMMAND ${Python3_EXECUTABLE} setup.py bdist_wheel diff --git a/cmake/util.cmake b/cmake/util.cmake index 51272048946854993530b6b305aca05bc55b1391..2dcd2aebdb9a168284d48747325e2355229c34ec 100644 --- a/cmake/util.cmake +++ b/cmake/util.cmake @@ -778,7 +778,7 @@ function(PACKAGE_DATASYSTEM_WHEEL PACKAGE_NAME) install(DIRECTORY ${CMAKE_SOURCE_DIR}/example/cpp_template DESTINATION ${DATASYSTEM_WHEEL_PATH}) # Copy worker and worker_config to package lib path - install(FILES ${CMAKE_INSTALL_PREFIX}/service/datasystem_worker ${CMAKE_SOURCE_DIR}/deploy/conf/worker_config.json ${CMAKE_SOURCE_DIR}/deploy/conf/cluster_config.json + install(FILES ${CMAKE_INSTALL_PREFIX}/service/datasystem_worker ${CMAKE_SOURCE_DIR}/cli/deploy/conf/worker_config.json ${CMAKE_SOURCE_DIR}/cli/deploy/conf/cluster_config.json DESTINATION ${DATASYSTEM_WHEEL_PATH}) find_package(Python3 COMPONENTS Interpreter Development) diff --git a/deploy/codegen.py b/deploy/codegen.py deleted file mode 100644 index 4f289717e2308975ea03b749fd28b57292e0e071..0000000000000000000000000000000000000000 --- a/deploy/codegen.py +++ /dev/null @@ -1,487 +0,0 @@ -#!/usr/bin/env python -# coding=utf-8 -# Copyright (c) Huawei Technologies Co., Ltd. 2022. 
All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -""" -Module manager -""" - -from __future__ import absolute_import -from __future__ import print_function -from __future__ import with_statement - -import errno -import getopt -import json -import re -import os -import stat -import sys -from shutil import copyfile -from shutil import rmtree - - -class Conf: - """ - Configuration item structure. - - Attributes: - name: gflags name. - env: environment variables for conf files. - default: environment variables default value. - meaning: environment variables description. - """ - - def __init__(self, prefix, name, default, meaning, flag_prefix, flag_hide): - self.prefix = prefix - self.name = name - self.default = default - self.meaning = meaning - self.env = self.name.upper() - if not self.env.startswith(prefix.upper()): - self.env = prefix.upper() + '_' + self.env - self.flag_prefix = flag_prefix - self.flag_hide = flag_hide - - def __str__(self): - return "Conf name: [{}], default: [{}], meaning: [{}], env: [{}]".format(self.name, self.default, self.meaning, - self.env) - - -class ConfParser: - """ - Parse component configuration file and format to Conf list. - - Attributes: - component: worker. - dir_path: configuration files saved directory. - """ - - def __init__(self, component, dir_path): - self.component = component - self.dir_path = dir_path - - @staticmethod - def load(json_file): - """ - Load json file. - - Args: - json_file: Json file path. 
- - Returns: - A json object - - Raise: - IOError: json_file not exist. - """ - with open(json_file, 'r') as file: - json_obj = json.load(file) - return json_obj - - def parse(self): - """ - Parse deploy configuration json file. - - Returns: - A conf list that contains configuration items. - - Raise: - IOError: json_file not exist. - """ - component_json = ConfParser.load(os.path.join(self.dir_path, '{}-env.json'.format(self.component))) - flag_items = component_json['common'] - conf_list = [] - for flag_item in flag_items: - flag_item.setdefault('prefix', '') - flag_item.setdefault('hide_flag', '') - conf = Conf(self.component, flag_item['flag'], flag_item['default'], flag_item['description'], - flag_item['prefix'], flag_item['hide_flag']) - conf_list.append(conf) - return conf_list - - -def gen_conf_file(conf_list, output_file): - """ - Generate the configure file via configure item list. - - Args: - conf_list: Configuration item list, used for generate the output file. - output_file: Output file path. - - Returns: - None - - Raise: - IOError: output_file not exist. - OSError: create directory failed. - """ - # if output_file parent not exist, we need to create it first. - if not os.path.exists(os.path.dirname(output_file)): - try: - os.makedirs(os.path.dirname(output_file)) - # Guard against race condition - except OSError as exc: - if exc.errno != errno.EEXIST: - print('Create directory {} failed: {}'.format(os.path.dirname(output_file), exc)) - raise - - with os.fdopen(os.open(output_file, os.O_RDWR | os.O_CREAT, stat.S_IRUSR | stat.S_IWUSR), "w") as conf_file: - # 1. write the header - header = '#!/bin/bash\n' \ - '# Copyright (c) Huawei Technologies Co., Ltd. 2022. 
All rights reserved.\n' \ - '#\n' \ - '# Licensed under the Apache License, Version 2.0 (the "License");\n' \ - '# you may not use this file except in compliance with the License.\n' \ - '# You may obtain a copy of the License at\n' \ - '#\n' \ - '# http://www.apache.org/licenses/LICENSE-2.0\n' \ - '#\n' \ - '# Unless required by applicable law or agreed to in writing, software\n' \ - '# distributed under the License is distributed on an "AS IS" BASIS,\n' \ - '# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n' \ - '# See the License for the specific language governing permissions and\n' \ - '# limitations under the License.\n' \ - '#\n' \ - '# Edit this file to configure startup parameters, it is sourced to launch components.\n' - conf_file.write(header) - - for conf in conf_list: - # 2. write the description. - conf_file.write('\n') - conf_file.write('# {} (Default: "{}")\n'.format(conf.meaning, conf.default)) - # 3. write the environment variables. - conf_file.write('# {}="{}"\n'.format(conf.env, conf.default)) - - -def gen_launcher_file(comp, template_file, output_file): - """ - Generate component-launcher.sh for component launch to local or remote hosts. - - Args: - comp: component - template_file: template file to generate the component-launcher.sh - output_file: output file - - Returns: - None. - - Raise: - IOError: output_file not exist. - OSError: create directory failed. - """ - # if output_file parent not exist, we need to create it first. 
- if not os.path.exists(os.path.dirname(output_file)): - try: - os.makedirs(os.path.dirname(output_file)) - # Guard against race condition - except OSError as exc: - if exc.errno != errno.EEXIST: - print('Create directory {} failed: {}'.format(os.path.dirname(output_file), exc)) - raise - - if comp == 'clusterfs': - comp_upper = 'CLUSTERFS_WORKER' - else: - comp_upper = comp.upper() - - launcher_conf = '${BASE_DIR}/conf' - - subs = { - 'component': comp, - 'COMPONENT': comp_upper, - 'launch-conf': launcher_conf - } - - with open(template_file, "r") as fin: - with os.fdopen(os.open(output_file, os.O_RDWR | os.O_CREAT, stat.S_IRUSR | stat.S_IWUSR), "w") as fout: - for line in fin: - fout.write(re.sub('@([^\\s]*?)@', from_dict(subs), line)) - - -def from_dict(dct): - """ - Look up the match key and return its value. - - Args: - dct: Dict - - Returns: - The match value. - """ - - def lookup(match): - key = match.group(1) - return dct.get(key, '') - - return lookup - -STOP_FUNC_STR = 'function stop_one_{0}()\n' \ - '{{\n' \ - ' is_array "{1}_ADDRESS" && {0}_address="${{{1}_ADDRESS[$COMPONENT_NUM]}}" ||' \ - ' {0}_address="${{{1}_ADDRESS}}"\n' \ - ' local pid="$(ps ux | grep /datasystem_worker | grep {0}_address=${{{0}_address}} | grep -v grep' \ - ' | awk \'{{print $2}}\')"\n' \ - ' if ! 
is_num "${{pid}}"; then\n' \ - ' echo -e "Cannot found the {0} we want: ${{{0}_address}}" >&2\n' \ - ' exit 0\n' \ - ' fi\n' \ - ' kill -15 "$pid"\n' \ - ' local timeout=1800\n' \ - ' local start_time=$(date +%s)\n' \ - ' while kill -0 $pid; do \n' \ - ' if [[ $(($(date +%s) - start_time)) -ge $timeout ]]; then\n' \ - ' echo "Termination timeout (30 minutes) reached - killing process."\n' \ - ' kill -9 $pid\n' \ - ' break\n' \ - ' fi\n' \ - ' sleep 0.5\n' \ - ' done\n' \ - '}}\n' - -DEPLOY_FUNC_STR = 'function deploy_{0}()\n' \ - '{{\n' \ - ' if [ x"${{ACTION}}" = "xstart" ]; then\n' \ - ' "${{LAUNCHER[@]}}" "${{BASE_DIR}}/{0}-launcher.sh" "${{BASE_DIR}}/deploy-datasystem.sh"' \ - ' "-a" "start" "-c" "{0}" "-p" "${{CONF_DIR}}"\n' \ - ' else\n' \ - ' "${{LAUNCHER[@]}}" "${{BASE_DIR}}/{0}-launcher.sh" "${{BASE_DIR}}/deploy-datasystem.sh"' \ - ' "-a" "stop" "-c" "{0}" "-p" "${{CONF_DIR}}"\n' \ - ' fi\n' \ - '}}\n' - -DEPLOY_ONE_FUNC_STR = 'function deploy_one_{0}()\n' \ - '{{\n' \ - ' . "${{CONF_DIR}}/{0}-env.sh"\n' \ - ' if [ x"${{ACTION}}" = "xstart" ]; then\n' \ - ' start_one_{0}\n' \ - ' else\n' \ - ' stop_one_{0}\n' \ - ' fi\n' \ - '}}\n' - -STOP_CLUSTER_STR = 'function stop_one_clusterfs()\n' \ - '{\n' \ - ' is_array "CLUSTERFS_WORKER_ADDRESS" &&' \ - ' cluster_address="${CLUSTERFS_WORKER_ADDRESS[$COMPONENT_NUM]}" ||' \ - ' cluster_address="${CLUSTERFS_WORKER_ADDRESS}"\n' \ - ' pid="$(ps ux | grep bin/clusterfs | grep ${cluster_address} ' \ - '| grep -v grep | awk \'{print $2}\')"\n' \ - ' ps ux | grep bin/clusterfs | grep worker_address=${cluster_address} ' \ - '| grep -v grep | awk \'{print $NF}\' | xargs fusermount3 -u\n' \ - ' while [[ -n $(ps -p "$pid" | grep "$pid") ]]; do sleep 0.5; done\n' \ - '}\n' - - -def gen_start_function(comp, conf_list, func_list): - """ - Generate start_one_component function for component - - Args: - comp: component - conf_list: configuration items list - func_list: function list to be append - - Returns: - None. 
- """ - indent = ' ' * 2 - func_list.append('') - func_list.append('function start_one_{}()'.format(comp)) - func_list.append('{') - func_list.append('{}local argv_list'.format(indent)) - gen_arg_list(comp, conf_list, func_list) - func_list.append('{}export LD_LIBRARY_PATH="${{BIN_DIR}}/lib:$LD_LIBRARY_PATH"'.format(indent)) - func_list.append( - f'{indent}(nohup "${{BIN_DIR}}/datasystem_worker" "${{argv_list[@]}}" ' - f'>${{BASE_DIR}}/{comp}.out 2>&1) &' - ) - func_list.append('{}local pid=$!'.format(indent)) - func_list.append('{}sleep 5'.format(indent)) - func_list.append('{}[[ -n $(ps -p "$pid" | grep "$pid") ]] && ps -p "$pid" -o args || ret_code=1'.format(indent)) - func_list.append('{}if [[ $ret_code -ne 0 ]]; then'.format(indent)) - func_list.append('{}cat ${{BASE_DIR}}/{}.out'.format(indent * 2, comp)) - func_list.append('{}fi'.format(indent)) - func_list.append('{}return $ret_code'.format(indent)) - func_list.append('}') - func_list.append('') - if comp == 'clusterfs': - func_list.append(STOP_CLUSTER_STR) - else: - func_list.append(STOP_FUNC_STR.format(comp, comp.upper())) - func_list.append(DEPLOY_ONE_FUNC_STR.format(comp)) - func_list.append(DEPLOY_FUNC_STR.format(comp)) - - -def gen_arg_list(comp, conf_list, func_list): - """ - Generate argument list for component - - Args: - comp: component - conf_list: configuration items list - func_list: function list to be append - - Returns: - None. 
- """ - indent = ' ' * 2 - for conf in conf_list: - func_list.append('{0}is_array "{1}" && {2}="${{{1}[$COMPONENT_NUM]}}" ' - '|| {2}="${{{1}}}"'.format(indent, conf.env, conf.name)) - if comp != "clusterfs": - flag = '-{}'.format(conf.name) - func_list.append( - '{0}[[ -n "${{{1}}}" ]] && argv_list+=("{2}=${{{1}}}")'.format(indent, conf.name, flag)) - continue - - if conf.flag_hide: - func_list.append( - '{0}[[ -n "${{{1}}}" ]] && argv_list+=("${{{1}}}")'.format(indent, conf.name)) - continue - - if conf.flag_prefix == '': - flag = '--{}'.format(conf.name) - func_list.append( - '{0}[[ -n "${{{1}}}" ]] && argv_list+=("{2}=${{{1}}}")'.format(indent, conf.name, flag)) - continue - - func_list.append( - '{0}[[ -n "${{{1}}}" ]] && argv_list+=("{2}" "{3}=${{{1}}}")'.format(indent, conf.name, - conf.flag_prefix, - conf.name)) - if comp == 'clusterfs': - func_list.append(' mkdir -p "${CLUSTERFS_MOUNT_DIR}"') - func_list.append(' chmod -R 700 "${CLUSTERFS_MOUNT_DIR}"') - - -def gen_shell_script(comp_dict, template_dir, output_dir): - """ - Generate shell scripts via configure item list. - - Args: - comp_dict: master, worker or gcs - template_dir: Configuration item list, used for generate the output file. - output_dir: template file for generate output files. - - Returns: - None - - Raise: - IOError: output_file not exist. - OSError: create directory failed. 
- """ - - func_list = [] - deploy_list = [] - path_list = [] - for comp, conf_list in comp_dict.items(): - gen_start_function(comp, conf_list, func_list) - if comp not in ['master', 'worker']: - if comp == 'clusterfs': - deploy_list.append(' if [ "x${HANDLE_CLUSTERFS}" = "xYes" ]; then') - deploy_list.append(' deploy_{}'.format(comp)) - deploy_list.append(' fi') - else: - deploy_list.append(' deploy_{}'.format(comp)) - - path_list.append('export BIN_DIR="$(realpath "${BASE_DIR}/..")"') - path_list.append('export CONF_DIR=$(realpath "${BASE_DIR}/conf")') - - subs = { - 'function': '\n'.join(func_list), - 'deploy': '\n'.join(deploy_list), - 'path': '\n '.join(path_list) - } - - if not os.path.exists(output_dir): - try: - os.makedirs(output_dir) - # Guard against race condition - except OSError as exc: - if exc.errno != errno.EEXIST: - print('Create directory {} failed: {}'.format(output_dir, exc)) - raise - - with open(os.path.join(template_dir, "deploy-datasystem.sh.template"), "r") as fin, \ - os.fdopen(os.open(os.path.join(output_dir, "deploy-datasystem.sh"), os.O_RDWR | os.O_CREAT, - stat.S_IRUSR | stat.S_IWUSR), "w") as fout: - for line in fin: - if line == " deploy_master\n": - continue - fout.write(re.sub('@([^\\s]*?)@', from_dict(subs), line)) - copyfile(os.path.join(template_dir, 'deploy-common.sh.template'), os.path.join(output_dir, 'deploy-common.sh')) - - -def main(argv): - """conf_list - Main to start the deploy script. - - Args: - argv: external input arguments. - - Returns: - 0 on success, otherwise return 1. 
- """ - try: - opts, _ = getopt.getopt(argv, 'h:t:i:o:s:c:', - ['help', 'temp_in=', 'conf_in=', 'conf_out=', 'script_out=', 'comps=']) - except getopt.GetoptError as exception: - print('Parse input arguments error: {}'.format(exception)) - sys.exit(1) - - temp_in = '' - conf_in = '' - conf_out = '' - script_out = '' - comps = [] - for opt, arg in opts: - if opt in ('-h', '--help'): - print("helpful usage") - elif opt in ('-t', '--temp_in'): - temp_in = arg - elif opt in ('-i', '--conf_in'): - conf_in = arg - elif opt in ('-o', '--conf_out'): - conf_out = arg - elif opt in ('-s', '--script_out'): - script_out = arg - elif opt in ('-c', '--comps'): - comps = arg.split(',') - - comp_dict = {} - - if os.path.isdir(conf_out): - rmtree(conf_out) - - if os.path.isdir(script_out): - rmtree(script_out) - - for component in comps: - # 1. parse json files. - conf_list = ConfParser(component, conf_in).parse() - comp_dict[component] = conf_list - # 2. generate conf files. - gen_conf_file(conf_list, os.path.join(conf_out, '{}-env.sh'.format(component))) - # 3. generate launcher files. - gen_launcher_file(component, os.path.join(temp_in, "component-launcher.sh.template"), - os.path.join(script_out, '{}-launcher.sh'.format(component))) - - # 4. generate deploy-datasystem.sh - gen_shell_script(comp_dict, temp_in, script_out) - - -if __name__ == "__main__": - main(sys.argv[1:]) diff --git a/deploy/conf/worker-env.json b/deploy/conf/worker-env.json deleted file mode 100644 index d455ebb5d7d8a1ab0cf78501c9f271749806b78c..0000000000000000000000000000000000000000 --- a/deploy/conf/worker-env.json +++ /dev/null @@ -1,539 +0,0 @@ -{ - "common": [ - { - "flag": "worker_address", - "default": "127.0.0.1:9088", - "description": "Address of worker and the value cannot be empty. Multiple nodes can be configured, such as (\"127.0.0.1:18482\" \"127.0.0.2:18482\")." 
- }, - { - "flag": "etcd_address", - "default": "", - "description": "Address of ETCD server, (such as \"192.168.0.1:10001,192.168.0.2:10001,192.168.0.3:10001\")." - }, - { - "flag": "master_address", - "default": "", - "description": "Address of master and the value cannot be empty." - }, - { - "flag": "shared_memory_size_mb", - "default": "1024", - "description": "Upper limit of the shared memory, the unit is mb, must be greater than 0." - }, - { - "flag": "oc_shm_threshold_percentage", - "default": "100", - "description": "Upper limit of the shared memory in percentage can be used by OC, must be within (0, 100]" - }, - { - "flag": "shared_disk_directory", - "default": "", - "description": "Disk cache data placement directory, default value is empty, indicating that disk cache is not enabled." - }, - { - "flag": "shared_disk_size_mb", - "default": "0", - "description": "The total size of disk cache data, the unit is MB." - }, - { - "flag": "shared_disk_arena_per_tenant", - "default": "8", - "description": "The number of disk cache Arena." - }, - { - "flag": "heartbeat_interval_ms", - "default": "1000", - "description": "Time interval between worker and etcd heartbeats." - }, - { - "flag": "authorization_enable", - "default": "false", - "description": "Indicates whether to enable the tenant authentication, default is false." - }, - { - "flag": "ipc_through_shared_memory", - "default": "true", - "description": "Using shared memory to exchange data between client and worker. if this parameter is set to be true, client and worker will pass control messages through uds; Otherwise, they pass control messages through tcp/ip, and exchange data through tcp/ip." - }, - { - "flag": "unix_domain_socket_dir", - "default": "~/.datasystem/unix_domain_socket_dir", - "description": "The directory to store unix domain socket file. The UDS generates temporary files in this path. 
Max lenth: 80" - }, - { - "flag": "etcd_table_prefix", - "default": "", - "description": "Prefix of all tables in etcd, which is used to distinguish tables created by different data systems in etcd." - }, - { - "flag": "other_az_names", - "default": "", - "description": "Specify other az names using the same etcd. Split by ','" - }, - { - "flag": "oc_io_from_l2cache_need_metadata", - "default": "true", - "description": "Control whether data read and write from the L2 cache daemon depend on metadata. Note: If set to false, it indicates that the metadata is not stored in etcd." - }, - { - "flag": "enable_distributed_master", - "default": "true", - "description": "Whether to support distributed master, default is true." - }, - { - "flag": "oc_worker_worker_direct_port", - "default": "0", - "description": "A direct tcp/ip port for worker to workers scenarios to improve latency. Acceptable values:0, or some positive integer. 0 means disabled." - }, - { - "flag": "oc_worker_worker_pool_size", - "default": "3", - "description" : "Number of parallel connections between worker/worker oc service. Flag oc_worker_worker_direct_port must be enabled to take effect." - }, - { - "flag": "payload_nocopy_threshold", - "default" : "104857600", - "description" : "minimum payload size to trigger no memory copy" - }, - { - "flag": "oc_thread_num", - "default": "32", - "description": "The number of worker service for object cache." - }, - { - "flag": "eviction_thread_num", - "default": "1", - "description": "Thread number of eviction for object cache." - }, - { - "flag": "eviction_reserve_mem_threshold_mb", - "default": "10240", - "description": "The reserved memory (MB) is determined by min(shared_memory_size_mb*0.2, eviction_reserve_mem_threshold_mb). Eviction begins when memory drops below this threshold.The valid range is 100-102400." - }, - { - "flag": "client_reconnect_wait_s", - "default": "5", - "description": "Client reconnect wait seconds, default is 5." 
- }, - { - "flag": "etcd_meta_pool_size", - "default": "8", - "description": "ETCD metadata async pool size." - }, - { - "flag": "spill_directory", - "default": "", - "description": "The path and file name prefix of the spilling, empty means spill disabled." - }, - { - "flag": "spill_size_limit", - "default": 0, - "description": "Maximum amount of spilled data that can be stored in the spill directory. If spill is enable and spill_size_limit is 0, spill_size_limit will be set to 95% of the spill directory." - }, - { - "flag": "spill_thread_num", - "default": "8", - "description": "It represents the maximum parallelism of writing files, more threads will consume more CPU and I/O resources." - }, - { - "flag": "spill_file_max_size_mb", - "default": "200", - "description": "The size limit of single spill file, spilling objects which lager than that value with one object per file. If there are some big objects, you can increase this value to avoid run out of inodes quickly. The valid range is 200-10240." - }, - { - "flag": "spill_file_open_limit", - "default": "512", - "description": "The maximum number of open file descriptors about spill. If opened file exceed this value, some files will be temporarily closed to prevent exceeding the maximum system limit. You need reduce this value if your system resources are limited. The valid range is greater than or equal to 8." - }, - { - "flag": "spill_enable_readahead", - "default": "true", - "description": "Disable readahead can mitigate the read amplification problem for offset read, default is true" - }, - { - "flag": "log_monitor", - "default": "true", - "description": "Record performance and resource logs." - }, - { - "flag": "log_monitor_exporter", - "default": "harddisk", - "description": "Specify the type of exporter, either harddisk or backend. Only takes effect when log_monitor is true." 
- }, - { - "flag": "log_monitor_interval_ms", - "default": "10000", - "description": "The sleep time between iterations of observability collector scan." - }, - { - "flag": "minloglevel", - "default": "0", - "description": "Log messages below this level will not actually be recorded anywhere." - }, - { - "flag": "system_access_key", - "default": "", - "description": "The access key for system component AK/SK authentication." - }, - { - "flag": "system_secret_key", - "default": "", - "description": "The secret key for system component AK/SK authentication." - }, - { - "flag": "system_data_key", - "default": "", - "description": "The data key for system to decrypt sensitive data. The length of encrypted datakey should be 32" - }, - { - "flag": "tenant_access_key", - "default": "", - "description": "The access key for tenant AK/SK authentication." - }, - { - "flag": "tenant_secret_key", - "default": "", - "description": "The secret key for tenant AK/SK authentication." - }, - { - "flag": "request_expire_time_s", - "default": "300", - "description": "When AK/SK authentication is used, if the duration from the client to the server is longer than this parameter, the authentication fails and the service is denied." - }, - { - "flag": "max_client_num", - "default": "200", - "description": "Maximum number of clients that can be connected to a worker. Value range: [1, 10000]." - }, - { - "flag": "l2_cache_type", - "default": "none", - "description": "Config the l2 cache type, obs. Optional value: 'obs', 'none'" - }, - { - "flag": "cache_rpc_session", - "default": "true", - "description": "Deprecated: This flag is deprecated and will be removed in future releases." - }, - { - "flag": "backend_store_dir", - "default": "~/.datasystem/rocksdb", - "description": "Config MASTER back store directory and must specify in rocksdb scenario. 
The rocksdb database is used to persistently store the metadata stored in the master so that the metadata before the restart can be re-obtained when the master restarts." - }, - { - "flag": "rocksdb_sync_write", - "default": "false", - "description": "Controls whether rocksdb sets sync to true when writing data." - }, - { - "flag": "rocksdb_max_open_file", - "default": "128", - "description": "Number of open files that can be used by the rocksdb" - }, - { - "flag": "rocksdb_background_threads", - "default": "16", - "description": "Number of background threads rocksdb can use for flushing and compacting." - }, - { - "flag": "node_timeout_s", - "default": "60", - "description": "Maximum time interval before a node is considered lost." - }, - { - "flag": "node_dead_timeout_s", - "default": "300", - "description": "Maximum time interval for the etcd to determine node death." - }, - { - "flag": "arena_per_tenant", - "default": "16", - "description": "The arena count for each tenant. Multiple arenas can improve the performance of share memory allocation for the first time, but each arena will use one more fd." - }, - { - "flag": "memory_reclamation_time_second", - "default": "600", - "description": "The memory reclamation time after free." - }, - { - "flag": "add_node_wait_time_s", - "default": "60", - "description": "Time to wait for the first node that wants to join a working hash ring." - }, - { - "flag": "auto_del_dead_node", - "default": "true", - "description": "Decide whether to remove the node from hash ring or not when node is dead" - }, - { - "flag": "enable_hash_ring_self_healing", - "default": "false", - "description": "Whether to support self-healing when the hash ring is in an abnormal state, default is false." - }, - { - "flag": "client_dead_timeout_s", - "default": "120", - "description": "Maximum time interval for the worker to determine client death, Value range: [15, UINT64_MAX)." 
- }, - { - "flag": "monitor_config_file", - "default": "~/.datasystem/config/datasystem.config", - "description": "Configure the path of the worker monitoring configuration file. During the execution of the worker, it periodically monitors whether the configuration file is changed to update the flag parameter value." - }, - { - "flag": "loglevel_only_for_workers", - "default": "", - "description": "The log level can be configured to take effect only for specified workers, such as (-loglevel_only_for_workers=192.168.0.1,192.168.0.2). If this parameter is left blank, the parameter takes effect for all workers." - }, - { - "flag": "enable_thp", - "default": "false", - "description": "Control this process by enabling transparent huge pages, default is disabled. Enable Transparent Huge Pages (THP) can enhance performance and reduce page table overhead, but it may also lead to increased memory usage" - }, - { - "flag": "cross_az_get_data_from_worker", - "default": "true", - "description": "Control whether try to get data from other AZ's worker firstly, if false then get data from L2 cache directly." - }, - { - "flag": "cross_az_get_meta_from_worker", - "default": "false", - "description": "Control whether get meta data from other AZ's worker, if false then get meta data from local AZ." - }, - { - "flag": "enable_reconciliation", - "default": "true", - "description": "Whether to enable reconciliation, default is true." - }, - { - "flag": "check_async_queue_empty_time_s", - "default": "1", - "description": "The worker ensures a certain period of time that the asynchronous queues for sending messages to ETCD and L2 cache remain empty before it can exit properly." - }, - { - "flag": "obs_access_key", - "default": "", - "description": "Pass in access key for authentication when connecting to OBS." - }, - { - "flag": "obs_secret_key", - "default": "", - "description": "Pass in secret key for authentication when connecting to OBS." 
- }, - { - "flag": "obs_endpoint", - "default": "", - "description": "OBS address." - }, - { - "flag": "obs_bucket", - "default": "", - "description": "Name of OBS bucket to use. Can use only one bucket" - }, - { - "flag": "obs_https_enabled", - "default": "false", - "description": "Whether to enable the https in obs. false: use HTTP (default), true: use HTTPS" - }, - { - "flag": "sfs_path", - "default": "", - "description": "The path to the mounted SFS." - }, - { - "flag": "health_check_path", - "default": "~/.datasystem/probe/healthy", - "description": "File will create after the worker successfully." - }, - { - "flag": "rpc_thread_num", - "default": "16", - "description": "Config rpc server thread number, must be great equal than 0." - }, - { - "flag": "log_dir", - "default": "~/.datasystem/logs", - "description": "The directory where log files are stored." - }, - { - "flag": "log_filename", - "default": "", - "description": "Prefix of log filename, default is program invocation short name. Use standard characters only." - }, - { - "flag": "log_async_queue_size", - "default": "65536", - "description": "Size of async logger's message queue." - }, - { - "flag": "max_log_size", - "default": "400", - "description": "Maximum log file size (in MB), must be greater than 0." - }, - { - "flag": "max_log_file_num", - "default": "5", - "description": "Maximum number of log files to retain per severity level. And every log file size is limited by max_log_size." - }, - { - "flag": "log_retention_day", - "default": "0", - "description": "If log_retention_day is greater than 0, any log file from your project whose last modified time is greater than log_retention_day days will be unlinked. If log_retention_day is equal 0, will not unlink log file by time." - }, - { - "flag": "log_async", - "default": "true", - "description": "Flush log files with async mode." 
- }, - { - "flag": "logbufsecs", - "default": "10", - "description": "Buffer log messages for at most this many seconds." - }, - { - "flag": "log_compress", - "default": "true", - "description": "Compress old log files in .gz format. This parameter takes effect only when the size of the generated log is greater than max log size." - }, - { - "flag": "v", - "default": "0", - "description": "vlog level." - }, - { - "flag": "enable_component_auth", - "default": "false", - "description": "Whether to enable the authentication function between components(worker, master)." - }, - { - "flag": "zmq_server_io_context", - "default": "5", - "description": "Optimize the performance of the customer. Default server 5. The higher the throughput, the higher the value, but should be in range [1, 32]." - }, - { - "flag": "zmq_client_io_context", - "default": "5", - "description": "Optimize the performance of a client stub. Default value 5. The higher the throughput, the higher the value, but should be in range [1, 32]." - }, - { - "flag": "zmq_chunk_sz", - "default": "1048576", - "description": "Parallel payload split chunk size. Default to 1048756 bytes." - }, - { - "flag": "curve_key_dir", - "default": "", - "description": "The directory to find ZMQ curve key files. This path must be specified when zmq authentication is enabled." - }, - { - "flag": "encrypt_kit", - "default": "plaintext", - "description": "choose the type of encrypt. Support plaintext, default is plaintext." - }, - { - "flag": "enable_etcd_auth", - "default": "false", - "description": "Whether to enable ETCD auth, default is false. ETCD certificate will be obtained by sts. If you want to enable etcd auth, configure encrypt_kit is sts at the same time." - }, - { - "flag": "etcd_target_name_override", - "default": "", - "description": "Set etcd target name override for SSL host name checking, default is none. 
The configuration value should be consistent with the DNS content of the Subject Alternate Names of the TLS certificate." - }, - { - "flag": "etcd_ca", - "default": "", - "description": "The path of encrypted root etcd certificate, default is none." - }, - { - "flag": "etcd_cert", - "default": "", - "description": "The path of encrypted client's etcd certificate chain, default is none." - }, - { - "flag": "etcd_key", - "default": "", - "description": "The path of encrypted client's etcd private key, default is none." - }, - { - "flag": "etcd_passphrase_path", - "default": "", - "description": "The path of passphrase for the encrypted etcd encypted private key, default is none." - }, - { - "flag": "enable_urma", - "default" : "false", - "description" : "Option to switch between RPC (ZMQ) and RDMA for data transfer, default false to run with RPC." - }, - { - "flag": "urma_poll_size", - "default" : "8", - "description" : "Number of complete record to poll at a time, 16 is the max this device can poll" - }, - { - "flag": "urma_register_whole_arena", - "default" : "true", - "description" : "Register the whole arena as segment during init, otherwise, register each object as a segment." - }, - { - "flag": "urma_connection_size", - "default" : "16", - "description" : "Number of jfs and jfr pair" - }, - { - "flag": "urma_event_mode", - "default" : "false", - "description" : "Uses interrupt mode to poll completion events." - }, - { - "flag": "enable_worker_worker_batch_get", - "default" : "false", - "description" : "Enable worker->worker OC batch get, default false." - }, - { - "flag": "batch_get_threshold_mb", - "default" : "100", - "description": "The payload threshold to batch get objects, the unit is mb, must be greater than 0. Setting to 0 will indicate no split." 
- }, - { - "flag": "max_rpc_session_num", - "default": "2048", - "description": "Maximum number of sessions that can be cached, must be within [512, 10'000]" - }, - { - "flag": "enable_stream_data_verification", - "default": "false", - "description": "Option to turn on data verification to verify data from a producer is received in order." - }, - { - "flag": "enable_huge_tlb", - "default": "false", - "description": "This is controlled by the flag of mmap(MAP_HUGETLB) which can improve memory access and reducing the overhead of page table, default is disable." - }, - { - "flag": "enable_fallocate", - "default": "true", - "description": "Due to k8s's resource calculation policies,shared memory is sometimes multiple counted , which can lead to client crashes, Using fallocate to address this issue." - }, - { - "flag": "oc_shm_transfer_threshold_kb", - "default": "500", - "description": "The data threshold to transfer obj data between client and worker via shm, unit is KB." - }, - { - "flag": "enable_lossless_data_exit_mode", - "default": "false", - "description": "Migrate data to another node before the node exits if other nodes are available. If this is the only node in the cluster, exits directly and the data will be lost." - }, - { - "flag": "rolling_update_timeout_s", - "default": "1800", - "description": "Maximum duration of the rolling upgrade, default value is 1800 seconds." - }, - { - "flag": "enable_p2p_transfer", - "default": "false", - "description": "Heterogeneous object transfer protocol Enables p2ptransfer." 
- } - ] -} diff --git a/deploy/template/component-launcher.sh.template b/deploy/template/component-launcher.sh.template deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/docs/source_zh_cn/getting-started/deploy.md b/docs/source_zh_cn/getting-started/deploy.md index 7ffb05fe8bb54b79e276e9e7ca22ad5dff23444b..9928fd406aab4384293048a1d861dacdcc47869b 100644 --- a/docs/source_zh_cn/getting-started/deploy.md +++ b/docs/source_zh_cn/getting-started/deploy.md @@ -1,7 +1,7 @@ -# 部署yr-datasystem +# 部署openYuanrong datasystem -- [yr-datasystem进程部署](#yr-datasystem进程部署) +- [openYuanrong datasystem进程部署](#openyuanrong-datasystem进程部署) - [部署环境准备](#部署环境准备) - [集群部署](#集群部署) - [单机部署](#单机部署) @@ -10,7 +10,7 @@ - [集群卸载](#集群卸载) - [单机卸载](#单机卸载) - [多机卸载](#多机卸载) -- [yr-datasystem Kubernetes部署](#yr-datasystem-kubernetes部署) +- [openYuanrong datasystem Kubernetes部署](#openyuanrong-datasystem-kubernetes部署) - [部署环境准备](#部署环境准备-1) - [集群部署](#集群部署-1) - [快速验证](#快速验证-1) @@ -20,19 +20,19 @@ [![查看源文件](https://Mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](deploy.md) -本文档介绍如何将yr-datasystem通过裸进程或者Kubernetes的方式进行部署。 +本文档介绍如何将openYuanrong datasystem通过裸进程或者Kubernetes的方式进行部署。 -## yr-datasystem进程部署 +## openYuanrong datasystem进程部署 ### 部署环境准备 -yr-datasystem进程部署所需的系统环境依赖如下: +openYuanrong datasystem进程部署所需的系统环境依赖如下: |软件名称|版本|作用| |-------|----|----| -|EulerOS 2.8/openEuler 20.03|-|运行yr-datasystem的操作系统| +|EulerOS 2.8/openEuler 20.03|-|运行openYuanrong datasystem的操作系统| |[CANN](#安装cann)|8.0.0或8.0.rc2|运行异构相关特性的依赖库| -|[Python](#安装python)|3.10-3.11|yr-datasystem dscli的使用依赖Python环境| -|[dscli](#安装dscli)|-|用于部署yr-datasystem的命令行工具| -|[ETCD](#安装并部署etcd)|3.5|yr-datasystem集群管理依赖组件| +|[Python](#安装python)|3.10-3.11|openYuanrong datasystem dscli的使用依赖Python环境| +|[dscli](#安装dscli)|-|用于部署openYuanrong datasystem的命令行工具| +|[ETCD](#安装并部署etcd)|3.5|openYuanrong datasystem集群管理依赖组件| 
|[SSH互信配置](#ssh互信配置)|-|仅多机部署需要,配置SSH互信用于机器间互相访问|

 下面给出以上依赖的安装方法。
@@ -76,8 +76,8 @@ conda init bash
 创建虚拟环境,以Python 3.11.4为例:

 ```bash
-conda create -n yr-datasystem_py311 python=3.11.4 -y
-conda activate yr-datasystem_py311
+conda create -n py311 python=3.11.4 -y
+conda activate py311
 ```

 可以通过以下命令查看Python版本。
@@ -87,7 +87,7 @@ python --version
 ```

 ### 安装dscli
-dscli命令行工具集成在yr-datasystem的wheel包 `openyuanrong_datasystem--cp311-cp311-manylinux_2_34_.whl`中,安装yr-datasystem请参考[安装yr-datasystem](install.md)。
+dscli命令行工具集成在openYuanrong datasystem的wheel包 `openyuanrong_datasystem--cp311-cp311-manylinux_2_34_.whl`中,安装openYuanrong datasystem请参考[安装openYuanrong datasystem](install.md)。

 安装完成后,运行如下命令:

 ```bash
@@ -194,7 +194,7 @@ ssh username@hostname

 ### 集群部署

-yr-datasystem集群依赖ETCD,部署前需要先部署ETCD,部署ETCD可参考:[安装并部署ETCD](#安装并部署etcd)。
+openYuanrong datasystem集群依赖ETCD,部署前需要先部署ETCD,部署ETCD可参考:[安装并部署ETCD](#安装并部署etcd)。

 #### 单机部署

@@ -267,7 +267,7 @@ dscli up -f ./cluster_config.json

 > 注意事项:
 >
-> - yr-datasystem集群依赖ETCD,部署前需要先部署ETCD,部署ETCD可参考:[安装并部署ETCD](#安装并部署etcd)。
+> - openYuanrong datasystem集群依赖ETCD,部署前需要先部署ETCD,部署ETCD可参考:[安装并部署ETCD](#安装并部署etcd)。
 > - 多机集群部署依赖多机之间配置SSH互信,请参考:[SSH互信配置](#ssh互信配置)。
 > - 需要部署的机器上都已安装dscli,dscli安装可参考:[安装dscli](#安装dscli)。
 >
@@ -285,7 +285,7 @@ client = DsClient("127.0.0.1", 31501)
 client.init()
 ```

-当脚本执行未发生异常时说明yr-datasystem的客户端能正常连接上当前节点的ds-worker,部署成功。
+当脚本执行未发生异常时说明openYuanrong datasystem的客户端能正常连接上当前节点的ds-worker,部署成功。

 ### 集群卸载

@@ -323,22 +323,22 @@ dscli down -f ./cluster_config.json

 当输出如上信息时说明集群卸载成功。

-## yr-datasystem Kubernetes部署
+## openYuanrong datasystem Kubernetes部署

 ### 部署环境准备

-yr-datasystem Kubernetes部署所需的依赖如下:
+openYuanrong datasystem Kubernetes部署所需的依赖如下:

 |软件名称|推荐版本|作用|
 |--------|-------|----|
 |EulerOS 2.8/openEuler 20.03|-|支持运行Kubernetes与Docker的操作系统|
 |[kubectl](#安装kubectl)|-|运行异构相关特性的依赖库|
-|[Kubernetes](#安装kubernetes)|-|Kubernetes集群,用于编排和管理yr-datasystem的容器|
-|[Helm](#安装helm)|-|yr-datasystem 
dscli的使用依赖Python环境|
-|[Docker](#安装docker)|-|提供容器化平台,支持yr-datasystem容器化部署和运行|
-|[ETCD](#安装并部署etcd)|3.5|yr-datasystem集群管理依赖组件|
-|[yr-datasystem镜像](#获取yr-datasystem镜像)|-|yr-datasystem服务端组件镜像|
-|[yr-datasystem helm chart](#获取yr-datasystem-helm-chart包)|-|yr-datasystem helm chart包|
+|[Kubernetes](#安装kubernetes)|-|Kubernetes集群,用于编排和管理openYuanrong datasystem的容器|
+|[Helm](#安装helm)|-|用于部署openYuanrong datasystem的Kubernetes包管理工具|
+|[Docker](#安装docker)|-|提供容器化平台,支持openYuanrong datasystem容器化部署和运行|
+|[ETCD](#安装并部署etcd)|3.5|openYuanrong datasystem集群管理依赖组件|
+|[openYuanrong datasystem镜像](#获取openyuanrong-datasystem镜像)|-|openYuanrong datasystem服务端组件镜像|
+|[openYuanrong datasystem helm chart](#获取openyuanrong-datasystem-helm-chart包)|-|openYuanrong datasystem helm chart包|

 下面给出以上软件的获取及安装方法。
@@ -362,7 +362,7 @@ yr-datasystem Kubernetes部署所需的依赖如下:

 安装详情请参考 [安装并部署ETCD](#安装并部署etcd) 章节。

-#### 获取yr-datasystem镜像
+#### 获取openYuanrong datasystem镜像

 - 通过镜像仓获取镜像:

@@ -388,9 +388,9 @@ yr-datasystem Kubernetes部署所需的依赖如下:
   - image_name: 生成的镜像名
   - image_tag: 生成的镜像Tag

-  执行完成之后会在yr-datasystem/docker目录下生成一个build目录,build目录中会生成一个 `datasystem.tar` 文件,即为镜像压缩文件;与此同时本地的docker仓库中也会保存 `:` 的镜像。
+  执行完成之后会在yuanrong-datasystem/docker目录下生成一个build目录,build目录中会生成一个 `datasystem.tar` 文件,即为镜像压缩文件;与此同时本地的docker仓库中也会保存 `:` 的镜像。

-#### 获取yr-datasystem helm chart包
+#### 获取openYuanrong datasystem helm chart包

 - 通过 dscli 命令行工具获取:

@@ -409,7 +409,7 @@ yr-datasystem Kubernetes部署所需的依赖如下:

 ### 集群部署

-yr-datasystem通过 [/tmp/datasystem/values.yaml](#获取yr-datasystem-helm-chart包) 文件进行集群相关配置,其中必配项如下:
+openYuanrong datasystem通过 [/tmp/datasystem/values.yaml](#获取openyuanrong-datasystem-helm-chart包) 文件进行集群相关配置,其中必配项如下:

 ```yaml
 global:
@@ -419,7 +419,7 @@ global:
   imageRegistry: "swr.cn-south-1.myhuaweicloud.com/openeuler/"
   # 镜像名字和镜像tag,需要替换为对应的版本号
   images:
-    datasystem: "yr-datasystem:"
+    datasystem: "openyuanrong-datasystem:"

   etcd:
     # ETCD集群地址
@@ -428,7 +428,7 @@ global:

 > 注意事项:
 >
-> - 镜像仓地址与镜像名称获取请参考:[获取yr-datasystem镜像](#获取yr-datasystem镜像)。
+> - 
镜像仓地址与镜像名称获取请参考:[获取openYuanrong datasystem镜像](#获取openyuanrong-datasystem镜像)。
 > - ETCD集群的部署与IP地址获取请参考:[安装并部署ETCD](#安装并部署etcd)。

 配置完成后,通过 helm 命令即可轻松完成部署,命令如下:
@@ -459,7 +459,7 @@ kubectl get pods -o wide

 ### 快速验证

-yr-datasystem会默认以DamonSet的方式在每个节点都部署一个 `ds-worker` Pod,默认监听 `<主机IP>:31501`,可通过如下 Python 脚本快速验证:
+openYuanrong datasystem会默认以DaemonSet的方式在每个节点都部署一个 `ds-worker` Pod,默认监听 `<主机IP>:31501`,可通过如下 Python 脚本快速验证:

 ```python
 from datasystem.ds_client import DsClient
@@ -468,7 +468,7 @@ client = DsClient("127.0.0.1", 31501)
 client.init()
 ```

-当脚本执行未发生异常时说明yr-datasystem的客户端能正常连接上当前节点的ds-worker,部署成功。
+当脚本执行未发生异常时说明openYuanrong datasystem的客户端能正常连接上当前节点的ds-worker,部署成功。

 ### 集群卸载

@@ -486,4 +486,4 @@ kubectl get pods -o wide
 # Pod列表中不存在ds-worker Pod
 ```

-当yr-datasystem所有的ds-worker Pod都退出时说明集群卸载成功。
\ No newline at end of file
+当openYuanrong datasystem所有的ds-worker Pod都退出时说明集群卸载成功。
\ No newline at end of file
diff --git a/docs/source_zh_cn/getting-started/install.md b/docs/source_zh_cn/getting-started/install.md
index 24f6773e1922024f1556439ec3869ea2c8d87c6f..e0600098f18f2dda2c2592d52bdbc89250557b79 100644
--- a/docs/source_zh_cn/getting-started/install.md
+++ b/docs/source_zh_cn/getting-started/install.md
@@ -1,14 +1,14 @@
-# 安装yr-datasystem
+# 安装openYuanrong datasystem


-- [快速安装yr-datasystem](#快速安装yr-datasystem版本)
+- [快速安装openYuanrong datasystem](#快速安装openyuanrong-datasystem版本)
   - [环境准备](#环境准备)
     - [安装Python](#安装python)
     - [安装CANN](#安装cann)
   - [pip安装](#pip安装)
   - [安装自定义版本](#安装自定义版本)
-- [源码编译方式安装yr-datasystem](#源码编译方式安装yr-datasystem版本)
+- [源码编译方式安装openyuanrong datasystem](#源码编译方式安装openyuanrong-datasystem版本)
   - [环境准备](#环境准备-1)
     - [安装Python](#安装python-1)
     - [安装CANN](#安装cann-1)
@@ -18,32 +18,33 @@
     - [安装git patch make libtool](#安装git-patch-make-libtool)
     - [安装CMake](#安装cmake)
   - [从代码仓下载源码](#从代码仓下载源码)
-  - [编译yr-datasystem](#编译yr-datasystem)
-  - [安装yr-datasystem](#安装yr-datasystem)
+  - [编译openyuanrong datasystem](#编译openyuanrong-datasystem)
+  - [安装openyuanrong 
datasystem](#安装openyuanrong-datasystem)

-[![查看源文件](https://Mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](install.md)
+[![查看源文件](https://Mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/openeuler/yuanrong-datasystem/blob/master/docs/source_zh_cn/getting-started/install.md)

-本文档介绍如何在CPU/NPU环境的Linux系统上,快速安装yr-datasystem或者使用源码编译方式安装yr-datasystem。
+本文档介绍如何在CPU/NPU环境的Linux系统上,快速安装openYuanrong datasystem或者使用源码编译方式安装openYuanrong datasystem。

-## 快速安装yr-datasystem版本
+## 快速安装openYuanrong datasystem版本

### 环境准备

-下表列出了运行yr-datasystem所需的系统环境和第三方依赖:
+下表列出了运行openYuanrong datasystem所需的系统环境和第三方依赖:

|软件名称|版本|作用|
|-|-|-|
-|EulerOS 2.8/openEuler 20.03|-|运行yr-datasystem的操作系统|
-|[CANN](#安装cann)|8.0.0或8.0.rc2|运行异构相关特性的依赖库|
-|[Python](#安装python)|3.10-3.11|yr-datasystem的运行依赖Python环境|
+|openEuler|22.03|运行openYuanrong datasystem的操作系统|
+|[CANN](#安装cann)|8.2.RC1|运行异构相关特性的依赖库|
+|[Python](#安装python)|3.9-3.11|openYuanrong datasystem的运行依赖Python环境|

下面给出第三方依赖的安装方法。

#### 安装CANN

-在[Ascend官网](https://www.hiascend.com/hardware/firmware-drivers/community?product=1&model=30&cann=8.0.0.beta1&driver=Ascend+HDK+24.1.RC3)下载CANN run包,安装 run 包:
+在[Ascend官网](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.2.RC1)下载CANN run包,安装 run 包:

```bash
+chmod +x ./Ascend-cann-toolkit_<version>_linux-<arch>.run
./Ascend-cann-toolkit_<version>_linux-<arch>.run --install
```

执行以上命令会打屏华为企业业务最终用户许可协议(EULA)的条款和条件,请输入Y或y同意协议,继续安装流程。
@@ -83,8 +84,8 @@ conda init bash

创建虚拟环境,以Python 3.11.4为例:

```bash
-conda create -n yr-datasystem_py311 python=3.11.4 -y
-conda activate yr-datasystem_py311
+conda create -n py311 python=3.11.4 -y
+conda activate py311
```

可以通过以下命令查看Python版本。
@@ -94,42 +95,62 @@
python --version
```

### pip安装
+
安装PyPI上的版本:
-```bash
-pip install yr-datasystem
-```
+
+- 安装 openYuanrong datasystem 完整发行版(包含Python SDK、C++ SDK以及命令行工具):
+  ```bash
+  pip install openyuanrong-datasystem
+  ```
+ +- 仅安装 openYuanrong datasystem Python SDK(不包含C++ SDK以及命令行工具): + ```bash + pip install openyuanrong-datasystem-sdk + ``` ### 安装自定义版本 -指定好yr_datasystem以及配套的Python版本,运行如下命令安装yr_datasystem包: -```bash -# 指定yr_datasystem版本为1.0.0 -export version="1.0.0" -# 指定Python版本为3.11 -export py_version="311" -pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/${version}/yr_datasystem/any/yr-datasystem-${version}-cp${py_version}-cp${py_version}-linux_x86_64.whl --trusted-host ms-release.obs.cn-north-4.myhuaweicloud.com -i https://pypi.tuna.tsinghua.edu.cn/simple -``` -## 源码编译方式安装yr-datasystem版本 +指定好openYuanrong datasystem以及配套的Python版本,运行如下命令安装openYuanrong datasystem包: + +- 安装 openYuanrong datasystem 完整发行版(包含Python SDK、C++ SDK以及命令行工具): + ```bash + export version="0.5.0" + export py_version="$(python -c 'import sys;print(f"{sys.version_info.major}{sys.version_info.minor}")')" + export arch="$(uname -m)" + + pip install https://openyuanrong.obs.cn-southwest-2.myhuaweicloud.com/openyuanrong_datasystem-${version}-cp${py_version}-cp${py_version}-manylinux_2_34_${arch}.whl + ``` + +- 仅安装 openYuanrong datasystem Python SDK(不包含C++ SDK以及命令行工具): + ```bash + export version="0.5.0" + export py_version="$(python -c 'import sys;print(f"{sys.version_info.major}{sys.version_info.minor}")')" + export arch="$(uname -m)" + + pip install https://openyuanrong.obs.cn-southwest-2.myhuaweicloud.com/openyuanrong_datasystem_sdk-${version}-cp${py_version}-cp${py_version}-manylinux_2_34_${arch}.whl + ``` + +## 源码编译方式安装openYuanrong datasystem版本 (源码环境准备)= ### 环境准备 -下表列出了源码编译yr-datasystem所需的系统环境和第三方依赖: +下表列出了源码编译openYuanrong datasystem所需的系统环境和第三方依赖: |软件名称|版本|作用| |-|-|-| -|EulerOS 2.8/openEuler 20.03|-|编译和运行yr-datasystem的操作系统| -|[CANN](#安装cann)|8.0.0或8.0.rc2|编译和运行异构相关特性的依赖库| -|[Python](#安装python)|3.10-3.11|yr-datasystem的使用依赖Python环境| -|[wheel](#安装wheel和setuptools)|0.32.0及以上|yr-datasystem使用的Python打包工具| -|[setuptools](#安装wheel和setuptools)|44.0及以上|yr-datasystem使用的Python包管理工具| 
-|[GCC](#安装gcc)|7.3.0|用于编译yr-datasystem的C编译器|
-|[G++](#安装g)|7.3.0|用于编译yr-datasystem的C++编译器|
-|[libtool](#安装git-patch-make-libtool)|-|编译构建yr-datasystem的工具|
-|[git](#安装git-patch-make-libtool)|-|yr-datasystem使用的源代码管理工具|
-|[Make](#安装git-patch-make-libtool)|-|yr-datasystem使用的源代码管理工具|
-|[CMake](#安装cmake)|3.18.3及以上|编译构建yr-datasystem的工具|
-|[patch](#安装git-patch-make-libtool)|2.5及以上|yr-datasystem使用的源代码补丁工具|
+|openEuler|22.03|编译和运行openYuanrong datasystem的操作系统|
+|[CANN](#安装cann)|8.2.RC1|编译和运行异构相关特性的依赖库|
+|[Python](#安装python)|3.10-3.11|openYuanrong datasystem的使用依赖Python环境|
+|[wheel](#安装wheel和setuptools)|0.32.0+|openYuanrong datasystem使用的Python打包工具|
+|[setuptools](#安装wheel和setuptools)|44.0+|openYuanrong datasystem使用的Python包管理工具|
+|[GCC](#安装gcc)|7.3.0+|用于编译openYuanrong datasystem的C编译器|
+|[G++](#安装g)|7.3.0+|用于编译openYuanrong datasystem的C++编译器|
+|[libtool](#安装git-patch-make-libtool)|-|编译构建openYuanrong datasystem的工具|
+|[git](#安装git-patch-make-libtool)|-|openYuanrong datasystem使用的源代码管理工具|
+|[Make](#安装git-patch-make-libtool)|-|编译构建openYuanrong datasystem的工具|
+|[CMake](#安装cmake)|3.18.3+|编译构建openYuanrong datasystem的工具|
+|[patch](#安装git-patch-make-libtool)|2.5+|openYuanrong datasystem使用的源代码补丁工具|

下面给出第三方依赖的安装方法。

@@ -212,12 +233,12 @@ source ${HOME}/Ascend/ascend-toolkit/set_env.sh
```

详细介绍请参考[安装CANN](#安装cann)章节。

-### 编译yr-datasystem
+### 编译openYuanrong datasystem

-进入yr-datasystem根目录,然后执行编译脚本。
+进入yuanrong-datasystem根目录,然后执行编译脚本。

```bash
-cd yr-datasystem
+cd yuanrong-datasystem
bash build.sh
```

@@ -226,7 +247,7 @@ bash build.sh
- `build.sh`中默认的编译线程数为8,如果编译机性能较差可能会出现编译错误,可在执行中增加-j{线程数}来减少线程数量。如`bash build.sh -j4`。
- 关于`build.sh`更多用法请执行`bash build.sh -h`获取帮助或者参看脚本头部的说明。

-### 安装yr-datasystem
+### 安装openYuanrong datasystem

```bash
pip install output/openyuanrong_datasystem-*.whl
diff --git a/install_tools.sh b/install_tools.sh
index 5cc30d363aef37542c73c21acf39208bc713b9bd..33483f51413ab894ac86e3e992045719c24160bc 100644
--- a/install_tools.sh
+++ b/install_tools.sh
@@ -309,4 +309,4 @@ 
if check_tools; then
else
    echo "Error, init environment failed"
    exit 1
-fi
+fi
\ No newline at end of file
diff --git a/setup.py b/setup.py
index cd3295db28523426f6c4f6b30d3f0ea825b49ed6..b32bb8b4bf54b2a7a1b4f7d38ed6f9de9b9f654a 100644
--- a/setup.py
+++ b/setup.py
@@ -16,8 +16,11 @@
"""setup_package."""
import os
+import shutil
import stat
+import subprocess
+from pathlib import Path
from setuptools import find_packages, setup
from setuptools.command.build_py import build_py
from setuptools.command.egg_info import egg_info
@@ -59,6 +62,47 @@ def build_depends():

build_depends()


+def get_dependencies(file_path):
+    """
+    get dependencies of file
+    """
+    dependencies = set()
+    ldd_path = shutil.which("ldd")
+    if ldd_path is None:
+        raise FileNotFoundError("can't find ldd, get dependencies failed")
+    result = subprocess.run([ldd_path, file_path],
+                            capture_output=True, text=True, check=True)
+    output = result.stdout
+    for line in output.splitlines():
+        line = line.strip()
+        if not line:
+            continue
+        if '=>' in line:
+            # Keep only the library name on the left of '=>'.
+            lib_name = line.split('=>', 1)[0].strip()
+            dependencies.add(lib_name)
+        else:
+            lib_name = line.split()[0]
+            dependencies.add(lib_name)
+    return dependencies
+
+
+def get_all_dependencies():
+    """
+    get all dependencies for datasystem
+    """
+    all_dependencies = {"libdatasystem.so", "libds_client_py.so"}
+    src = os.path.join(os.path.dirname(__file__), 'datasystem', 'lib')
+    worker = os.path.join(os.path.dirname(__file__), 'datasystem', 'datasystem_worker')
+    src_path = Path(src)
+    bin_path = Path(worker)
+    all_dependencies.update(get_dependencies(bin_path))
+    for item in src_path.rglob('*'):
+        # Skip directories: ldd only works on regular files.
+        if item.is_file():
+            all_dependencies.update(get_dependencies(item))
+    return all_dependencies
+
+
+all_dependencies_for_datasystem = get_all_dependencies()
+
+
def update_permissions(path):
    """
    Update the permissions of files and directories within the specified path.
@@ -102,6 +146,11 @@ class BuildPy(build_py):
            datasystem_lib_dir, 'lib', 'datasystem', 'datasystem_worker')
        os.chmod(worker_bin, stat.S_IREAD | stat.S_IWRITE | stat.S_IEXEC)
        os.system(f"strip --strip-all {worker_bin}")
+        lib_dir = os.path.join(os.path.dirname(__file__), 'build', 'lib', 'datasystem', 'lib')
+        lib_path = Path(lib_dir)
+        for item in lib_path.rglob('*'):
+            # Skip directories so Path.unlink() does not raise.
+            if item.is_file() and item.name not in all_dependencies_for_datasystem:
+                item.unlink()


class CustomBdistWheel(_bdist_wheel):
diff --git a/third_party/P2P-Transfer/CMakeLists.txt b/third_party/P2P-Transfer/CMakeLists.txt
index dd72bbf7cce69c9db887b8525c95e868b6d5e518..1c61460f83fac45e4104882dfc8486b1812659eb 100644
--- a/third_party/P2P-Transfer/CMakeLists.txt
+++ b/third_party/P2P-Transfer/CMakeLists.txt
@@ -6,7 +6,6 @@
project(
  p2p-transfer
  VERSION 0.1.0
  DESCRIPTION "Fast NPU peer to peer transfer library"
-  HOMEPAGE_URL "https://example.com/SIR/usched/P2P-Transfer"
  LANGUAGES CXX
)
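
---

补充说明:上面 setup.py 改动的核心是解析 `ldd` 输出来收集共享库依赖名,再据此裁剪打包目录。下面是该解析步骤的一个独立示意(`parse_ldd_output` 为示意用的函数名,示例中的 `ldd` 输出为虚构数据,仅用于演示两种输出行格式):

```python
def parse_ldd_output(output):
    """Collect shared-library names from `ldd` output.

    `ldd` prints two line shapes:
        libfoo.so.1 => /usr/lib64/libfoo.so.1 (0x...)   # resolved dependency
        /lib64/ld-linux-x86-64.so.2 (0x...)             # loader / vDSO line
    """
    dependencies = set()
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        if '=>' in line:
            # Keep only the library name on the left of '=>'.
            dependencies.add(line.split('=>', 1)[0].strip())
        else:
            dependencies.add(line.split()[0])
    return dependencies


# Hypothetical ldd output for a datasystem worker binary (made-up paths).
sample = """\
    linux-vdso.so.1 (0x00007ffc3b1f0000)
    libdatasystem.so => /opt/datasystem/lib/libdatasystem.so (0x00007f1a2c000000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f1a2b800000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f1a2c400000)
"""
print(sorted(parse_ldd_output(sample)))
```

裁剪逻辑据此只保留 `item.name` 出现在该集合中的 `.so` 文件,其余从 wheel 构建目录中删除。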