diff --git a/docs/mindformers/docs/source_en/feature/resume_training.md b/docs/mindformers/docs/source_en/feature/resume_training.md
index 15d6da93f87596d6ad735fcca2f8cafad1ab6b9c..0551fe4ffe696d853a528f52ddb09d0910573c96 100644
--- a/docs/mindformers/docs/source_en/feature/resume_training.md
+++ b/docs/mindformers/docs/source_en/feature/resume_training.md
@@ -4,206 +4,178 @@
## Overview
-MindSpore Transformers supports **step-level resumable training**, which allows the checkpoints of a model to be saved during training. If the training is interrupted, you can load a saved checkpoint to resume the training. This feature is crucial for processing large-scale training tasks, and can effectively reduce time and resource waste caused by unexpected interruptions. In addition, to resume a training where the dataset remains unchanged but the `global batch size` is changed, for example, when the cluster is changed or the configuration is modified, this tool supports automatic scaling of the number of resumable training steps and skipped data steps in the same proportion.
+MindSpore Transformers supports **step-level resume training**: saved checkpoints can be loaded to restore the previous training state and continue training. This feature is particularly important for large-scale training tasks, as it effectively reduces the time and resource waste caused by unexpected interruptions.
-## Configuration and Usage
+MindSpore Transformers supports saving and loading weights in both **ckpt** and **safetensors** formats. It supports various resume training scenarios such as **interrupted training resumption**, **strategy conversion resumption**, **incremental training resumption**, and **automatic recovery resumption**. It also supports different weight loading methods including **loading the last fully saved weights**, **loading weights from a specified step**, and **loading MindSpore merged weights** for resumption.
-### YAML Parameters
+In a distributed environment, resume training requires that weights from all nodes be stored in the **same shared directory**. Users can set the shared path via the environment variable `SHARED_PATHS`.
-You can modify the configuration file to control resumable training. The main parameters are as follows. For details about other parameters, see the description of CheckpointMonitor.
+## Introduction to Weight and Strategy Files
-| Parameter | Description |
-|------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| load_checkpoint | Weight path loaded during resumable training. The path can be a folder path (used to load distributed weights) or a specific weight file path. The default value is an empty string, indicating that no weight is loaded (required for resumable training). When the configured path is an empty directory, the system will fall back to pretraining with randomly initialized weights. |
-| resume_training | Specifies whether to enable resumable training. You can set it to `True` or specify a weight file name. If the value is `True`, the system automatically resumes the training from the last interruption. The default value is `False`. |
-| load_ckpt_async | Determines whether to load model weights and compile in parallel (this configuration does not take effect when auto_trans_ckpt is set to true). The default value is False (serial execution).
When it is `True`, the parallel capability of loading ckpt weights and building model is enabled to reduce the overall time resume training. |
+MindSpore Transformers saves weight and strategy files to the `output/checkpoint` and `output/strategy` folders by default. Users can change the path of the `output` folder by modifying the `output_dir` parameter in the YAML configuration.
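+A minimal sketch of this setting (the path below is illustrative); checkpoint and strategy files are then written under `<output_dir>/checkpoint` and `<output_dir>/strategy`:
+```yaml
+# Redirect all training outputs to a shared directory (illustrative path)
+output_dir: '/data01/llama3_1_8b/output'
+```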
-Based on the input parameters, there are four cases.
+Weight files mainly store the **network parameters**, **optimizer parameters**, and **resume training information**. Weights are saved into separate rank folders, and each rank folder maintains a `meta.json` file that records the last fully saved weights of that rank. Taking a single node with 8 devices as an example, the weights are saved in the following layout:
-| load_checkpoint | resume_training | Description | Recommended or Not |
-|---------------------|-------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------|
-| Weight file path | True | Resumes a training based on the weights specified by load_checkpoint. | √ |
-| Weight file path | Weight file name | The file name specified by resume_training is invalid. A training is resumed based on the weights specified by load_checkpoint. | × |
-| Weight folder path | True | **Scenario 1: Single-node system, multi-node system+shared directory, or ModelArts**
1. Resumes the training based on the weights recorded in meta.json files and supports fault recovery.
2. Resumes the training based on the latest weight of all ranks if the meta.json file of any rank is missing.
**Scenario 2: Multi-node+non-shared directory**
Resumes the training based on the latest weight of all ranks.
**Scenario 3: Automatically resume training**
To facilitate using the automatic training recovery feature, configure `load_checkpoint` as the save path for weight checkpoints, eliminating the need to manually modify this setting when resuming training. If the directory is empty during initial training, weights will initialize randomly normally; when resuming, training will recover from checkpoints saved in this directory. | √ |
-| Weight folder path | Weight file name | Resumes the training based on the weights specified by resume_training. | √ |
+```text
+output/checkpoint
+ ├── rank_0
+ ├── meta.json
+ └── {prefix}-{epoch}_{step}.safetensors
+ ├── rank_1
+ ├── meta.json
+ └── {prefix}-{epoch}_{step}.safetensors
+ ...
+ ├── rank_7
+ ├── meta.json
+ └── {prefix}-{epoch}_{step}.safetensors
+```
-In addition, you can modify the following parameters in the configuration file to use related functions.
+> The prefix of the weight name contains rank_id information, e.g., `llama3_1_8b_rank_0`. If a weight with the same prefix already exists when saving, an incremental suffix will be automatically added to the prefix to prevent overwriting old weights. For example, if "llama3_1_8b_rank_0" already exists, the prefix will be updated to "llama3_1_8b_rank_0_1", and if "llama3_1_8b_rank_0_1" also exists, it will be updated to "llama3_1_8b_rank_0_2".
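+The weight files above are written by the `CheckpointMonitor` callback configured in the training YAML. Below is a saving-configuration sketch adapted from the example previously used in this document; the `checkpoint_format` field is an assumption here for selecting safetensors output:
+```yaml
+callbacks:
+  - type: CheckpointMonitor
+    prefix: "llama3_1_8b"              # file name prefix; the rank_id suffix is appended automatically
+    save_checkpoint_steps: 10          # save weights every 10 steps
+    keep_checkpoint_max: 3             # keep at most 3 checkpoints per rank
+    integrated_save: False
+    async_save: False
+    checkpoint_format: "safetensors"   # assumed option for saving in safetensors format
+```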
-| Parameter | Description |
-|------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| ignore_data_skip | Specifies whether to ignore the mechanism of skipping data during resumable training and read the dataset from the beginning instead. This parameter is used when the dataset is changed during resumable training. If this parameter is set to `True`, no data is skipped. The default value is `False`. |
-| data_skip_steps | Number of steps skipped for the dataset. This parameter is used when the training is interrupted again after being resumed because the dataset or `global batch size` is changed. You need to manually set this parameter to configure the number of steps skipped for the new dataset. If the `global batch size` is changed, you need to divide and round down its value by the scaling coefficient and then specify the result as the value of this parameter. |
+Strategy files are saved only in distributed training tasks and are used for **weight strategy conversion**. They are always saved in ckpt format with the rank_id as the file name suffix and mainly record the network and optimizer sharding information of the current rank. Taking a single node with 8 devices as an example, the strategy files are saved in the following layout:
-### Fault Recovery Mechanism
+```text
+output/strategy
+ ├── ckpt_strategy_rank_0.ckpt
+ ├── ckpt_strategy_rank_1.ckpt
+ ...
+ └── ckpt_strategy_rank_7.ckpt
+```
-If `resume_training` is set to `True`, the system automatically resumes training based on the weights recorded in `meta.json`. If the weight file of a rank is missing or damaged, the system rolls back to the latest available weight for recovery.
+> Strategy files overwrite existing files with the same name when saved. To avoid overwriting or mixing strategy files from different tasks, copy them to a custom folder promptly.
-> In a distributed environment, resumable training requires that the weights of all nodes be in the same shared directory. You can use the `SHARED_PATHS` environment variable to set the shared path.
+For more information about weights, refer to [Ckpt Weights](https://www.mindspore.cn/mindformers/docs/en/master/feature/ckpt.html) and [Safetensors Weights](https://www.mindspore.cn/mindformers/docs/en/master/feature/safetensors.html).
-## Example of Distributed Training
+## YAML Parameter Configuration Description
-The following example shows how to enable resumable training in single-device and multi-device environments. The example is based on the `llama3.1 8b` model.
-For related configuration files, see [research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml).
+| Parameter | Description |
+| ------------------------ | ------------------------------------------------------------ |
+| load_checkpoint | Path to the weight file or folder, **required for resume training**; defaults to an empty string.<br>If the configured path is an empty directory, training falls back to pre-training with randomly initialized weights.<br>For single-device weights, set this to the weight file path and make sure its parent directory name does not start with "rank_". |
+| src_strategy_path_or_dir | Path to the strategy file or folder; required when **`auto_trans_ckpt=True` and load_checkpoint points to distributed weights**. Defaults to an empty string.<br>If the weights configured in load_checkpoint are not sharded with pipeline parallelism, any single strategy file path can be configured; otherwise, configure the strategy folder path. |
+| auto_trans_ckpt | Switch for automatic weight conversion; enable it when the **weights configured in load_checkpoint do not match the distributed strategy of the current task**. Defaults to `False`. |
+| transform_process_num | Number of processes used for automatic weight conversion, **only applicable to automatic conversion of ckpt-format weights**; it can speed up the conversion. Defaults to `None` (disabled).<br>The value must evenly divide the total number of devices in the cluster. Larger values increase host memory usage; reduce the number of processes if host memory is insufficient. |
+| resume_training | Switch for resume training; can be set to `True` or to the name of a weight file in any rank sub-folder. Defaults to `False`.<br>When set to `True`, the **last fully saved weights** are loaded for resumption.<br>When set to a weight file name, the **weights of the specified step** are loaded for resumption. |
+| load_ckpt_format | Format of the weights configured in load_checkpoint; can be `safetensors` or `ckpt`. Defaults to `ckpt`. |
+| remove_redundancy | Switch for redundancy-free loading; enable it when the weights configured in load_checkpoint are **safetensors weights saved without redundancy**. Defaults to `False`. |
+| load_ckpt_async | Whether to load weights in parallel with model compilation. This option **only applies to asynchronous loading of ckpt-format weights with an unchanged distributed strategy**. Defaults to `False`. |
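+As an illustration of how these parameters combine, the following sketch resumes from safetensors weights that were saved without redundancy, with the distributed strategy unchanged (paths are placeholders):
+```yaml
+load_checkpoint: /path/to/checkpoint   # shared directory containing the rank_x sub-folders
+resume_training: True                  # resume from the last fully saved weights
+load_ckpt_format: safetensors          # weights are stored in safetensors format
+remove_redundancy: True                # weights were saved without redundancy
+```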
-### Complete Training
+## Introduction to Resume Training Scenarios
-1. Modify `research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml`.
+### Interrupted Training Resumption
- For initial training with randomly initialized weights followed by resume training without changing the configuration file, set `resume_training` to `True` and `load_checkpoint` to the directory where checkpoints will be saved:
+**Overview**: Resume training based on saved weights after an unexpected interruption of a normal training task, without changing the distributed strategy.
- ```yaml
- load_checkpoint: './output/checkpoint'
- resume_training: True
- ```
+- Resume training from the last fully saved weights
- > Use an empty directory for `load_checkpoint` only if it is intended for saving checkpoints; otherwise, the next run will start from scratch instead of resuming.
+ ```yaml
+ load_checkpoint: /path/to/checkpoint
+ resume_training: True
+ ```
- Configure the parallelism as required.
+ The system will automatically search for and load the last fully saved weights based on the weight records in each rank's `meta.json` for resumption.
- ```yaml
- parallel_config:
- data_parallel: 1
- model_parallel: 2
- pipeline_stage: 2
- micro_batch_num: 2
- ```
+  > If none of the rank sub-folders in the weight folder contains a `meta.json`, resumption falls back to the weights with the latest timestamp in each rank folder.
- Configure the model weight saving as required.
+- Resume training from weights of a specified step
- ```yaml
- callbacks:
- ...
- - type: CheckpointMonitor
- prefix: "llama3_1_8b"
- save_checkpoint_steps: 10
- keep_checkpoint_max: 3
- integrated_save: False
- async_save: False
- ...
- ```
+ ```yaml
+ load_checkpoint: /path/to/checkpoint
+ # For ckpt weights, fill in {prefix}-{epoch}_{step}.ckpt
+ resume_training: {prefix}-{epoch}_{step}.safetensors
+ ```
-2. Prepare a dataset. The following uses [alpaca datasets](https://gitee.com/mindspore/mindformers/blob/master/research/llama3_1/README.md#%E6%95%B0%E6%8D%AE%E9%9B%86%E5%8F%8A%E6%9D%83%E9%87%8D%E5%87%86%E5%A4%87) as an example to describe how to start four-device distributed training.
+  Users must ensure the integrity of the specified weights. Each rank automatically replaces the rank information in the prefix to derive the weight name it loads. For example, if the specified weight name is `llama3_1_8b_rank_0-200_1.safetensors`, rank_1 loads `llama3_1_8b_rank_1-200_1.safetensors`. If the weight file is missing for any rank, an error is reported that the file cannot be found.
- ```shell
- bash scripts/msrun_launcher.sh "run_mindformer.py \
- --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \
- --train_dataset /path/to/alpaca-fastchat8192.mindrecord \
- --run_mode train \
- --use_parallel True" 4
- ```
+### Strategy Conversion Resumption
- After the fourth saving is complete, end the process. The structure of the `rank_0` folder under `checkpoint` is as follows:
+**Overview**: Continue training after modifying the **distributed strategy** or **scaling the cluster up or down**; this requires **enabling automatic weight conversion**.
- ```text
- checkpoint/rank_0
- ├── llama3_1_8b_rank_0-10_2.ckpt
- ├── llama3_1_8b_rank_0-15_2.ckpt
- ├── llama3_1_8b_rank_0-20_2.ckpt
- └── meta.json
- ```
+#### Safetensors Weights
-### Resumable Training
+Enabling automatic weight conversion merges the safetensors weights into [full weights](https://www.mindspore.cn/mindformers/docs/en/master/feature/safetensors.html#full-weights) before distributed loading. The merged safetensors weights are saved to the `output/unified_checkpoint` folder. If the weights have already been merged offline into [full weights](https://www.mindspore.cn/mindformers/docs/en/master/feature/safetensors.html#full-weights), they are loaded directly in a distributed manner. For the offline merging steps, refer to the [Safetensors Weights - Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/en/master/feature/safetensors.html) section.
-1. If `resume_training` is set to `False` in the pre-training configuration, update the configuration to specify the resumable training weight file.
+- Resume training from the last fully saved weights
- ```yaml
- load_checkpoint: './output/checkpoint'
- resume_training: True
- ```
+ ```yaml
+ load_checkpoint: /path/to/checkpoint
+ src_strategy_path_or_dir: /path/to/strategy
+ resume_training: True
+ auto_trans_ckpt: True
+ ```
-2. Resume training.
+- Resume training from weights of a specified step
- ```shell
- bash scripts/msrun_launcher.sh "run_mindformer.py \
- --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \
- --train_dataset /path/to/alpaca-fastchat8192.mindrecord \
- --run_mode train \
- --use_parallel True" 4
- ```
+ ```yaml
+ load_checkpoint: /path/to/checkpoint
+ src_strategy_path_or_dir: /path/to/strategy
+ resume_training: {prefix}-{epoch}_{step}.safetensors
+ auto_trans_ckpt: True
+ ```
- If the initial number of steps is `42`, the training is resumed successfully. The saved weight file contains the information about step `40`. The default value of `sink_size` is `2`, indicating that the information is printed every two steps. Therefore, the initial number of steps is `42`.
+- Resume training from merged weights
-### Resumable Training with the Dataset Changed
+ ```yaml
+ load_checkpoint: /path/to/unified_checkpoint
+ resume_training: True
+ auto_trans_ckpt: True
+ ```
-There are three main scenarios where the dataset is changed in resumable training. You need to modify the configuration file in each scenario. The following describes each case one by one, and describes in detail which step of the basic resumable training process needs to be modified, and how to modify a specific configuration to achieve an expected effect.
+#### Ckpt Weights
-**Scenario 1: Training resumed with a new dataset (but not skipping trained steps)**
+Enabling automatic weight conversion converts the weights to the distributed strategy of the current task before loading. The converted ckpt weights are saved to the `output/transformed_checkpoint` folder and can be loaded directly in subsequent runs without enabling automatic weight conversion again.
-In this scenario, when the new dataset is used, the model training starts from scratch without skipping any data or steps. In this case, you need to set the configuration file **to ignore the previous data progress** so that the model can be trained from scratch based on the new dataset.
+If a rank sub-folder contains weight files from multiple steps, filter the weights offline so that **each rank sub-folder contains only the single ckpt file to be loaded**.
-- **Configuration modification**: You need to set `ignore_data_skip` based on the first step of the basic resumable training process. Set `ignore_data_skip` to `True`, indicating that no data is skipped.
+```yaml
+load_checkpoint: /path/to/checkpoint
+src_strategy_path_or_dir: /path/to/strategy
+resume_training: True
+auto_trans_ckpt: True
+transform_process_num: 8
+```
- ```yaml
- load_checkpoint: './output/checkpoint'
- resume_training: True
- ignore_data_skip: True
- ```
+### Incremental Training Resumption
-- **Expected result**: The model is trained from scratch based on the new dataset without skipping any steps.
+**Overview**: The training data is **produced and trained on incrementally**: after training on the current dataset finishes, newly produced datasets are added and training continues until all datasets are processed. This scenario requires presetting the total number of steps of the learning rate curve in advance, based on the total amount of training data.
-**Scenario 2: Training resumed with a new dataset, skipping trained steps**
+Assume a total of 10T tokens of data will be trained, with each produced dataset containing 1T tokens. The entire training process is completed in 10 epochs, requiring a total of 100,000 steps.
-In this case, the model has been partially trained based on the new dataset (for example, `2` steps have been performed before the training is interrupted), and the training is expected to continue from the last interruption. In this case, you must manually specify the number of steps to be skipped.
+- Step 1: Preset the total training steps to fix the learning rate curve for the entire training process
-- **Configuration modification**: You need to set `ignore_data_skip` and `data_skip_steps` based on the first step of the basic resumable training process. Set `ignore_data_skip` to `False` and use `data_skip_steps` to specify the number of trained steps to skip (for example, `2`).
+ ```yaml
+ lr_schedule:
+ total_steps: 100000
+ ```
- ```yaml
- load_checkpoint: './output/checkpoint'
- resume_training: True
- ignore_data_skip: False
- data_skip_steps: 2
- ```
+- Step 2: Set a sufficiently large epoch value to ensure all datasets can be trained
-- **Expected result**: The model skips the first `2` steps and continues the training from step `3` based on the new dataset.
+ ```yaml
+ runner_config:
+ epochs: 15
+ ```
-**Scenario 3: Training resumed with a new dataset and `global batch size` changed**
+ > The learning rate curve for the entire training process is fixed, and the epoch value setting will not affect the learning rate. You can set a larger value to ensure that all 10 datasets are fully trained.
-If `global batch size` is changed (for example, doubled) when a training is resumed based on a new dataset, you need to scale the number of steps that have been performed when manually specifying the number of steps to be skipped. Specifically, the number of skipped steps needs to be divided and rounded down based on the scaling coefficient. For example, if the value of `global batch size` is changed to `2` times of the original value, the number of steps that need to be skipped is halved.
+- Step 3: After training 1 epoch of the dataset, replace the dataset and resume training. The following example resumes from the last fully saved weights; for other resumption methods, refer to [Interrupted Training Resumption](#interrupted-training-resumption) or [Strategy Conversion Resumption](#strategy-conversion-resumption).
-- **Configuration modification**: Adjust `data_skip_steps` based on Scenario 2. Set `data_skip_steps` to the number of steps after scaling. For example, if `global batch size` is changed to `2` times of the original value, the number of steps to be skipped is changed to `1` (rounded down).
+ ```yaml
+ load_checkpoint: /path/to/checkpoint
+ resume_training: True
+ ```
- ```yaml
- load_checkpoint: './output/checkpoint'
- resume_training: True
- ignore_data_skip: False
- data_skip_steps: 1
- ```
+ > Due to inconsistent sample counts across datasets, the displayed epoch and step may change when resuming with a new dataset. However, the total number of training steps remains unchanged, which is a normal phenomenon.
-- **Expected result**: The model adjusts the number of skipped steps based on the new setting of `global batch size` and continues the training from the specified position.
+### Automatic Recovery Resumption
-### Fault Recovery Example
+**Overview**: To allow a platform to relaunch resume training automatically without manual intervention, set load_checkpoint to the checkpoint save path. During the first training run this directory is empty, so training starts normally with randomly initialized weights; when resuming, training recovers from the last fully saved weights in this directory.
-If some weight files are missing, the system automatically restores the files based on the latest available weight.
+```yaml
+load_checkpoint: /path/to/output/checkpoint
+resume_training: True
+```
-1. Delete the `llama3_1_8b_rank_0-20_2.ckpt` file from the `rank_3` directory. The folder structure after the deletion is as follows:
+## Notes and Recommendations
- ```text
- checkpoint/rank_3
- ├── llama3_1_8b_rank_0-10_2.ckpt
- ├── llama3_1_8b_rank_0-15_2.ckpt
- └── meta.json
- ```
-
-2. Modify the configuration to enable fault recovery.
-
- ```yaml
- load_checkpoint: './output/checkpoint'
- resume_training: True
- ```
-
-3. Start distributed training.
-
- ```shell
- bash scripts/msrun_launcher.sh "run_mindformer.py \
- --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \
- --train_dataset /path/to/alpaca-fastchat8192.mindrecord \
- --run_mode train \
- --use_parallel True" 4
- ```
-
- If the initial number of steps is `32`, the training is resumed successfully. Because the weight of the information in step `40` under `rank_3` is deleted, the weight saved last time, that is, the weight of the information in step `30`, is automatically used. The default value of `sink_size` is `2`, indicating that information is printed every two steps. Therefore, the initial number of steps is `32`.
-
-## Precautions
-
-- **Data offloading**: You must enable data offloading and configure `sink_mode=True` for distributed resumable training.
-- **Weight file check**: Ensure that the weights loaded for resumable training are the ones saved when the training is interrupted instead of in the entire training process. Otherwise, an error is reported.
+- Distributed resume training must enable **data sinking mode** by configuring `sink_mode=True`.
+- It is recommended to set the `SHARED_PATHS` environment variable to the path of the top-level shared directory. For example, if `/data01` is the shared directory and the project directory is under it, configure `export SHARED_PATHS=/data01`.
+- It is recommended to save weights and strategy files of training tasks with different distributed strategies in separate folders.
diff --git a/docs/mindformers/docs/source_zh_cn/feature/resume_training.md b/docs/mindformers/docs/source_zh_cn/feature/resume_training.md
index 08108c9c6b1cb391828b451dbcab0a8e37485e9f..267a2ae23bba5cb1dfff10c16a0d0b74e9ab45de 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/resume_training.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/resume_training.md
@@ -4,206 +4,178 @@
## 概述
-MindSpore Transformers支持**step级断点续训**功能,允许在训练中保存模型的checkpoint,并在训练中断后,加载保存的checkpoint恢复之前的状态继续训练。这一特性在处理大规模训练任务时尤为重要,能够有效减少因意外中断导致的时间和资源浪费。此外,在数据集不变,但`global batch size`改变的断点续训场景下,例如更换集群或修改配置时,本工具还支持续训步数与数据跳过步数自动同比例缩放。
+MindSpore Transformers支持**step级断点续训**功能,支持加载已保存的checkpoint来恢复之前的状态继续训练。这一特性在处理大规模训练任务时尤为重要,能够有效减少因意外中断导致的时间和资源浪费。
-## 配置与使用
+MindSpore Transformers支持保存和加载**ckpt**、**safetensors**两种格式权重,支持**中断续训**、**策略转换续训**、**增量续训**、**自动恢复续训**等多种续训场景,以及支持**加载最后保存完整的权重**、**加载指定step权重**、**加载MindSpore合并的权重**续训等不同的权重加载方式。
-### YAML参数配置
+分布式环境中,断点续训要求所有节点的权重在**同一共享目录**下。用户可通过环境变量`SHARED_PATHS`来设置共享路径。
-用户可通过修改配置文件来控制断点续训的行为。以下是主要参数,其他参数可参考CheckpointMonitor介绍:
+## 权重和策略文件介绍
-| 参数 | 描述 |
-| --------------- |---------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| load_checkpoint | 断点续训时加载的权重路径。路径可以是文件夹路径(用于加载分布式权重),也可以是具体权重文件的路径。默认为空字符串,即不加载权重(断点续训时必填)。当配置的路径为空目录时,会退化为使用随机初始化权重进行预训练。|
-| resume_training | 断点续训开关,可设置为`True`或指定特定的权重文件名。为`True`时,系统会自动从上次中断处恢复训练。默认为`False`。 |
-| load_ckpt_async | 是否将加载权重与模型编译的操作并行执行。不支持在线自动切分权重场景(auto_trans_ckpt=True),该场景下不生效。默认为False串行执行。
为`True`时,并行执行,减少总体拉起续训的耗时。 |
+MindSpore Transformers保存权重和策略文件,默认保存在`output/checkpoint`和`output/strategy`两个文件夹下,用户可以修改yaml配置的`output_dir`参数修改`output`文件夹路径。
-根据传入参数不同,可分为如下四种情况:
+权重文件主要保存了**网络参数**、**优化器参数**和**续训信息**,权重文件根据rank文件夹分开保存,每个rank文件夹下单独维护一个`meta.json`文件用以记录当前rank最后保存完整的权重信息。以单机8卡为例,权重保存格式如下:
-| load_checkpoint | resume_training | 功能描述 | 是否为推荐使用方式 |
-|-----------------|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------|
-| 权重文件路径 | True | 基于load_checkpoint指代的权重续训 | √ |
-| 权重文件路径 | 权重文件名 | resume_training指代的文件名无效,基于load_checkpoint指代的权重续训 | × |
-| 权重文件夹路径 | True | **场景1:"单机"或"多机+共享目录"或"ModelArts"**
① 基于meta.json记录的权重续训,支持故障恢复。
② 若任一rank文件夹下缺少meta.json,所有rank基于最后时间戳的权重续训。
**场景2:"多机+非共享目录"**
所有rank基于最后时间戳的权重续训。
**场景3:"自动恢复训练"**
为方便自动恢复训练功能的使用,可以将load_checkpoint配置为权重checkpoint的保存路径,这样在续训时不需要对配置项load_checkpoint做手动修改。首次开始训练时,该目录为空,会正常随机初始化权重;续训时,会从该目录下保存的checkpoint恢复训练。 | √ |
-| 权重文件夹路径 | 权重文件名 | 基于resume_training指代的权重续训 | √ |
+```text
+output/checkpoint
+ ├── rank_0
+ ├── meta.json
+ └── {prefix}-{epoch}_{step}.safetensors
+ ├── rank_1
+ ├── meta.json
+ └── {prefix}-{epoch}_{step}.safetensors
+ ...
+ ├── rank_7
+ ├── meta.json
+ └── {prefix}-{epoch}_{step}.safetensors
+```
-此外,用户还可通过增改配置文件的如下参数来使用相关功能。
+> 权重名的prefix中携带rank_id信息,如:llama3_1_8b_rank_0;若保存权重时已存在相同prefix的权重,prefix会自动添加自增后缀以防止旧权重被覆盖。如"llama3_1_8b_rank_0"已存在时,prefix会更新为"llama3_1_8b_rank_0_1",若"llama3_1_8b_rank_0_1"也已存在,prefix会更新为"llama3_1_8b_rank_0_2"。
-| 参数 | 描述 |
-|------------------|-------------------------------------------------------------------------------------------------------------|
-| ignore_data_skip | 是否忽略断点续训时跳过数据的机制,而从头开始读取数据集。用于续训时数据集更换的场景。设置为`True`时不会跳过数据集,默认为`False`。 |
-| data_skip_steps | 数据集跳过步数。用于更换数据集续训后再次断开续训或`global batch size`改变的场景,须手动设置此参数来配置新数据集跳过步数,如`global batch size`改变,需向下整除缩放系数后再传入。 |
+策略文件仅在分布式训练任务中保存,用于**权重策略转换**。策略文件以rank_id作为后缀,固定保存为ckpt格式的文件,主要记录了当前rank的网络和优化器切分信息。以单机8卡为例,策略文件保存格式如下:
-### 故障恢复机制
+```text
+output/strategy
+ ├── ckpt_strategy_rank_0.ckpt
+ ├── ckpt_strategy_rank_1.ckpt
+ ...
+ └── ckpt_strategy_rank_7.ckpt
+```
-当`resume_training`设置为`True`时,系统会自动基于`meta.json`记录的权重进行续训。如果某个rank的权重文件缺失或损坏,系统会回退到上一个可用的权重进行恢复。
+> 注:策略文件保存时会覆盖旧文件,为防止覆盖或混杂不同任务的策略文件,请及时将策略文件保存到自定义文件夹。
-> 分布式环境中,断点续训要求所有节点的权重在同一共享目录下。用户可通过环境变量`SHARED_PATHS`来设置共享路径。
+可参考[Ckpt权重](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/ckpt.html)和[Safetensors权重](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/safetensors.html),获取更多权重相关信息。
-## 分布式训练示例
+## YAML参数配置说明
-以下示例演示了如何在单卡和多卡环境中启动断点续训。示例基于 `llama3.1 8b` 模型,相关配置文件[research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml)。
+| 参数 | 描述 |
+| ------------------------ | ------------------------------------------------------------ |
+| load_checkpoint | 权重文件或文件夹路径,**断点续训时必填**,默认为空字符串。<br>当配置的路径为空目录时,会退化为使用随机初始化权重进行预训练。<br>若为单卡权重,可配置为权重文件路径,需要确保文件父目录不以"rank_"开头。 |
+| src_strategy_path_or_dir | 策略文件或文件夹路径,**`auto_trans_ckpt=True`且load_checkpoint为分布式权重**时需要配置,默认为空字符串。<br>若load_checkpoint配置的权重不带流水线并行切分,则可配置为任一策略文件路径,否则配置为策略文件夹路径。 |
+| auto_trans_ckpt | 权重自动转换开关,load_checkpoint配置的**权重和当前任务的分布式策略不匹配**时需要开启,默认为False。 |
+| transform_process_num | 权重自动转换使用进程数,**仅适用于ckpt格式权重的自动转换**,可加速权重转换。默认为`None`不开启。<br>设置值需要能够整除集群总卡数,设置值越大,host内存占用越高,若host内存不足,需要减少进程数。 |
+| resume_training | 断点续训开关,可设置为`True`或任一rank子文件夹下的权重文件名。默认为`False`。<br>为`True`时,**加载最后保存完整的权重**续训。<br>为权重文件名时,**加载指定step的权重**续训。 |
+| load_ckpt_format | load_checkpoint配置的权重格式,可配置为`safetensors`或`ckpt`,默认为`ckpt`。 |
+| remove_redundancy | 去冗余加载开关,load_checkpoint配置的权重为**去冗余保存的safetensors格式权重**时需要开启,默认为`False`。 |
+| load_ckpt_async | 是否将加载权重与模型编译的操作并行执行。该配置**仅适用于ckpt格式权重且分布式策略不变**的异步加载场景。默认为`False`。 |
-### 完整训练
+## 断点续训使用场景介绍
-1. 修改`research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml`:
+### 中断续训
- 如果想首次运行随机初始化训练,并且后续断点续训不改配置文件,可在此时将`resume_training`设置为`True`,并将`load_checkpoint`设为即将保存权重的目录:
+**概述**:正常训练任务异常中断,不改变分布式策略,基于保存的权重重新恢复训练任务。
- ```yaml
- load_checkpoint: './output/checkpoint'
- resume_training: True
- ```
+- 基于最后保存完整的权重续训
- > 一旦目录为空目录,模型权重即会自动随机初始化。因此,如果误设了一个非即将保存权重的空目录,会导致第二次拉起任务时训练从头开始。
+ ```yaml
+ load_checkpoint: /path/to/checkpoint
+ resume_training: True
+ ```
- 根据需要设置并行配置:
+ 系统会自动基于各rank的`meta.json`记录的权重,搜索并加载最后保存完整的权重进行续训。
- ```yaml
- parallel_config:
- data_parallel: 1
- model_parallel: 2
- pipeline_stage: 2
- micro_batch_num: 2
- ```
+ > 若权重文件夹的所有rank子文件夹下均无meta.json,则退化为基于各自rank最后时间戳的权重续训。
- 根据需要设置模型权重保存配置:
+- 基于指定step的权重续训
- ```yaml
- callbacks:
- ...
- - type: CheckpointMonitor
- prefix: "llama3_1_8b"
- save_checkpoint_steps: 10
- keep_checkpoint_max: 3
- integrated_save: False
- async_save: False
- ...
- ```
+ ```yaml
+ load_checkpoint: /path/to/checkpoint
+ # 若为ckpt权重,则填写{prefix}-{epoch}_{step}.ckpt
+ resume_training: {prefix}-{epoch}_{step}.safetensors
+ ```
-2. 准备数据集,此处以 [alpaca 数据集](https://gitee.com/mindspore/mindformers/blob/master/research/llama3_1/README.md#%E6%95%B0%E6%8D%AE%E9%9B%86%E5%8F%8A%E6%9D%83%E9%87%8D%E5%87%86%E5%A4%87)为例,启动4卡分布式训练:
+ 用户需确保指定权重的完整性。各rank会自动替换"prefix"中的rank信息来更新要加载的权重名,比如指定的权重名为`llama3_1_8b_rank_0-200_1.safetensors`,rank_1加载时会将权重名替换为`llama3_1_8b_rank_1-200_1.safetensors`。若某rank下权重缺失,会报错权重文件找不到。
- ```shell
- bash scripts/msrun_launcher.sh "run_mindformer.py \
- --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \
- --train_dataset /path/to/alpaca-fastchat8192.mindrecord \
- --run_mode train \
- --use_parallel True" 4
- ```
+### 策略转换续训
- 在第四次保存完毕后,结束进程,此时 `checkpoint` 下的 `rank_0` 文件夹结构为:
+**概述**:修改了**分布式策略**或**扩大/缩小集群规模**继续训练任务,需要**开启权重自动转换**。
- ```text
- checkpoint/rank_0
- ├── llama3_1_8b_rank_0-10_2.ckpt
- ├── llama3_1_8b_rank_0-15_2.ckpt
- ├── llama3_1_8b_rank_0-20_2.ckpt
- └── meta.json
- ```
+#### safetensors权重
-### 断点续训
+开启权重自动转换,系统会自动合并safetensors权重为[完整权重](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/safetensors.html#完整权重)后进行分布式加载,合并的safetensors权重会落盘到`output/unified_checkpoint`文件夹下;若已经将权重离线合并为[完整权重](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/safetensors.html#完整权重),则会直接进行分布式加载。离线合并步骤请参考[Safetensors权重-权重切分与合并](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/safetensors.html)章节。
-1. 如果在前置训练的配置中,`resume_training`为`False`,此时需修改配置,指定断点续训权重文件:
+- 基于最后保存完整的权重续训
- ```yaml
- load_checkpoint: './output/checkpoint'
- resume_training: True
- ```
+ ```yaml
+ load_checkpoint: /path/to/checkpoint
+ src_strategy_path_or_dir: /path/to/strategy
+ resume_training: True
+ auto_trans_ckpt: True
+ ```
-2. 启动断点续训:
+- 基于指定step的权重续训
- ```shell
- bash scripts/msrun_launcher.sh "run_mindformer.py \
- --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \
- --train_dataset /path/to/alpaca-fastchat8192.mindrecord \
- --run_mode train \
- --use_parallel True" 4
- ```
+ ```yaml
+ load_checkpoint: /path/to/checkpoint
+ src_strategy_path_or_dir: /path/to/strategy
+ resume_training: {prefix}-{epoch}_{step}.safetensors
+ auto_trans_ckpt: True
+ ```
- 如若初始步数从第`42`步开始,则断点续训成功。由于最后保存的权重包含了第`40`步的信息,`sink_size`默认为`2`,即每两步打印一次信息,因此初始步数为`42`。
+- 基于合并的权重续训
-### 切换数据集断点续训
+ ```yaml
+ load_checkpoint: /path/to/unified_checkpoint
+ resume_training: True
+ auto_trans_ckpt: True
+ ```
-在切换数据集并进行断点续训时,有三种主要场景,每个场景需要针对配置文件进行不同的修改。下面逐一介绍每种情况,并详细说明在哪些场景下需要对基本断点续训流程的哪一步进行修改,以及如何修改具体配置来达成预期效果。
+#### ckpt权重
-**场景一:全新数据集,继续训练(无需跳过已训练的步数)**
+开启权重自动转换,系统会自动转换权重到当前任务的分布式策略后进行加载,转换的ckpt权重会落盘到`output/transformed_checkpoint`文件夹下,可用于后续直接加载使用且无需开启权重自动转换。
-在这种场景中,当切换到一个全新数据集时,模型的训练将从新数据集的开头开始,而无需跳过任何步数。对于这种情况,配置文件需要设置为**忽略之前的数据进度**,让模型在新数据集上从头训练。
+若权重的rank子文件夹下存在多个step的权重文件,需要离线对权重进行筛选,确保**每个rank子文件夹下只有需要加载的单个ckpt文件**。
-- **配置修改**:需要在基本断点续训流程的第一步的基础上对`ignore_data_skip`进行设置。将`ignore_data_skip`设置为`True`,表示不跳过数据集。
+```yaml
+load_checkpoint: /path/to/checkpoint
+src_strategy_path_or_dir: /path/to/strategy
+resume_training: True
+auto_trans_ckpt: True
+transform_process_num: 8
+```
- ```yaml
- load_checkpoint: './output/checkpoint'
- resume_training: True
- ignore_data_skip: True
- ```
+### 增量续训
-- **预期效果**:模型将在新数据集上从头训练,而不会跳过任何步数。
+**概述**:训练数据集需要**边生产边训练**,当前数据集训练结束后,加入新生产的数据集继续训练,直到所有数据集训练完毕。该场景需要用户基于训练的总数据量,提前预设学习率曲线的总步数。
-**场景二:在新数据集上断点续训,并跳过部分已训练的步数**
+假设一共训练10T tokens数据,每次生产的数据集只包含1T tokens数据,整个训练过程分10个epoch训完,一共需要花费100000steps。
-在这种情况下,模型在新数据集上已经训练了一部分(例如断开前已训练了`2`步),期望从上次中断的地方继续训练。此时,必须手动指定需要跳过的步数。
+- 步骤1:预设总训练步数,固定整个训练流程的学习率曲线
-- **配置修改**:需要在基本断点续训流程的第一步的基础上对`ignore_data_skip`和`data_skip_steps`进行设置。将`ignore_data_skip`设置为`False`,并且通过`data_skip_steps`指定要跳过的已训练步数(例如,跳过`2`步)。
+ ```yaml
+ lr_schedule:
+ total_steps: 100000
+ ```
- ```yaml
- load_checkpoint: './output/checkpoint'
- resume_training: True
- ignore_data_skip: False
- data_skip_steps: 2
- ```
+- 步骤2:设置足够大的epoch值,确保能够训完所有数据集
-- **预期效果**:模型将跳过新数据集的前`2`步,从第`3`步开始继续训练。
+ ```yaml
+ runner_config:
+ epochs: 15
+ ```
-**场景三:在新数据集上断点续训时,`global batch size`发生变化**
+ > 整个训练过程的学习率曲线已固定,epochs值设置不会影响学习率,可以设置较大值,确保能训完10个数据集。
-如果在新数据集上继续断点续训时,`global batch size`改变了(例如,变为原先的 2 倍),手动指定需跳过的步数时需要对已训练的步数进行缩放。具体来说,跳过的步数需要根据缩放系数向下整除。例如,如果`global batch size`变为原先的`2`倍,需跳过的步数则相应减少一半。
+- 步骤3:数据集训完1个epoch后,可以更换数据集续训,如下为基于最后保存完整的权重续训,其他续训方式请参考[中断续训](#中断续训)或[策略转换续训](#策略转换续训)。
-- **配置修改**:需要在场景二的基础上对`data_skip_steps`进行调整。将`data_skip_steps`设置为缩放后的步数。例如,`global batch size`变为原先的`2`倍,需跳过的步数变为`1`(向下整除)。
+ ```yaml
+ load_checkpoint: /path/to/checkpoint
+ resume_training: True
+ ```
- ```yaml
- load_checkpoint: './output/checkpoint'
- resume_training: True
- ignore_data_skip: False
- data_skip_steps: 1
- ```
+ > 由于各个数据集样本数量不一致,更换数据集续训,显示的epoch和step可能发生变化,但是当前训练的总step数不变,为正常现象。
-- **预期效果**:模型将根据新的`global batch size`调整跳过的步数,并从正确的地方继续训练。
+### 自动恢复续训
-### 故障恢复示例
+**概述**:为方便平台能够自动拉起断点续训,无需人工干预,可以将load_checkpoint配置为权重checkpoint的保存路径,首次开始训练时,该目录为空,会正常随机初始化权重;续训时,会基于该目录下最后保存完整的权重恢复训练。
-当部分权重文件缺失时,系统会自动基于上一个可用的权重进行恢复。
+```yaml
+load_checkpoint: /path/to/output/checkpoint
+resume_training: True
+```
-1. 删除`rank_3`下的`llama3_1_8b_rank_0-20_2.ckpt`文件。删除后文件夹结构应为:
+## 注意事项和建议
- ```text
- checkpoint/rank_3
- ├── llama3_1_8b_rank_0-10_2.ckpt
- ├── llama3_1_8b_rank_0-15_2.ckpt
- └── meta.json
- ```
-
-2. 修改配置,启用故障恢复:
-
- ```yaml
- load_checkpoint: './output/checkpoint'
- resume_training: True
- ```
-
-3. 启动分布式训练:
-
- ```shell
- bash scripts/msrun_launcher.sh "run_mindformer.py \
- --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \
- --train_dataset /path/to/alpaca-fastchat8192.mindrecord \
- --run_mode train \
- --use_parallel True" 4
- ```
-
- 如若初始步数从第`32`步开始,则断点续训成功。由于`rank_3`下的包含了第`40`步的信息的权重被删除,因此自动使用上一次保存的权重,即包含第
- `30`步信息的权重。由于`sink_size`默认为`2`,即每两步打印一次信息,因此初始步数为`32`。
-
-## 注意事项
-
-- **数据下沉模式**:分布式断点续训必须开启数据下沉模式,配置`sink_mode=True`。
-- **权重文件检查**:确保断点续训加载的权重为训练中断时的权重,而不是整个训练过程最后保存的权重,否则会报错。
+- 分布式断点续训必须开启**数据下沉模式**,配置`sink_mode=True`。
+- 建议配置`SHARED_PATHS`环境变量为最上层共享目录路径,比如`/data01`是共享目录,工程目录在该目录下,配置`export SHARED_PATHS=/data01`。
+- 建议不同分布式策略训练任务的权重和策略文件分开文件夹保存。