# NPU-IIL-繁星-AscendCOperatorS2

**Repository Path**: Nicet/npu-ill-stars-operator-s2

## Basic Information

- **Project Name**: NPU-IIL-繁星-AscendCOperatorS2
- **Description**: Operator repository of team NPU-IIL-繁星 for both tracks of Season 2 (S2) of the Ascend Operator Challenge
- **Primary Language**: C
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-06-26
- **Last Updated**: 2024-09-27

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

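# Ascend Native Operator Development Challenge, Season 2 (S2): NPU-IIL-繁星

## Operator development workflow, using Gelu as an example

0. Create the operator project: create the operator directory and write the JSON configuration file based on the operator prototype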
```
/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/bin/msopgen gen -i ./Gelu.json -c ai_core-Ascend310B -lan cpp -out ./Gelu
```
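1. Update the toolkit path in CMakePresets.json
2. Write op_host and op_kernel
3. Build the operator and run the generated `.run` package: `cd S2/9.Gelu/Gelu && ./build.sh && cd build_out && ./c*.run`
4. Write the tests; this requires modifying `scripts/gen_data.py`, `scripts/verify_result.py`, `src/main.cpp`, `src/op_runner.cpp`, `run.sh`, and `inc/operator_desc.h`
5. Run the tests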
```
rm -rf ./*put && bash run.sh
```
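
Step 4 depends on `scripts/gen_data.py` and `scripts/verify_result.py`, whose contents are not shown in this README. The sketch below illustrates the kind of logic such scripts usually contain for Gelu: a reference ("golden") output computed from the erf form of GELU, and an element-wise tolerance check. The function names, the input range, the tolerances, and the use of the Python standard library instead of NumPy are all assumptions made for illustration, not the repository's actual code.

```python
import math
import random


def gelu_reference(x: float) -> float:
    """Erf-based GELU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def generate_case(count: int, seed: int = 0):
    """Return (inputs, golden) for `count` random values in [-5, 5]."""
    rng = random.Random(seed)
    inputs = [rng.uniform(-5.0, 5.0) for _ in range(count)]
    golden = [gelu_reference(v) for v in inputs]
    return inputs, golden


def verify(result, golden, rtol: float = 1e-3, atol: float = 1e-3) -> bool:
    """True if every element satisfies |r - g| <= atol + rtol * |g|."""
    if len(result) != len(golden):
        return False
    return all(abs(r - g) <= atol + rtol * abs(g) for r, g in zip(result, golden))


if __name__ == "__main__":
    inputs, golden = generate_case(1024)
    # Stand-in for the operator output: the golden values plus a small perturbation.
    fake_result = [g + 1e-5 for g in golden]
    print("test pass" if verify(fake_result, golden) else "test failed")
```

In a real test flow the inputs and golden values would be written to binary files for `src/main.cpp` to consume, and the check would read the operator's output file instead of `fake_result`.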

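## Submission results: basic operator development track

1. ThreeNN 4/5
    - Consider using the level-0 API for sliced computation
2. Cumsum 4/5
    - fp16: the running sum is kept in fp32, but precision problems appear once the element count grows -> re-read the stored value after each assignment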
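4. GlobalAvgPool 5/5
    - Naive implementation; UB usage is not tuned, and the buffer number is 1
    - For fp16, cast to fp32 before ReduceSum
    - With the most naive memory access the score was still 2/5
    - https://github.com/onnx/onnx/blob/main/docs/Operators.md#GlobalAveragePool
    - As with Cumsum, the fp16 ReduceSum result has a large error
5. Lerp 5/5
    - Naive implementation
    - The non-broadcast fp32 case uses the tiling approach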
6. Histogram 5/5
    - Attempted optimization: follow the PyTorch source implementation
        - https://github.com/pytorch/pytorch/blob/71efbf701d594ffa31673147f82186386c79eb18/aten/src/ATen/native/cpu/RangeFactoriesKernel.cpp#L45
        - https://github.com/pytorch/pytorch/blob/71efbf701d594ffa31673147f82186386c79eb18/aten/src/ATen/native/cpu/HistogramKernel.cpp#L79
        - https://github.com/pytorch/pytorch/blob/a5f816df18f49619cfb15ffcd3b74606a495c4e2/aten/src/ATen/native/Histogram.cpp#L390
    - Attributes must be indexed starting from 0, even when their types differ
    - Performance test cases passed, 8-4: 72470.44
    - minValue == maxValue -> equivalent to both being 0
    - Handle NaN values (they occur only in fp32 and fp16)
    - 4: 72151.0, fp32-2: 72519.69, fp16-2: - -> the failing case is probably an int32 issue
    - For int32, the data type must be converted when computing the index
7. AsStrided 4/5
    - Naive implementation
8. Tril 4/5
9. Gelu 4/5
    - Attempted optimization: compute via erf
        - https://github.com/pytorch/pytorch/blob/312652c3258a3a8fec8fbfe6a9e8887e23d39c13/aten/src/ATen/cpu/vec/vec512/vec512_float.h#L190
    - erf approximation
        - https://personal.math.ubc.ca/%7Ecbm/aands/page_299.htm
        - Target: 643.755014, result: 1444.96
10. Triu 4/5
    - Assigning with nowtileLength inside the loop caused a 1/5 result; the cause was that the case where the % operation yields 0 was not handled
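
The Cumsum and GlobalAvgPool notes above both concern fp16 accumulation error. The following sketch shows, on the host side and with the standard library only, why a running sum kept in a wider type can disagree with one that is rounded to fp16 after every store. It is one plausible reading of the "re-read after each assignment" note, not the kernel code: the function names are hypothetical, and Python's double stands in for the fp32 accumulator.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def cumsum_wide_accumulator(values):
    """Keep the running sum in higher precision; round only the stored outputs."""
    acc = 0.0
    out = []
    for v in values:
        acc += v
        out.append(to_fp16(acc))
    return out


def cumsum_reread(values):
    """Round the running sum to fp16 after every store, then continue from it."""
    acc = 0.0
    out = []
    for v in values:
        acc = to_fp16(acc + v)  # value as it would be read back from fp16 storage
        out.append(acc)
    return out


if __name__ == "__main__":
    data = [to_fp16(0.1)] * 4096
    wide = cumsum_wide_accumulator(data)
    reread = cumsum_reread(data)
    mismatches = sum(1 for a, b in zip(wide, reread) if a != b)
    print(f"last (wide accumulator): {wide[-1]}")
    print(f"last (re-read fp16):     {reread[-1]}")
    print(f"differing positions:     {mismatches} of {len(data)}")
```

Which variant counts as correct depends on how the golden data is produced: if the reference accumulates in fp16, the re-read variant matches it, while the wide accumulator drifts away as the element count grows.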

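The Histogram notes above (equal bounds treated like 0/0, NaN handling, type conversion before computing the index for int32) can be summarized in a small host-side reference. This is a minimal sketch loosely modeled on `torch.histc`-style binning as described by those notes and the linked sources; the function name `histc_reference` is hypothetical, and excluding NaN when deriving the automatic range is an assumption of this sketch.

```python
import math


def histc_reference(values, bins: int, min_value: float = 0.0, max_value: float = 0.0):
    """Count `values` into `bins` equal-width bins over [min_value, max_value]."""
    left, right = float(min_value), float(max_value)
    finite = [float(v) for v in values if not math.isnan(v)]
    if left == right and finite:
        # Equal bounds (including the default 0, 0): fall back to the data range.
        left, right = min(finite), max(finite)
    if left == right:
        # Degenerate range: widen it so the bin width is non-zero.
        left, right = left - 1.0, right + 1.0

    counts = [0] * bins
    for v in values:
        x = float(v)  # integer inputs are converted before the index arithmetic
        if not (left <= x <= right):
            continue  # skips out-of-range values and NaN (comparisons with NaN are False)
        index = int((x - left) / (right - left) * bins)
        if index == bins:
            index -= 1  # the right edge belongs to the last bin
        counts[index] += 1
    return counts


if __name__ == "__main__":
    print(histc_reference([1, 2, 1, 4], bins=4, min_value=0, max_value=4))  # [0, 2, 1, 1]
```

Writing the range check as `not (left <= x <= right)` lets a single comparison reject both out-of-range values and NaN, which avoids a separate NaN branch in the inner loop.
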
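## Directory layout

1. Operators with a numeric label all belong to the S2 basic operator development track (preliminary round only)
2. Operators for the performance challenge track (preliminary round and final) are all under the `performance-ops` directory
    - GroupNormV2: preliminary round
    - BallQuery: preliminary round
    - DepthToSpace: preliminary round
    - Pdist: final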