CANN/asc-devkit L0C到GM数据搬运-尧图企业网站定制

L0C到GM数据搬运DataCopy【免费下载链接】asc-devkit本项目是CANN 推出的昇腾AI处理器专用的算子程序开发语言原生支持C和C标准规范主要由类库和语言扩展层构成提供多层级API满足多维场景算子开发诉求。项目地址: https://gitcode.com/cann/asc-devkit产品支持情况产品是否支持Ascend 950PR/Ascend 950DT√Atlas A3 训练系列产品/Atlas A3 推理系列产品√Atlas A2 训练系列产品/Atlas A2 推理系列产品√Atlas 200I/500 A2 推理产品√Atlas 推理系列产品AI CorexAtlas 推理系列产品Vector CorexAtlas 训练系列产品x功能说明头文件路径为basic_api/kernel_operator_data_copy_intf.h。矩阵计算的结果存放在L0C BufferDataCopy接口用于将结果搬运至Global MemoryGM中并且在搬运过程中支持随路格式转换等操作。以如下产品型号为例Atlas A3 训练系列产品/Atlas A3 推理系列产品Atlas A2 训练系列产品/Atlas A2 推理系列产品下图展示了随路量化、随路ReLU、随路格式转换、随路通道拆分以及随路通道合并的有效组合、中间数据类型和数据路径。下图中的F32-F16与F32-BF16为非量化模式仅为Cast其余为随路scalar/tensor量化模式。图 1L0C2GM流程图函数原型DataCopy矩阵搬出接口支持多种随路能力的组合需要设置不同的寄存器配合数据搬运指令开启不同的数据搬运能力对应的接口如下数据搬运接口通路L0C BufferCO1-GM配合设置寄存器实现量化和ReLU激活NZ到ND格式的转换函数原型为template typename T, typename U __aicore__ inline void DataCopy(const GlobalTensorT dst, const LocalTensorU src, const DataCopyCO12DstParams intriParams)SetFixPipeConfig寄存器设置接口通过调用该接口设置Vector随路量化其中tensor的每个元素都代表一个量化参数启用tensor量化时需要设置。SetFixpipePreQuantFlag寄存器设置接口通过调用该接口设置Scalar随路量化参数此元素代表整个输出矩阵使用的量化参数启用Scalar量化时需要设置。SetFixpipeNz2ndFlag寄存器设置接口通过调用该接口设置随路NZ2ND格式转换配置使用随路NZ2ND需要设置。针对Atlas 200I/500 A2 推理产品还支持如下两个接口SetFixPipeClipRelu寄存器设置接口通过调用该接口设置ClipReLU操作的最大值。SetFixPipeAddr寄存器设置接口通过调用该接口设置Elementwise操作时LocalTensor的地址。参数说明表 1数据搬运DataCopy模板参数说明参数名描述T目的操作数的数据类型。支持的数据类型请参考数据类型。U源操作数的数据类型。支持的数据类型请参考数据类型。表 2数据搬运DataCopy接口参数说明参数名称输入/输出含义dst输出目的操作数类型为GlobalTensor数据格式为NZ、ND格式ND地址要求满足1字节对齐NZ地址需要满足32字节对齐。src输入源操作数类型为LocalTensor支持的物理地址为L0C BufferTPosition为CO1为Mmad接口计算的结果。数据格式为NZ格式地址需要满足6对齐。intriParams输入搬运参数类型为DataCopyCO12DstParams。具体定义请参考${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_data_copy.h${INSTALL_DIR}请替换为CANN软件安装后文件存储路径。表 3DataCopyCO12DstParams结构体参数定义参数名称含义sid此参数用户无需关注设置为0即可。nSize源NZ矩阵在N方向上的大小。取值范围nSize∈[0, 4095]nSize必须为16的倍数• 对于目的矩阵NZ输出输出类型为float类型时若开启channelSplit功能nSize必须为8的倍数。注nSize0表示不执行搬运该接口将被视为NOP空操作。mSize源NZ矩阵在M方向上的大小。• 不开启随路NZ2ND功能NZ2NZ搬运取值范围为mSize∈[0, 65535]。• 开启随路NZ2ND功能取值范围为mSize∈[0, 8192]。注mSize0表示不执行搬运该接口将被视为NOP空操作。dstStride• 不开启NZ2ND功能NZ2NZ搬运目的NZ矩阵中相邻Z排布的起始地址偏移取值不为0 单位为datablock32字节。• 开启随路NZ2ND功能目的ND矩阵每一行中的元素个数取值不为0 单位为element。srcStride源NZ矩阵中相邻Z排布的起始地址偏移取值范围为srcStride∈[0, 65535]单位为C0_Size16*sizeof(T)T为src的数据类型其值应填成mSize对16向上取整。unitFlagunitFlag是一种Mmad指令和Fixpipe指令细粒度的并行开启该功能后硬件每计算完一个分形计算结果就会被搬出。取值说明如下• 02b00不开启unitFlag。• 12b01无效值。• 22b10开启unitFlag硬件执行完指令之后不复位单元标记位。• 32b11开启unitFlag硬件执行完指令之后复位单元标记位。开启该功能时须将Mmad指令和Fixpipe指令的unitFlag值设置为2或3。参数设置方案和特性细节可参考Mmad计算中关键特性说明的UnitFlag章节。clipReluPre该参数仅在Atlas 200I/500 A2 推理产品支持。用于配置是否开启ClipReLU操作参数类型为uint8_t取值如下0不开启ClipReLU1开启ClipReLU此时需要调用 SetFixPipeClipRelu来设置clipReLU的最大值。• 该操作在随路量化后进行quantPre配置后才能使用当前支持的量化模式有F322F16/DEQF16/VDEQF16/QF322B8_PRE/VQF322B8_PRE/REQ8/VREQ8。eltWiseOp该参数仅在Atlas 200I/500 A2 推理产品支持。用于配置是否开启Elementwise操作及操作模式。Elementwise操作是指进行随路量化后可以逐个元素加/减一个LocalTensor大小为mSize * nSize具体LocalTensor地址相关参数需要调用 SetFixPipeAddr来设置。eltWiseOp参数类型为uint8_t取值如下• 0不开启Elementwise• 1Elementwise Addition• 2Elementwise SubtractionquantPre用于控制量化模式QuantMode_t类型具体定义如下• float/int32_t输出此需配置为QuantMode_t::NoQuant。• half/bfloat16_t输出此参数需配置为QuantMode_t::F322F16/QuantMode_t::F322BF16。• 配置为scalar量化时需要调用SetFixpipePreQuantFlag接口来设置scalar量化参数。• 配置为tensor量化时需要调用SetFixPipeConfig来设置tensor量化参数其中tensor量化参数需要通过DataCopy从L1 Buffer搬运至Fixpipe Buffer。注此参数需要用户手动配置不会自动推导配置对应量化模式。enum QuantMode_t{NoQuant, // 不开启量化功能F322F16, // Float32_2_Float16float cast成halfcast mode为CAST_RINT模式F322BF16, // Float32_2_BFloat16float cast成bfloat16_tcast mode为CAST_RINT模式DEQF16, // DeQuant_Float16int32_t量化成halfscalar量化VDEQF16, // Vector_DeQuant_Float16int32_t量化成halftensor量化QF322B8_PRE, // Quant_Float32_2_B8float量化成int8_t/uint8_tscalar量化VQF322B8_PRE, // Vector_Quant_Float32_2_B8float量化成int8_t/uint8_ttensor量化REQ8, // ReQuant_int8int32_t量化成int8_t/uint8_tscalar量化VREQ8, // Vector_ReQuant_int8int32_t量化成int8_t/uint8_ttensor量化};reluPre用于配置ReLU操作的模式类型为uint8_t取值如下。• 0不开启ReLU• 1Normal ReLUchannelSplit类型为bool配置是否开启通道切分功能仅在L0C Buffer(CO1) - GM通路下NZ格式float类型输出时生效。• false不开启• true开启nz2ndEn类型为bool配置是否开启NZ2ND的格式转换仅在L0C Buffer(CO1) - GM通路生效。如果要开启NZ2ND的功能需要同步调用SetFixpipeNz2ndFlag来设置格式转换的相关配置信息• false不开启• true开启数据类型源矩阵与目的矩阵支持的数据类型组合源矩阵L0C Buffer目的矩阵GMfloatint8_t、uint8_t、half、bfloat16_t、floatint32_tint8_t、uint8_t、half、int32_t返回值说明无约束说明对于量化输入为float32数据类型的说明如下标准的IEEE 754 float32格式为1bit符号位8bits指数位23bits尾数位当前AI处理器支持的float32格式为1bit符号位8bits指数位10bits尾数位。如果用户提供的是标准的IEEE 754 float32输入API内部会处理成处理器支持的float32格式进行计算此时如果golden数据生成过程中使用的是标准的IEEE 754 float32数据则可能引入精度不匹配问题需要修正golden数据的生成将量化参数的23bits尾数位的低13bits数据位清零再参与量化计算。源矩阵NZ格式地址要求6对齐目的矩阵ND格式地址要求满足1字节对齐NZ格式地址需要满足32字节对齐。当搬出的mSize或nSize中的任意一个值为0时该指令不会被执行。量化和ReLU参数不能为INF/NAN和非规格化数。目标数据不能有重叠。如果对目的地址有重叠写入硬件不会报告任何警告和错误也不保证重叠数据的写入顺序。unitFlag特性开启需要配合Mmad同时开启。针对如下产品型号特殊值/边界值约束说明如下Atlas A3 训练系列产品/Atlas A3 推理系列产品Atlas A2 训练系列产品/Atlas A2 推理系列产品对于浮点类型INF/NAN输入输出可以通过CTRL寄存器控制寄存器的CTRL[48]比特位进行设置控制浮点数量化搬出时的饱和模式非饱和模式CTRL[48]设置成1b1INF/NAN保持原输出。饱和模式CTRL[48]设置成1b0INF输出会被饱和为±MAX NaN输出会被饱和为0。// 设置CTRL[48]为0开启浮点数饱和模式 AscendC::AscendCUtils::SetOverflow(0);对于整数类型只有饱和模式。调用示例DataCopy完整样例请参考data_copy_l0c2gm示例Mmad含有矩阵乘偏置左矩阵和右矩阵的数据类型为int8_t结果矩阵的数据类型为int32_t。量化模式DEQF16Scalar量化参数为2.0将Mmad计算出的结果由int32_t量化成half并搬出。// Scalar量化量化参数为2.0 float quantScalar 2.0; uint64_t deqScalar static_castuint64_t(*reinterpret_castint32_t*(quantScalar)); // 将量化参数的标量写入寄存器供后续DataCopy指令使用 AscendC::SetFixpipePreQuantFlag(deqScalar); // 创建DataCopy的参数 AscendC::DataCopyCO12DstParams intriParams; intriParams.nSize n; intriParams.mSize m; intriParams.srcStride CeilAlign(m, CUBE_BLOCK); intriParams.dstStride n; intriParams.quantPre QuantMode_t::DEQF16; intriParams.reluPre 1; // 开启ReLU intriParams.nz2ndEn true; // 开启NZ2ND格式转换 // 根据intriParams中的参数执行最终的数据搬运 AscendC::DataCopy(cGM, cLocal, intriParams);示例Mmad含有矩阵乘偏置左矩阵和右矩阵的数据类型为int8_t结果矩阵的数据类型为int32_t。量化模式VDEQF16Tensor量化将Mmad计算出的结果由int32_t量化成half并搬出。// CeilAlign定义如下 __aicore__ inline uint16_t CeilAlign(uint16_t numerator, uint16_t denominator) { return (numerator denominator - 1) / denominator * denominator; } // 将GM中的量化数据 (quantAlphaGM) 拷贝到C1quantAlphaTensor uint16_t burstLen CeilAlign(n * sizeof(uint64_t), 128) / AscendC::ONE_BLK_SIZE; AscendC::DataCopyParams intriParams{ 1, burstLen, 0, 0 }; AscendC::DataCopy(quantAlphaTensor, quantAlphaGM, intriParams); // 设置同步确保量化数据拷贝到C1后执行后续DataCopy指令 AscendC::SetFlagAscendC::HardEvent::MTE2_FIX(EVENT_ID0); AscendC::WaitFlagAscendC::HardEvent::MTE2_FIX(EVENT_ID0); // 将C1中的量化数据quantAlphaTensor拷贝到C2PIPE2GMfbTensor uint16_t fbufBurstLen CeilAlign(deqDataSize, 128) / 128; AscendC::DataCopyParams dataCopyParams(1, fbufBurstLen, 0, 0); AscendC::DataCopy(fbTensor, quantAlphaTensor, dataCopyParams); // 将量化参数数据写入寄存器供后续DataCopy指令使用 AscendC::SetFixPipeConfig(fbTensor); // 创建DataCopy的参数, AscendC::DataCopyCO12DstParams intriParams; intriParams.nSize CeilAlign(n, CUBE_BLOCK); intriParams.mSize m; intriParams.srcStride CeilAlign(m, CUBE_BLOCK); intriParams.dstStride m * C0_SIZE / AscendC::ONE_BLK_SIZE; // C0_SIZE 32 intriParams.quantPre QuantMode_t::VDEQF16; intriParams.reluPre 1; // 开启ReLU // 根据intriParams中的参数执行最终的数据搬运 AscendC::DataCopy(cGM, cLocal, intriParams);针对Atlas 200I/500 A2 推理产品示例Mmad含有矩阵乘偏置左矩阵和右矩阵的数据类型为int8_t结果矩阵的数据类型为int32_t。量化模式DEQF16scalar量化参数为0.5将Mmad计算出的结果由int32_t量化成half并搬出。#ifdef ASCENDC_CPU_DEBUG #include tikicpulib.h #endif #include kernel_operator.h #include ../../instrs/common_utils/register_utils.h template typename dst_T, typename fmap_T, typename weight_T, typename dstCO1_T class KernelCubeDataCopy{ public: __aicore__ inline KernelCubeDataCopy(uint16_t CoutIn, uint8_t dilationHIn, uint8_t dilationWIn, QuantMode_t deqModeIn) { // ceiling of 16 Cout CoutIn; dilationH dilationHIn; dilationW dilationWIn; C0 32 / sizeof(fmap_T); C1 channelSize / C0; coutBlocks (Cout 16 - 1) / 16; ho H - dilationH * (Kh - 1); wo W - dilationW * (Kw - 1); howo ho * wo; howoRound ((howo 16 - 1) / 16) * 16; featureMapA1Size C1 * H * W * C0; // shape: [C1, H, W, C0] weightA1Size C1 * Kh * Kw * Cout * C0; // shape: [C1, Kh, Kw, Cout, C0] featureMapA2Size howoRound * (C1 * Kh * Kw * C0); weightB2Size (C1 * Kh * Kw * C0) * coutBlocks * 16; m howo; k C1 * Kh * Kw * C0; n Cout; biasSize Cout; // shape: [Cout] dstSize coutBlocks * howo * 16; // shape: [coutBlocks, howo, 16] dstCO1Size coutBlocks * howoRound * 16; fmRepeat featureMapA2Size / (16 * C0); weRepeat weightB2Size / (16 * C0); deqMode deqModeIn; } __aicore__ inline void Init(__gm__ uint8_t* fmGm, __gm__ uint8_t* weGm, __gm__ uint8_t* biasGm, __gm__ uint8_t* deqGm, __gm__ uint8_t* eleWiseGm, __gm__ uint8_t* dstGm) { fmGlobal.SetGlobalBuffer((__gm__ fmap_T*)fmGm); weGlobal.SetGlobalBuffer((__gm__ weight_T*)weGm); biasGlobal.SetGlobalBuffer((__gm__ dstCO1_T*)biasGm); deqGlobal.SetGlobalBuffer((__gm__ uint64_t*)deqGm); dstGlobal.SetGlobalBuffer((__gm__ dst_T*)dstGm); eleWiseGlobal.SetGlobalBuffer((__gm__ half*)eleWiseGm); pipe.InitBuffer(inQueueFmA1, 1, featureMapA1Size * sizeof(fmap_T)); pipe.InitBuffer(inQueueFmA2, 1, featureMapA2Size * sizeof(fmap_T)); pipe.InitBuffer(inQueueWeB1, 1, weightA1Size * sizeof(weight_T)); pipe.InitBuffer(inQueueWeB2, 1, weightB2Size * sizeof(weight_T)); pipe.InitBuffer(inQueueBiasA1, 1, biasSize * sizeof(dstCO1_T)); pipe.InitBuffer(inQueueDeqA1, 1, dstCO1Size * sizeof(uint64_t)); pipe.InitBuffer(inQueueDeqFB, 1, dstCO1Size * sizeof(uint64_t)); pipe.InitBuffer(outQueueCO1, 1, dstCO1Size * sizeof(dstCO1_T)); pipe.InitBuffer(inQueueC1, 1, dstSize * sizeof(half)); } __aicore__ inline void Process() { CopyIn(); Split(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensorfmap_T featureMapA1 inQueueFmA1.AllocTensorfmap_T(); AscendC::LocalTensorweight_T weightB1 inQueueWeB1.AllocTensorweight_T(); AscendC::LocalTensordstCO1_T biasA1 inQueueBiasA1.AllocTensordstCO1_T(); AscendC::DataCopy(featureMapA1, fmGlobal, { 1, static_castuint16_t(featureMapA1Size * sizeof(fmap_T) / 32), 0, 0 }); AscendC::DataCopy(weightB1, weGlobal, { 1, static_castuint16_t(weightA1Size * sizeof(weight_T) / 32), 0, 0 }); AscendC::DataCopy(biasA1, biasGlobal, { 1, static_castuint16_t(biasSize * sizeof(dstCO1_T) / 32), 0, 0 }); inQueueFmA1.EnQue(featureMapA1); inQueueWeB1.EnQue(weightB1); inQueueBiasA1.EnQue(biasA1); } __aicore__ inline void Split() { AscendC::LocalTensorfmap_T featureMapA1 inQueueFmA1.DeQuefmap_T(); AscendC::LocalTensorweight_T weightB1 inQueueWeB1.DeQueweight_T(); AscendC::LocalTensorfmap_T featureMapA2 inQueueFmA2.AllocTensorfmap_T(); AscendC::LocalTensorweight_T weightB2 inQueueWeB2.AllocTensorweight_T(); uint8_t padList[] {0, 0, 0, 0}; // load3dv2 AscendC::LoadData(featureMapA2, featureMapA1, { padList, H, W, channelSize, k, howoRound, 0, 0, 1, 1, Kw, Kh, dilationW, dilationH, false, false, 0 }); // load2d AscendC::LoadData(weightB2, weightB1, { 0, weRepeat, 1, 0, 0, false, 0 }); inQueueFmA2.EnQuefmap_T(featureMapA2); inQueueWeB2.EnQueweight_T(weightB2); inQueueFmA1.FreeTensor(featureMapA1); inQueueWeB1.FreeTensor(weightB1); } __aicore__ inline void Compute() { AscendC::LocalTensorfmap_T featureMapA2 inQueueFmA2.DeQuefmap_T(); AscendC::LocalTensorweight_T weightB2 inQueueWeB2.DeQueweight_T(); AscendC::LocalTensordstCO1_T dstCO1 outQueueCO1.AllocTensordstCO1_T(); AscendC::LocalTensordstCO1_T biasA1 inQueueBiasA1.DeQuedstCO1_T(); // C A * B bias // m左矩阵Heightk左矩阵Widthn右矩阵Width AscendC::Mmad(dstCO1, featureMapA2, weightB2, biasA1, { m, n, k, true, 0, false, false, false }); outQueueCO1.EnQuedstCO1_T(dstCO1); inQueueFmA2.FreeTensor(featureMapA2); inQueueWeB2.FreeTensor(weightB2); } __aicore__ inline void CopyOut() { AscendC::LocalTensordstCO1_T dstCO1 outQueueCO1.DeQuedstCO1_T(); // 开启DEQF16量化量化参数设置为0.5 float tmp (float)0.5; // 将float的tmp转换成uint64_t的deqScalar uint64_t deqScalar static_castuint64_t(*reinterpret_castint32_t*(tmp)); bool nz2ndEn false; // nz2nd不开启时nSize必须为16的倍数 uint16_t nSize coutBlocks * 16; uint16_t mSize m; // srcStride必须为16的倍数 uint16_t srcStride (m 16 - 1) / 16 * 16; // nz2nd不开启时dstStride为burst头到头的距离且为32字节对齐 uint32_t dstStride m * sizeof(dst_T) * 16 / 32; if (nz2ndEn) { // nd矩阵的数量为1src_nd_stride与dst_nd_stride填1 AscendC::SetFixpipeNz2ndFlag(1, 1, 1); // nz2nd开启时nSize可以不为16的倍数与Mmad的n保持一致 nSize n; // nz2nd开启时dstStride表示同一nd矩阵的相邻连续行的间隔与n保持一致 dstStride nSize; }; // 不开启ReLU与channelSplit AscendC::DataCopyCO12DstParams intriParams(nSize, mSize, dstStride, srcStride, deqMode, 0, false, nz2ndEn); // mov l0c to gm, deq scalar quant AscendC::SetFixpipePreQuantFlag(deqScalar); // 设置量化参数 AscendC::PipeBarrierPIPE_FIX(); AscendC::DataCopy(dstGlobal, dstCO1, intriParams); // // mov l0c to gm, deq tensor quant // // 需要额外申请deq tensor的gm空间将值搬运到workA1 // AscendC::LocalTensoruint64_t workA1 inQueueDeqA1.AllocTensoruint64_t(); // // deq tensor的size // uint16_t deqSize 128; // AscendC::DataCopy(workA1, deqGlobal, deqSize); // // deq tensor在fix上的地址 // AscendC::LocalTensoruint64_t deqFB inQueueDeqFB.AllocTensoruint64_t(); // // l1-fix, burst_len unit is 128Bytes // uint16_t fbufBurstLen deqSize / 128; // AscendC::DataCopyParams dataCopyParams(1, fbufBurstLen, 0, 0); // AscendC::DataCopy(deqFB, workA1, dataCopyParams); // // 设置量化tensor // AscendC::SetFixPipeConfig(deqFB); // AscendC::PipeBarrierPIPE_FIX(); // // mov l0c to gm量化操作后开启ClipReLU操作 // intriParams.clipReluPre 1; // // 设置clip ReLU的值到寄存器 // uint64_t clipReluVal 0x3c00; // value 1, half // SetFixPipeClipRelu(clipReluVal); // //mov l0c to gm量化操作后设置element-wise操作Add // intriParams.eltWiseOp 1; // // 需要额外申请element-wise tensor的gm空间将值搬到eleWiseTensor // AscendC::LocalTensorhalf eleWiseTensor inQueueC1.AllocTensorhalf(); // DataCopy(eleWiseTensor, eleWiseGlobal, { 1, static_castuint16_t(sizeof(half) * dst_size / 32), 0, 0 }); // AscendC::PipeBarrierPIPE_ALL(); // // 将存放element-wise tensor的地址设置到寄存器里 // SetFixPipeAddr(eleWiseTensor, 1); // AscendC::DataCopy(dstGlobal, dstCO1, intriParams); // inQueueDeqA1.FreeTensor(workA1); // inQueueDeqFB.FreeTensor(deqFB); // outQueueCO1.FreeTensor(dstCO1); // inQueueC1.FreeTensor(eleWiseTensor); } private: AscendC::TPipe pipe; // feature map queue AscendC::TQueAscendC::TPosition::A1, 1 inQueueFmA1; AscendC::TQueAscendC::TPosition::A2, 1 inQueueFmA2; // weight queue AscendC::TQueAscendC::TPosition::B1, 1 inQueueWeB1; AscendC::TQueAscendC::TPosition::B2, 1 inQueueWeB2; // bias queue AscendC::TQueAscendC::TPosition::A1, 1 inQueueBiasA1; // deq tensor queue AscendC::TQueAscendC::TPosition::A1, 1 inQueueDeqA1; // fb dst of deq tensor AscendC::TQueAscendC::TPosition::C2PIPE2GM, 1 inQueueDeqFB; // dst queue AscendC::TQueAscendC::TPosition::CO1, 1 outQueueCO1; // element-wise tensor AscendC::TQueAscendC::TPosition::C1, 1 inQueueC1; AscendC::GlobalTensorfmap_T fmGlobal; AscendC::GlobalTensorweight_T weGlobal; AscendC::GlobalTensordst_T dstGlobal; AscendC::GlobalTensoruint64_t deqGlobal; AscendC::GlobalTensordstCO1_T biasGlobal; AscendC::GlobalTensorhalf eleWiseGlobal; uint16_t channelSize 32; uint16_t H 4, W 4; uint8_t Kh 2, Kw 2; uint16_t Cout; uint16_t C0, C1; uint8_t dilationH, dilationW; uint16_t coutBlocks, ho, wo, howo, howoRound; uint32_t featureMapA1Size, weightA1Size, featureMapA2Size, weightB2Size, biasSize, dstSize, dstCO1Size; uint16_t m, k, n; uint8_t fmRepeat, weRepeat; QuantMode_t deqMode QuantMode_t::NoQuant; }; #define KERNEL_CUBE_DATACOPY(dst_type, fmap_type, weight_type, dstCO1_type, CoutIn, dilationHIn, dilationWIn, deqModeIn) \ extern C __global__ __aicore__ void cube_datacopy_kernel_##fmap_type(__gm__ uint8_t* fmGm, __gm__ uint8_t* weGm, \ __gm__ uint8_t* biasGm, __gm__ uint8_t* deqGm, __gm__ uint8_t* eleWiseGm, __gm__ uint8_t* dstGm) \ { \ if (g_coreType AscendC::AIV) { \ return; \ } \ KernelCubeDataCopydst_type, fmap_type, weight_type, dstCO1_type op(CoutIn, dilationHIn, dilationWIn, \ deqModeIn); \ op.Init(fmGm, weGm, biasGm, deqGm, eleWiseGm, dstGm); \ op.Process(); \ } KERNEL_CUBE_DATACOPY(half, int8_t, int8_t, int32_t, 128, 1, 1, QuantMode_t::DEQF16);【免费下载链接】asc-devkit本项目是CANN 推出的昇腾AI处理器专用的算子程序开发语言原生支持C和C标准规范主要由类库和语言扩展层构成提供多层级API满足多维场景算子开发诉求。项目地址: https://gitcode.com/cann/asc-devkit创作声明：本文部分内容由AI辅助生成（AIGC），仅供参考

相关新闻

CANN数学算子库ops-math核心技术深度解析：从逐元素向量化并行到批量张量形态变换的昇腾NPU加速原理与工程实践全路径

洛雪音乐音源配置完全指南：3步打造个人无损音乐库

CANN Transformer算子库ops-transformer深度技术解析：FlashAttention与MC2通算融合在昇腾NPU上的全链路性能优化原理与工程实践

别再手动拟合了！用Matlab样条工具箱搞定复杂曲线，附完整代码

桌面数字伙伴革命：DyberPet如何让你的电脑桌面活起来

如何快速搞定FBX格式转换？FbxFormatConverter工具的终极使用指南

嵌入式通信协议设计：NXP ISF命令响应与流式传输详解

T5-Base终极指南：如何快速上手这个强大的文本生成模型

Summernote富文本编辑器技术架构解析与实战应用

深入S32K3时钟树：从FIRC到PLL，如何用S32DS为你的应用选对时钟源？

i.MX 6SoloX异构处理器开发实战：A9与M4协同、安全启动与性能优化

i.MX 7ULP异构处理器：架构解析与低功耗物联网开发实战

陪诊小程序开发玩法分析：全流程就医服务架构、匹配机制与落地方案

从“大通铺”到“写字楼”的链路层进化史

RAG 召回质量治理：用 Go 构建可调试的切片、检索与重排链路

从陌生到熟悉：Royal TSX中文汉化包的体验地图之旅

时延最优化设计

别再重启了！Windows 11下dwm.exe内存飙升，我用Intel官方工具升级显卡驱动搞定