A2 Vec-Only Kernel Authoring

[Free download link] cannbot-skills: CANNBot is a family of agents for improving CANN development efficiency; this repository provides reusable Skills modules for it. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when writing or debugging a pure vec kernel on a2 (`easyasc.a2`) with no cube stage. Typical targets are elementwise transforms, bit-level float analysis, scalar-threshold gating, and quantization-style postprocessing.

Do not use this file as the main guide for mixed cube/vec kernels. If cube is involved, start from `agent/references/constraints/a2-device.md` and the matching pattern file instead.

Goal

Capture the stable authoring rules for a2 vec-only kernels so that:

- the kernel body starts from the right minimal structure
- UB buffers are chosen intentionally
- `compare_scalar`, `select`, `reinterpret`, and `cast` are used with the repository's real semantics
- exact numeric contracts are not delegated to simulator rounding by accident

1. Use this layer when

This file is the right first read when:

- the whole kernel is `GM -> UB vec ops -> GM`
- there is no `vf`, no `Reg`, and no cube handoff
- the logic is mostly elementwise, flag-driven, or bit-driven
- the output contract depends on thresholding, saturation, or explicit rounding

Read another file first when:

- you need row-wise reductions or narrow-broadcast arithmetic
  - then also read `agent/references/constraints/vec-reduction-a2.md` and `agent/references/constraints/vec-stride.md`
- you need explicit vec mask behavior
  - then read `agent/references/constraints/mask.md`
- you need cube -> vec or vec -> cube ownership
  - then read `agent/references/constraints/a2-device.md` and the matching file under `agent/references/patterns/`

2. Minimal kernel skeleton

Stable pure-vec structure on a2:

    @kernel()
    def vec_kernel(x: GMTensor, y: GMTensor, total: Var):
        # UB-resident buffers: input tile, compute scratch, compare/select flags
        data = Tensor(DT.float, [1, TILE], Position.UB)
        work = Tensor(DT.float, [1, TILE], Position.UB)
        flag = Tensor(DT.uint8, [1, TILE], Position.UB)

        with vec_scope():
            # This vec core owns tiles in [tile_start, tile_end)
            n_tiles = CeilDiv(total, TILE)
            tile_per_core = CeilDiv(n_tiles, GetVecNum())
            tile_start = Var(tile_per_core * GetVecIdx())
            tile_end = Min(tile_start + tile_per_core, n_tiles)

            # One-time constant initialization
            dup(...)

            with auto_sync():
                for t in range(tile_start, tile_end):
                    n1 = Var(t * TILE)
                    # Tail-safe width of this tile
                    n_valid = Min(total - n1, TILE)

                    data <<= x[n1:n1 + n_valid]
                    # vec compute on UB
                    y[n1:n1 + n_valid] <<= work

What this skeleton gets right:

- `vec_scope()` decides tile ownership across vec lanes before the loop
- constants are initialized once with `dup(...)`
- the inner loop keeps all work in UB
- tail handling stays local through `n_valid`

3. UB buffer selection rules

For pure vec kernels, prefer plain `Tensor(..., Position.UB)` by default. Do not start from `DBuff` unless you truly need staged overlap or lookahead.

Useful buffer categories:

- data tiles: `Tensor(DT.float, [1, TILE], Position.UB)`
- temporary compute buffers: same dtype and shape as the data tile
- compare/select flags: `Tensor(DT.uint8, [1, TILE], Position.UB)`
- bit masks for reinterpret paths: `Tensor(DT.uint32, [1, TILE], Position.UB)` or another width-matched integer view
- final integer staging for exact rounding: `Tensor(DT.int, [1, TILE], Position.UB)`

Practical rule:

- if the whole tile is consumed and produced once per loop iteration, `Tensor` is usually enough
- if a buffer lifetime crosses iterations or producer/consumer stages, reconsider the topology before adding double buffering

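The tile-ownership arithmetic in the section 2 skeleton can be checked on the host before any kernel is compiled. The following is a minimal pure-Python sketch, not repository code: `ceil_div` and `tile_plan` are hypothetical helper names, and plain integers stand in for `Var`, `CeilDiv`, `Min`, `GetVecNum()`, and `GetVecIdx()`.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division (host-side stand-in for CeilDiv)."""
    return -(-a // b)


def tile_plan(total: int, tile: int, vec_num: int):
    """Return, for each vec core index, the (n1, n_valid) slices it would process.

    Mirrors the skeleton: contiguous tile ranges per core, tail handled by n_valid.
    """
    n_tiles = ceil_div(total, tile)
    tile_per_core = ceil_div(n_tiles, vec_num)
    plan = []
    for vec_idx in range(vec_num):
        tile_start = tile_per_core * vec_idx
        tile_end = min(tile_start + tile_per_core, n_tiles)
        slices = []
        # range() is empty when tile_start >= tile_end, so idle cores get no work
        for t in range(tile_start, tile_end):
            n1 = t * tile
            n_valid = min(total - n1, tile)
            slices.append((n1, n_valid))
        plan.append(slices)
    return plan


# Illustrative sizes only: 1300 elements, TILE = 512, two vec cores.
plan = tile_plan(total=1300, tile=512, vec_num=2)
covered = sum(n_valid for core in plan for _, n_valid in core)
```

With these illustrative sizes, the first core gets two full tiles and the second gets the 276-element tail, and the valid widths sum to `total`. Checking that sum for a few awkward sizes is a cheap way to confirm that no separate tail kernel is needed.
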
4. Stable vec control idioms

4.1 `compare_scalar` + `select`

Use `compare_scalar` to build `uint8` flag tensors, then use `select` to route values.

Important repository behavior:

- `compare_scalar(...)` ignores the current vec mask
- `select(...)` also ignores the current vec mask
- selection is controlled only by the explicit `uint8` flag tensor
- on current a2 hardware/runtime, do not rely on `uint8 -> float` casts for compare flags; keep mask-controlled float paths in `compare_scalar(...)` + `select(...)`

This makes them the stable control-flow building blocks for pure vec kernels.

Typical uses:

- finite vs non-finite split
- underflow / overflow gating
- sign-dependent bias selection
- replacing invalid values before a bit reinterpret path

4.2 Non-finite guarding

If the later path assumes finite floats, sanitize first:

    absub <<= x.abs()
    compare_scalar(finiteflag, absub, FLOAT32_FINITE_MAX, CompareMode.LE)
    select(workub, finiteflag, x, 0.0)

Then restore the original non-finite values at the end:

    select(outub, finiteflag, outub, x)

This avoids pushing `NaN` / `Inf` through exponent extraction or scale math while keeping the control constant finite.

5. Bit-level float analysis with `reinterpret`

For exponent/mantissa logic, use `reinterpret(...)` instead of float arithmetic guesses.

Stable pattern from `agent/example/kernels/a2/to_hif8_torch.py`:

    x_u16 = workub.reinterpret(DT.uint16)
    exp_u16 = expub.reinterpret(DT.uint16)
    mask_u16 = expmask.reinterpret(DT.uint16)
    vand(exp_u16, x_u16, mask_u16)

Useful rules:

- `reinterpret` is a view change, not a numeric cast
- it rescales the second dimension by the dtype-width ratio
- it is legal on UB here
- it does not support `L0C`

When extracting absolute exponent-style metadata:

- use `vnot` + `vand` on the reinterpreted integer view
- then reinterpret to `DT.int` or another arithmetic dtype only after the bit pattern is where you want it

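The view-change semantics of `reinterpret` can be illustrated on the host with standard-library Python. This is only an analogy, not the easyasc implementation: `memoryview.cast` plays the role of `reinterpret`, the sample values are made up, and the `0x7F80` mask is an assumed float32 exponent mask for the upper 16-bit half, not the value held in the repository's `expmask`.

```python
import sys
from array import array

# Host-side stand-in for a [1, 4] float32 tile (illustrative values).
tile_f32 = array("f", [1.0, -2.5, 0.15625, 1024.0])

# View change only: the same bytes seen as 16-bit unsigned integers.
# The element count doubles, matching the dtype-width ratio 32 / 16.
view_u16 = memoryview(tile_f32).cast("B").cast("H")

# On a little-endian host the sign/exponent half of each float32
# is the second 16-bit element of the pair.
hi_index = 1 if sys.byteorder == "little" else 0

EXP_MASK_HI = 0x7F80  # assumed mask: the 8 exponent bits inside the upper half

unbiased_exponents = []
for i in range(len(tile_f32)):
    hi = view_u16[2 * i + hi_index]
    biased = (hi & EXP_MASK_HI) >> 7  # integer AND, as vand does on the integer view
    unbiased_exponents.append(biased - 127)
```

Here `unbiased_exponents` comes out as `[0, 1, -3, 10]`, and no float arithmetic touches the values on the way. The sign bit is simply masked away, which is why the same extraction works for negative inputs.
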
6. Exact rounding: do not over-trust default vec `cast`

Default `cast(...)` is convenient for ordinary dtype conversion, but it should not be treated as proof of a higher-level numeric contract.

When the formula explicitly requires a rounding rule such as:

- `sign(x) * floor(abs(x) + 0.5)`
- round-half-away-from-zero
- quantization followed by scale restore

prefer an explicit sequence.

Stable sequence:

    outub <<= x / scale
    compare_scalar(nonnegflag, outub, 0.0, CompareMode.GE)
    select(biasub, nonnegflag, plus_halfub, -0.5)
    outub <<= outub + biasub
    cast(intub, outub, round_mode=RoundMode.TRUNC)
    cast(outub, intub)
    outub <<= outub * scale

Why this is safer:

- the sign-dependent `+0.5 / -0.5` encodes the formula directly
- `RoundMode.TRUNC` is only used for the final integer drop
- the result no longer depends on the simulator's interpretation of a more implicit rounding mode

Practical rule:

- use direct `cast(dst, src)` when the formula only needs a normal dtype conversion
- use an explicit bias + `TRUNC` path when the rounding rule itself is part of the contract
- if the decision came from a `uint8` compare flag, materialize the float branch with `select(...)`; do not plan a follow-up `uint8 -> float` cast

7. Tile-size and tail heuristics

For float vec-only kernels, `TILE = 512` is a good default starting point:

- simple to reason about
- comfortably small for a2 UB
- large enough to amortize fixed per-tile work

For tail handling:

- keep `n_valid = Min(total - n1, TILE)`
- load/store through GM slices using that tail width
- avoid adding a separate tail kernel unless the contract truly needs special handling

Do not optimize tile size first. Get the contract right with one simple tile size, then revisit only if UB pressure or runtime suggests it.

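Returning to the explicit rounding sequence of section 6, a scalar host-side model makes it easy to see why the bias + `TRUNC` path pins down the contract. The sketch below is pure Python, not kernel code: `round_half_away` and `formula_reference` are hypothetical names, the branch on `q >= 0.0` stands in for `compare_scalar(..., CompareMode.GE)` plus `select`, and `math.trunc` stands in for `cast(..., round_mode=RoundMode.TRUNC)`.

```python
import math


def round_half_away(x: float, scale: float) -> float:
    """Scalar model of the explicit sequence: divide, bias by sign, truncate, rescale."""
    q = x / scale
    bias = 0.5 if q >= 0.0 else -0.5  # models compare_scalar(GE) + select
    i = math.trunc(q + bias)          # models the TRUNC cast to integer
    return float(i) * scale           # models the cast back and the scale restore


def formula_reference(x: float, scale: float) -> float:
    """Direct transcription of sign(q) * floor(abs(q) + 0.5), then scale restore."""
    q = x / scale
    return math.copysign(math.floor(abs(q) + 0.5), q) * scale


samples = [0.5, 1.5, 2.5, -0.5, -2.5, 2.4, -2.6]
explicit = [round_half_away(v, 1.0) for v in samples]
half_even = [float(round(v)) for v in samples]  # Python's built-in rounds half to even
```

On the tie values the two rules disagree: `round_half_away(2.5, 1.0)` gives `3.0`, while the built-in `round(2.5)` gives `2`. That is exactly the kind of difference an implicit rounding mode can hide, and it is the reason to spell out the bias when the rounding rule is part of the contract.
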
8. When a vec-only kernel stops being simple

Escalate to another focused file when you hit one of these signs:

- wide `[M, 128]` buffers interacting with narrow `[M, 8]` buffers
  - read `agent/references/constraints/vec-stride.md`
- row max / row sum / online normalization
  - read `agent/references/constraints/vec-reduction-a2.md`
- temporary partial masks or masked writeback behavior
  - read `agent/references/constraints/mask.md`
- cross-stage workspace reuse or delayed consumer logic
  - read `agent/references/constraints/a2-device.md` and the matching pattern under `agent/references/patterns/`

9. Concrete examples

Study first:

- `agent/example/kernels/a2/to_hif8_torch.py`

Study carefully but do not copy blindly:

- `agent/example/demo/a2/a2_hif8.py`

Why the demo is not enough for exact-contract work:

- it is useful for exponent extraction and threshold structure
- but it relies on a simpler cast/store path
- and it is not the best source when the exact PyTorch rounding contract must be preserved

Authoring statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.