# CANNBot Skills Kernel Index

cannbot-skills: CANNBot is a family of agents for CANN development that aims to improve development efficiency; this repository provides reusable Skills modules for it. Project address: https://gitcode.com/cann/cannbot-skills

Authoring note: parts of this article were generated with AI assistance (AIGC) and are for reference only.

Use this file to filter down to ≤3 candidate kernels before opening `kernel-catalog.md`. Fastest path for agents:

- `conda run -n torch210npu python agent/scripts/select_kernel_example.py --query "<formula or task>" --topology <topology> --limit 3 --catalog`
- use this markdown table when you want a manual filter or the tool query is still too vague

Each row gives device, topology, path, and a one-line formula hint. For `study_for` and `do_not_copy_when`, read the matching entry in `kernel-catalog.md`. For machine-readable use, see `agent/index/kernels.json`.

## How to use

1. Pick the rows whose `device` matches your target (a2 or a5).
2. Narrow by `topology`: cube-only / cube -> vec / vec -> cube / vec -> cube -> vec / vec -> cube -> vec -> cube -> vec / cube -> vec -> cube / cube -> vec -> cube -> vec / vec-only / micro-only.
3. Narrow by formula shape: pure matmul vs with postprocess, with reduction, with softmax, with online-accumulation, quantized, causal, etc.
4. For each remaining candidate path, jump straight into `kernel-catalog.md` with Grep on the filename (e.g. `^### .kernels/a2/flash_attn_full\.py.`); do not scroll. Read only that one entry, and stop after `study_for` / `do_not_copy_when` unless you still need deeper notes.
5. Open the source file only after the catalog `study_for` / `do_not_copy_when` confirms the candidate.

## Vec-only and micro references

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a2 | vec-only | `agent/example/kernels/a2/to_hif8_torch.py` | `to_hif8_torch(x)`: emulated hif8 round, saturation sentinels |
| a2 | vec-only | `agent/example/kernels/a2/sort_rows.py` | per-row `torch.sort(x, dim=-1)` for `[ROWS=40, COLS=4096]` |
| a5 | vec-only | `agent/example/kernels/a5/chunk_row_cumsum.py` | chunked row-recursive cumsum |
| a5 | vec-only | `agent/example/kernels/a5/recurrent_state_attn_vec.py` | recurrent attention-state update, `D=128` |
| a5 | vec-only | `agent/example/kernels/a5/vec_unaligned_gm_to_ub_pad.py` | `exp(x) + 2` on padded unaligned GM width |
| a5 | micro-only | `agent/example/kernels/a5/micro_cast_fp8_pack4_dual.py` | `src.to(float8_e5m2)` via `micro` |

## Cube-only

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a5 | cube-only | `agent/example/kernels/a5/matmul_float_mmad.py` | `z = x @ y.t()`, shortest cube baseline |
| a5 | cube-only | `agent/example/kernels/a5/matmul_e5m2_shortcut.py` | `z = x.float() @ y.float().t()` with fp8 inputs |
| a5 | cube-only | `agent/example/kernels/a5/matmul_kmkn_fp32_out.py` | `z = x.float().t() @ y.float()` (KM @ KN -> MN) |
| a5 | cube-only | `agent/example/kernels/a5/matmul_mknk_2dgrid_splitn.py` | `z = x @ y.t()` with `splitn` and 2D core grid |
| a5 | cube-only | `agent/example/kernels/a5/matmul_mknk_2dgrid_splitk.py` | `z = x @ y.t()` with `splitk` for large-K |
| a2 | cube-only | `agent/example/kernels/a2/qk_matmul_batched.py` | `qk = q.float() @ k.float().t()` with batched BH flatten |
| a2 | cube-only | `agent/example/kernels/a2/attn_backward_dense_stage1_tail_dbuf.py` | `qk = q.float() @ k.float().t()`, DBuff tail variant |

## Cube -> vec (postprocess on a5)

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a5 | cube -> vec | `agent/example/kernels/a5/basic_cube_vec_mix.py` | `z = abs(x @ y.t()) + 1.0`, smallest mixed baseline |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_half_splitn_bias10p2_vf.py` | `((x @ y) + 10.2).half()`, bias and half output via `vf` |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_rowwise_norm.py` | `z = (x @ y.t()) / row_sum(x @ y.t())` |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_rowwise_norm_large_nk.py` | same as rowwise_norm, larger N/K |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_rowwise_l2_norm.py` | L2-normalized matmul output |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_chunk_absmax_norm128.py` | per-row absmax normalize over 128-column chunks |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_kmkn_blockwise_quant128.py` | `x.float().t() @ y.float()` with blockwise-128 quant |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_mknk_2dgrid_splitk_add1.py` | `x @ y.t() + 1.0` with `splitk` |
| a5 | cube -> vec (dual-output atomic) | `agent/example/kernels/a5/cube_vec_atomic_add_two_outputs.py` | `out_cube = x @ y.t()` with atomics, two sinks |

## Vec -> cube (preprocess on a5)

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a5 | vec -> cube | `agent/example/kernels/a5/vec_cube_abs_sqrt_matmul.py` | `z = abs(x).sqrt() @ y.t()` |
| a5 | vec -> cube | `agent/example/kernels/a5/vec_cube_abs_sqrt_matmul_nz.py` | same as above, NZ-published |
| a5 | vec -> cube | `agent/example/kernels/a5/recompute_wu_cube_vec.py` | `k_cumdecay = attn @ (k_beta * decay_exp)` |

## Vec -> cube -> vec fusion (a5)

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a5 | vec -> cube -> vec | `agent/example/kernels/a5/vec_cube_vec_scale2_abs_add1_matmul.py` | `abs((x*2).half() @ y.t()) + 1.0` |

## Vec -> cube -> vec -> cube -> vec state bridge (a5)

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a5 | vec -> cube -> vec -> cube -> vec | `agent/example/kernels/a5/delta_h_state_bridge_v1_c8.py` | aligned `delta_h` baseline with persistent state snapshots and delayed state update |
| a5 | vec -> cube -> vec -> cube -> vec | `agent/example/kernels/a5/delta_h_psudo_state_bridge_c8.py` | pseudo-reference comparison on the same stable state-bridge schedule |

## Cube -> vec -> cube -> vec lookahead (a5, MLA / MHA style)

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/test_mla_entire.py` | streamed MLA: score, softmax, delayed `p @ k_nope`, final normalize |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/mha_ifa.py` | streamed single-row `softmax(q @ k.t()) @ v` |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/mha_ifa_256.py` | same, `BASES=256` |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/mha_ifa_fp8_scale_256.py` | fp8 q/k/v, fp8-scaled `p` tiles, `BASES=256` |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/flash_attn_full_fp8_causal.py` | multi-row causal full attention, fp8 q/k/v and fp8 `p` tiles, tail-safe `S1/S2` |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/mha_ifa_nz.py` | same, NZ-published probability tiles |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/mha_ifa_nz_256.py` | same, `BASES=256`, NZ |

## a2 mixed-pipeline (GM workspace bridges)

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a2 | cube -> vec (single GM bridge) | `agent/example/kernels/a2/attn_backward_dense_stage12_tail.py` | `qk = q.float() @ k.float().t()` stage-12 with tail |
| a2 | cube -> vec (single GM bridge) | `agent/example/kernels/a2/flash_attn_score.py` | `exp(Q @ K^T / sqrt(D) - row_max)` cast to half |
| a2 | cube -> vec (single GM bridge, running max) | `agent/example/kernels/a2/flash_attn_score_iter.py` | same, with cross-tile running row_max |
| a2 | cube -> vec -> cube | `agent/example/kernels/a2/attn_backward_dense_total_tail.py` | dense attn-backward with tail |
| a2 | cube -> vec -> cube | `agent/example/kernels/a2/attn_backward_dense_total_tail_causal.py` | same, causal masking |
| a2 | cube -> vec -> cube | `agent/example/kernels/a2/attn_backward_dense_total_tail_causal_hif8.py` | same, hif8 probability path |
| a2 | cube -> vec -> cube (double GM bridge, one-tile lookahead) | `agent/example/kernels/a2/flash_attn_score_pv.py` | `score_j = q @ k_j.t() * scale` with delayed `p @ v` |
| a2 | cube -> vec -> cube -> vec (triple GM bridge, one-tile lookahead) | `agent/example/kernels/a2/flash_attn_unnorm.py` | unnormalized flash-attn numerator |
| a2 | cube -> vec -> cube -> vec (triple GM bridge, final vec divide) | `agent/example/kernels/a2/flash_attn_full.py` | full flash-attn with running sum and final divide |
| a2 | cube -> vec -> cube -> vec (triple GM bridge, hif8 stage-1 vec path) | `agent/example/kernels/a2/flash_attn_full_pj_hif8.py` | same math as `flash_attn_full.py`, hif8 probability |
| a2 | cube -> vec -> cube -> vec (hif8 diagonal causal mask, shared slot buffer) | `agent/example/kernels/a2/flash_attn_full_pj_hif8_causal.py` | same as hif8 variant, causal future-tile skip |
| a2 | cube -> vec -> cube -> vec (half probability, block-32 diagonal causal) | `agent/example/kernels/a2/flash_attn_full_pj_half_block32_causal.py` | same math, half `p`, block-32 causal |
| a2 | cube -> vec -> cube -> vec (shared vec-side slot buffer for score and pv tiles) | `agent/example/kernels/a2/flash_attn_full_pj_hif8_commonub.py` | same as hif8 variant with shared UB slot |

## Going deeper

- For `study_for` / `do_not_copy_when` detail on any single entry: open `agent/references/examples/kernel-catalog.md` at the matching `###` heading.
- For programmatic filtering: `agent/index/kernels.json`.
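
The manual filter described under "How to use" (match the device, then the topology, then the formula shape, and keep at most three candidates) can be sketched in a few lines of Python. This is only an illustration: the row keys (`device`, `topology`, `path`, `formula_hint`) and the helper names `stages` and `shortlist` are hypothetical, chosen to mirror the table columns, and are not the actual schema of `agent/index/kernels.json` or the logic of `select_kernel_example.py`.

```python
import re

# Hypothetical rows mirroring the table columns above; the real layout of
# agent/index/kernels.json may differ.
ROWS = [
    {"device": "a5", "topology": "cube-only",
     "path": "agent/example/kernels/a5/matmul_float_mmad.py",
     "formula_hint": "z = x @ y.t()"},
    {"device": "a5", "topology": "cube -> vec",
     "path": "agent/example/kernels/a5/basic_cube_vec_mix.py",
     "formula_hint": "z = abs(x @ y.t()) + 1.0"},
    {"device": "a5", "topology": "cube -> vec",
     "path": "agent/example/kernels/a5/matmul_rowwise_norm.py",
     "formula_hint": "z = (x @ y.t()) / row_sum(x @ y.t())"},
    {"device": "a5", "topology": "cube -> vec",
     "path": "agent/example/kernels/a5/matmul_rowwise_l2_norm.py",
     "formula_hint": "L2-normalized matmul output"},
    {"device": "a5", "topology": "cube -> vec (dual-output atomic)",
     "path": "agent/example/kernels/a5/cube_vec_atomic_add_two_outputs.py",
     "formula_hint": "out_cube = x @ y.t() with atomics, two sinks"},
    {"device": "a2", "topology": "cube -> vec (single GM bridge)",
     "path": "agent/example/kernels/a2/flash_attn_score.py",
     "formula_hint": "exp(Q @ K^T / sqrt(D) - row_max) cast to half"},
]


def stages(topology):
    """Return the ordered stage names of a topology label.

    Any parenthesised note is ignored, and both "cube -> vec" and
    "cube - vec" spellings yield ("cube", "vec"). Labels such as
    "vec-only" have no spaces around the hyphen and stay whole.
    """
    base = topology.split("(")[0].strip()
    return tuple(re.split(r"\s+-+>?\s+", base))


def shortlist(rows, device, topology, keyword="", limit=3):
    """Keep rows matching device and topology whose hint contains keyword."""
    want = stages(topology)
    hits = [
        row for row in rows
        if row["device"] == device
        and stages(row["topology"]) == want
        and keyword.lower() in row["formula_hint"].lower()
    ]
    return hits[:limit]


for row in shortlist(ROWS, device="a5", topology="cube -> vec", keyword="row_sum"):
    print(row["path"])
```

Comparing stage tuples rather than raw strings is a design choice of this sketch: it lets annotated labels such as "cube -> vec (single GM bridge)" match a plain "cube -> vec" query. Each surviving path would then be looked up in `kernel-catalog.md` before the source file is opened.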