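YOLOv26 Improvements | Adding Attention Mechanisms | Implementing Cascaded Group Attention (CGAttention) to Improve the C2PSA Module (with an exclusive network structure diagram)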
Before we start, a quick word about my column: it covers classification, detection, segmentation, tracking, and keypoint detection, and is currently offered at a limited-time discount. It is updated with 5-7 new mechanisms every week. Subscribers also get the files for all of my improvements and access to a discussion group, where I regularly share methods and experience for publishing papers.

1. Introduction

The improvement presented in this article is Cascaded Group Attention (CascadedGroupAttention, CGA). Its main idea is to increase the diversity of the features fed to the attention heads. Unlike earlier self-attention, it gives each head a different split of the input and cascades the output features across heads. This reduces the computational redundancy of multi-head attention and also raises model capacity by increasing network depth. In my own tests on a 25-class dataset, most classes improved and only a few stayed unchanged. The mechanism also leaves room for secondary innovation. You are welcome to subscribe to my column and learn YOLO together.

Column link: YOLOv26 effective improvements column (covers Conv, attention mechanisms, backbone, loss functions, optimizers, post-processing, and other improvements)

Contents
1. Introduction
2. Basic principles of CascadedGroupAttention
3. Core code of CGA
4. How to add CGA
4.1 Modification 1
4.2 Modification 2
4.3 Modification 3
4.4 Modification 4
4.5 Modification 5
4.6 Modification 6
5. Training
5.1 yaml files
5.1.1 yaml file 1
5.1.2 yaml file 2
5.2 Training code
5.3 Training process screenshot
6. Summary

2. Basic principles of CascadedGroupAttention

Official paper: EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
Official code: https://github.com/microsoft/Cream/blob/ef68993c764f241a768cd69a087ed567dec6cb40/EfficientViT/classification/model/efficientvit.py#L104-L181

Cascaded Group Attention (CGA) is an attention mechanism proposed in the paper EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention. Its core idea is to increase the diversity of the features fed to the attention heads. Unlike earlier self-attention, it gives each head a different split of the input and cascades the output features across heads. This reduces the computational redundancy of multi-head attention and raises model capacity by increasing network depth.

Concretely, CGA splits the input features along the channel dimension into several parts and feeds one part to each attention head. Each head computes its own self-attention map. The outputs of all heads are then concatenated and projected back to the input dimension by a linear layer. In this way CGA improves computational efficiency without adding extra parameters. In addition, the heads are chained: the output of each head is added to the input of the next head, so the feature representation is refined step by step.

The advantages of Cascaded Group Attention include:
1. It increases the diversity of the attention maps.
2. It reduces computational redundancy, because the QKV layers have fewer input and output channels.
3. It increases network depth and therefore model capacity, while adding only a small latency overhead, because the Q and K channel dimensions of each head are small.

[Figure: architecture of the Cascaded Group Attention (CGA) module in EfficientViT; CGA is panel (c)]

The figure shows the architecture of the CGA module in the EfficientViT model. The CGA module is panel (c) of the figure; its job is to process the input features and provide a staged (cascaded) attention mechanism. Inside the module, the input is first split into several parts, one per attention head. Each head computes self-attention on its own part (from the second head on, with the previous head's output added to its input) and produces an output. The outputs of all heads are then concatenated and passed through a linear projection layer to form the final output. This design lets the model capture features at different levels, strengthens feature interaction through the cascade, and improves computational efficiency.

The key point of cascaded group attention is that each head attends to only one part of the input, and the outputs of all heads are then merged to obtain a complete feature representation. This reduces repeated computation and increases attention diversity, because different heads may focus on different aspects of the input. The approach improves memory and compute efficiency while maintaining or improving model performance.

3. Core code of CGA

See Section 4 for how to use the code.

```python
# https://github.com/microsoft/Cream/blob/ef68993c764f241a768cd69a087ed567dec6cb40/EfficientViT/classification/model/efficientvit.py#L104-L181
import itertools

import torch
from torch import nn

__all__ = ['C2PSA_CGA', 'LocalWindowAttention']


class Conv2d_BN(torch.nn.Sequential):
    """Conv2d (no bias) followed by BatchNorm2d."""

    def __init__(self, a, b, ks=1, stride=1, pad=0, dilation=1,
                 groups=1, bn_weight_init=1, resolution=-10000):
        super().__init__()
        self.add_module('c', torch.nn.Conv2d(
            a, b, ks, stride, pad, dilation, groups, bias=False))
        self.add_module('bn', torch.nn.BatchNorm2d(b))
        torch.nn.init.constant_(self.bn.weight, bn_weight_init)
        torch.nn.init.constant_(self.bn.bias, 0)

    @torch.no_grad()
    def switch_to_deploy(self):
        # Fold the BatchNorm statistics into a single Conv2d for inference.
        c, bn = self._modules.values()
        w = bn.weight / (bn.running_var + bn.eps) ** 0.5
        w = c.weight * w[:, None, None, None]
        b = bn.bias - bn.running_mean * bn.weight / \
            (bn.running_var + bn.eps) ** 0.5
        m = torch.nn.Conv2d(w.size(1) * self.c.groups, w.size(0), w.shape[2:],
                            stride=self.c.stride, padding=self.c.padding,
                            dilation=self.c.dilation, groups=self.c.groups)
        m.weight.data.copy_(w)
        m.bias.data.copy_(b)
        return m


class CascadedGroupAttention(torch.nn.Module):
    r""" Cascaded Group Attention.

    Args:
        dim (int): Number of input channels.
        key_dim (int): The dimension for query and key.
        num_heads (int): Number of attention heads.
        attn_ratio (int): Multiplier for the query dim for value dimension.
        resolution (int): Input resolution, correspond to the window size.
        kernels (List[int]): The kernel size of the dw conv on query.
    """

    def __init__(self, dim, key_dim, num_heads=8,
                 attn_ratio=4,
                 resolution=14,
                 kernels=[5, 5, 5, 5], ):
        super().__init__()
        self.num_heads = num_heads
        self.scale = key_dim ** -0.5
        self.key_dim = key_dim
        self.d = int(attn_ratio * key_dim)  # value channels per head
        self.attn_ratio = attn_ratio

        qkvs = []
        dws = []
        for i in range(num_heads):
            # Each head gets its own QKV projection over dim // num_heads channels.
            qkvs.append(Conv2d_BN(dim // (num_heads), self.key_dim * 2 + self.d, resolution=resolution))
            # Depthwise conv applied to the query of head i.
            dws.append(Conv2d_BN(self.key_dim, self.key_dim, kernels[i], 1, kernels[i] // 2,
                                 groups=self.key_dim, resolution=resolution))
        self.qkvs = torch.nn.ModuleList(qkvs)
        self.dws = torch.nn.ModuleList(dws)
        self.proj = torch.nn.Sequential(torch.nn.ReLU(), Conv2d_BN(
            self.d * num_heads, dim, bn_weight_init=0, resolution=resolution))

        # Learnable attention bias, shared by all position pairs with the same |dy|, |dx|.
        points = list(itertools.product(range(resolution), range(resolution)))
        N = len(points)
        attention_offsets = {}
        idxs = []
        for p1 in points:
            for p2 in points:
                offset = (abs(p1[0] - p2[0]), abs(p1[1] - p2[1]))
                if offset not in attention_offsets:
                    attention_offsets[offset] = len(attention_offsets)
                idxs.append(attention_offsets[offset])
        self.attention_biases = torch.nn.Parameter(
            torch.zeros(num_heads, len(attention_offsets)))
        self.register_buffer('attention_bias_idxs',
                             torch.LongTensor(idxs).view(N, N))

    @torch.no_grad()
    def train(self, mode=True):
        super().train(mode)
        if mode and hasattr(self, 'ab'):
            del self.ab
        else:
            # Cache the gathered biases for use outside training.
            self.ab = self.attention_biases[:, self.attention_bias_idxs]

    def forward(self, x):  # x (B,C,H,W)
        B, C, H, W = x.shape
        trainingab = self.attention_biases[:, self.attention_bias_idxs]
        feats_in = x.chunk(len(self.qkvs), dim=1)
        feats_out = []
        feat = feats_in[0]
        for i, qkv in enumerate(self.qkvs):
            if i > 0:  # add the previous output to the input
                feat = feat + feats_in[i]
            feat = qkv(feat)
            q, k, v = feat.view(B, -1, H, W).split([self.key_dim, self.key_dim, self.d], dim=1)  # B, C/h, H, W
            q = self.dws[i](q)
            q, k, v = q.flatten(2), k.flatten(2), v.flatten(2)  # B, C/h, N
            attn = (
                (q.transpose(-2, -1) @ k) * self.scale
                + (trainingab[i] if self.training else self.ab[i])
            )
            attn = attn.softmax(dim=-1)  # BNN
            feat = (v @ attn.transpose(-2, -1)).view(B, self.d, H, W)  # BCHW
            feats_out.append(feat)
        x = self.proj(torch.cat(feats_out, 1))
        return x


class LocalWindowAttention(torch.nn.Module):
    r""" Local Window Attention.

    Args:
        dim (int): Number of input channels.
        num_heads (int): Number of attention heads.
        attn_ratio (int): Multiplier for the query dim for value dimension.
        resolution (int): Input resolution.
        window_resolution (int): Local window resolution.
        kernels (List[int]): The kernel size of the dw conv on query.
    """

    def __init__(self, dim, num_heads=4,
                 attn_ratio=4,
                 resolution=14,
                 window_resolution=7,
                 kernels=[5, 5, 5, 5], ):
        super().__init__()
        key_dim = dim // 16  # must be scaled down by 16, otherwise an error is raised
        self.dim = dim
        self.num_heads = num_heads
        self.resolution = resolution
        assert window_resolution > 0, 'window_size must be greater than 0'
        self.window_resolution = window_resolution
        self.attn = CascadedGroupAttention(dim, key_dim, num_heads,
                                           attn_ratio=attn_ratio,
                                           resolution=window_resolution,
                                           kernels=kernels, )

    def forward(self, x):
        B, C, H, W = x.shape
        if H <= self.window_resolution and W <= self.window_resolution:
            x = self.attn(x)
        else:
            x = x.permute(0, 2, 3, 1)
            # Pad H and W up to a multiple of the window size.
            pad_b = (self.window_resolution - H % self.window_resolution) % self.window_resolution
            pad_r = (self.window_resolution - W % self.window_resolution) % self.window_resolution
            padding = pad_b > 0 or pad_r > 0

            if padding:
                x = torch.nn.functional.pad(x, (0, 0, 0, pad_r, 0, pad_b))

            pH, pW = H + pad_b, W + pad_r
            nH = pH // self.window_resolution
            nW = pW // self.window_resolution
            # window partition, BHWC -> B(nHh)(nWw)C -> BnHnWhwC -> (BnHnW)hwC -> (BnHnW)Chw
            x = x.view(B, nH, self.window_resolution, nW, self.window_resolution, C).transpose(2, 3).reshape(
                B * nH * nW, self.window_resolution, self.window_resolution, C
            ).permute(0, 3, 1, 2)
            x = self.attn(x)
            # window reverse, (BnHnW)Chw -> (BnHnW)hwC -> BnHnWhwC -> B(nHh)(nWw)C -> BHWC
            x = x.permute(0, 2, 3, 1).view(B, nH, nW, self.window_resolution, self.window_resolution,
                                           C).transpose(2, 3).reshape(B, pH, pW, C)
            if padding:
                x = x[:, :H, :W].contiguous()
            x = x.permute(0, 3, 1, 2)
        return x


def autopad(k, p=None, d=1):  # kernel, padding, dilation
    """Pad to 'same' shape outputs."""
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p


class Conv(nn.Module):
    """Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)."""

    default_act = nn.SiLU()  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        """Initialize Conv layer with given arguments including activation."""
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        """Apply convolution, batch normalization and activation to input tensor."""
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        """Apply convolution and activation without batch normalization (used after fusing)."""
        return self.act(self.conv(x))


class PSABlock(nn.Module):
    """
    PSABlock class implementing a Position-Sensitive Attention block for neural networks.

    This class encapsulates the functionality for applying attention and feed-forward neural network layers
    with optional shortcut connections. Here the attention is LocalWindowAttention (windowed CGA).

    Attributes:
        attn (LocalWindowAttention): Attention module.
        ffn (nn.Sequential): Feed-forward neural network module.
        add (bool): Flag indicating whether to add shortcut connections.

    Methods:
        forward: Performs a forward pass through the PSABlock, applying attention and feed-forward layers.
    """

    def __init__(self, c, attn_ratio=0.5, num_heads=4, shortcut=True) -> None:
        """Initializes the PSABlock with attention and feed-forward layers for enhanced feature extraction."""
        super().__init__()
        # Note: attn_ratio and num_heads are accepted for interface compatibility but are not
        # forwarded; LocalWindowAttention uses its own defaults.
        self.attn = LocalWindowAttention(c)
        self.ffn = nn.Sequential(Conv(c, c * 2, 1), Conv(c * 2, c, 1, act=False))
        self.add = shortcut

    def forward(self, x):
        """Executes a forward pass through PSABlock, applying attention and feed-forward layers to the input tensor."""
        x = x + self.attn(x) if self.add else self.attn(x)
        x = x + self.ffn(x) if self.add else self.ffn(x)
        return x


class C2PSA_CGA(nn.Module):
    """
    C2PSA module with attention mechanism for enhanced feature extraction and processing.

    This module implements a convolutional block with attention mechanisms to enhance feature extraction and
    processing capabilities. It includes a series of PSABlock modules for self-attention and feed-forward operations.

    Attributes:
        c (int): Number of hidden channels.
        cv1 (Conv): 1x1 convolution layer to reduce the number of input channels to 2*c.
        cv2 (Conv): 1x1 convolution layer to reduce the number of output channels to c.
        m (nn.Sequential): Sequential container of PSABlock modules for attention and feed-forward operations.

    Methods:
        forward: Performs a forward pass through the C2PSA module, applying attention and feed-forward operations.

    Notes:
        This module essentially is the same as PSA module, but refactored to allow stacking more PSABlock modules.
    """

    def __init__(self, c1, c2, n=1, e=0.5):
        """Initializes the C2PSA module with specified input/output channels, number of layers, and expansion ratio."""
        super().__init__()
        assert c1 == c2
        self.c = int(c1 * e)
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)
        self.cv2 = Conv(2 * self.c, c1, 1)

        self.m = nn.Sequential(*(PSABlock(self.c, attn_ratio=0.5, num_heads=self.c // 64) for _ in range(n)))

    def forward(self, x):
        """Processes the input tensor 'x' through a series of PSA blocks and returns the transformed tensor."""
        a, b = self.cv1(x).split((self.c, self.c), dim=1)
        b = self.m(b)
        return self.cv2(torch.cat((a, b), 1))


if __name__ == "__main__":
    # Generating Sample image
    image_size = (1, 64, 224, 224)
    image = torch.rand(*image_size)

    # Model
    model = C2PSA_CGA(64, 64)

    out = model(image)
    print(out.size())
```

4. How to add CGA

If you cannot follow the steps below or would rather not do them by hand, you can contact the author for the source code of the complete project with every improvement from this column already added, ready to train.

4.1 Modification 1

First, create the files. Under the ultralytics/nn folder, create a directory named Addmodules.

4.2 Modification 2

Inside the Addmodules folder, create a new .py file and paste in the core code from Section 3 of this article.

4.3 Modification 3

Next, create another .py file in the same directory, named __init__.py, and import our file inside it.

[Screenshot: contents of __init__.py]

4.4 Modification 4

Then open ultralytics/nn/tasks.py and import and register our module. This only has to be done once; if you also use my other improvements, you do not need to repeat this step.

[Screenshot: import added to tasks.py]

4.5 Modification 5

In ultralytics/nn/tasks.py, inside the parse_model function (at around line 1500), add the module at the position shown in the screenshot. This step requires some judgment of your own; if you get stuck, you can contact the author for a video tutorial.

[Screenshot: position in parse_model]

4.6 Modification 6

In ultralytics/nn/tasks.py, inside the parse_model function (at around line 1550), add the code below at the position shown in the screenshot. Make sure the position and indentation match exactly, otherwise errors are very likely.

[Screenshot: position in parse_model]

```python
elif m in {...}:  # replace ... with the name of the module from this article
    c2 = ch[f]
    args = [c2, *args]
```

5. Training

5.1 yaml files

5.1.1 yaml file 1

Training info: YOLO26-C2PSA-CGA summary: 271 layers, 2,479,040 parameters, 2,479,040 gradients, 5.8 GFLOPs

```yaml
# Ultralytics AGPL-3.0 License - https://ultralytics.com/license

# Ultralytics YOLO26 object detection model with P3/8 - P5/32 outputs
# Model docs: https://docs.ultralytics.com/models/yolo26
# Task docs: https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80 # number of classes
end2end: True # whether to use end-to-end mode
reg_max: 1 # DFL bins
scales: # model compound scaling constants, i.e. 'model=yolo26n.yaml' will call yolo26.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # summary: 260 layers, 2,572,280 parameters, 2,572,280 gradients, 6.1 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 260 layers, 10,009,784 parameters, 10,009,784 gradients, 22.8 GFLOPs
  m: [0.50, 1.00, 512] # summary: 280 layers, 21,896,248 parameters, 21,896,248 gradients, 75.4 GFLOPs
  l: [1.00, 1.00, 512] # summary: 392 layers, 26,299,704 parameters, 26,299,704 gradients, 93.8 GFLOPs
  x: [1.00, 1.50, 512] # summary: 392 layers, 58,993,368 parameters, 58,993,368 gradients, 209.5 GFLOPs

# YOLO26n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 2, C3k2, [256, False, 0.25]]
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 2, C3k2, [512, False, 0.25]]
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 2, C3k2, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 2, C3k2, [1024, True]]
  - [-1, 1, SPPF, [1024, 5, 3, True]] # 9
  - [-1, 2, C2PSA_CGA, [1024]] # 10

# YOLO26n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, C3k2, [512, True]] # 13

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, C3k2, [256, True]] # 16 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 13], 1, Concat, [1]] # cat head P4
  - [-1, 2, C3k2, [512, True]] # 19 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 10], 1, Concat, [1]] # cat head P5
  - [-1, 1, C3k2, [1024, True, 0.5, True]] # 22 (P5/32-large)

  - [[16, 19, 22], 1, Detect, [nc]] # Detect(P3, P4, P5)
```

5.1.2 yaml file 2

Training info: YOLO26-Att-CGA summary: 279 layers, 2,512,720 parameters, 2,512,720 gradients, 5.9 GFLOPs

```yaml
# Ultralytics AGPL-3.0 License - https://ultralytics.com/license

# Ultralytics YOLO26 object detection model with P3/8 - P5/32 outputs
# Model docs: https://docs.ultralytics.com/models/yolo26
# Task docs: https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80 # number of classes
end2end: True # whether to use end-to-end mode
reg_max: 1 # DFL bins
scales: # model compound scaling constants, i.e. 'model=yolo26n.yaml' will call yolo26.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.50, 0.25, 1024] # summary: 260 layers, 2,572,280 parameters, 2,572,280 gradients, 6.1 GFLOPs
  s: [0.50, 0.50, 1024] # summary: 260 layers, 10,009,784 parameters, 10,009,784 gradients, 22.8 GFLOPs
  m: [0.50, 1.00, 512] # summary: 280 layers, 21,896,248 parameters, 21,896,248 gradients, 75.4 GFLOPs
  l: [1.00, 1.00, 512] # summary: 392 layers, 26,299,704 parameters, 26,299,704 gradients, 93.8 GFLOPs
  x: [1.00, 1.50, 512] # summary: 392 layers, 58,993,368 parameters, 58,993,368 gradients, 209.5 GFLOPs

# YOLO26n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 2, C3k2, [256, False, 0.25]]
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 2, C3k2, [512, False, 0.25]]
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 2, C3k2, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 2, C3k2, [1024, True]]
  - [-1, 1, SPPF, [1024, 5, 3, True]] # 9
  - [-1, 2, C2PSA, [1024]] # 10

# YOLO26n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, C3k2, [512, True]] # 13

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, C3k2, [256, True]] # 16 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 13], 1, Concat, [1]] # cat head P4
  - [-1, 2, C3k2, [512, True]] # 19 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 10], 1, Concat, [1]] # cat head P5
  - [-1, 1, C3k2, [1024, True, 0.5, True]] # 22 (P5/32-large)

  - [16, 1, LocalWindowAttention, []] # 23
  # - [19, 1, LocalWindowAttention, []] # 24
  # - [22, 1, LocalWindowAttention, []] # 25

  # Usage note: of the three attention layers above, only layer 23 is enabled by default.
  # To use layer 24, uncomment its line and change 19 to 24 in the Detect head below.
  # Likewise, to use layer 25, uncomment its line and change 22 to 25 in the Detect head below.
  # This usage is fairly involved; if you get stuck, contact the blogger Snu77 for a video tutorial.
  - [[23, 19, 22], 1, Detect, [nc]] # Detect(P3, P4, P5)
```

5.2 Training code

Create a .py file, paste in the code below, set your own file paths, and run it.

```python
import warnings
warnings.filterwarnings('ignore')
from ultralytics import YOLO

if __name__ == '__main__':
    model = YOLO('path to the model config file, i.e. where you saved the yaml from Section 5.1')
    # To switch model scale, change the yaml name passed above: yolo26s.yaml uses the 26s scale.
    # Likewise, if an improved config is named yolo26-XXX.yaml, pass yolo26l-XXX.yaml to use the l scale.
    # Only the name passed to YOLO(...) changes, not the name of the config file on disk.
    # model.load('yolo26n.pt')  # optionally load pretrained weights; not recommended for research,
    #                           # because it makes it hard to show an accuracy gain
    model.train(data=r'path to the dataset file',
                # for other tasks, find the task setting in ultralytics/cfg/default.yaml and change it;
                # it can be detect, segment, classify, or pose
                cache=False,
                imgsz=640,
                epochs=20,
                single_cls=False,  # whether this is single-class detection
                batch=16,
                close_mosaic=0,
                workers=0,
                device='0',
                optimizer='MuSGD',  # using SGD/MuSGD
                # resume='',  # path to last.pt goes here
                amp=True,  # if the training loss becomes NaN, try turning amp off
                project='runs/train',
                name='exp',
                )
```

5.3 Training process screenshot

[Screenshot: training process]

6. Summary

That concludes the main content of this article. I recommend my YOLOv26 effective improvements column, which is newly opened and currently has an average quality score of 98. I will continue to reproduce papers from the latest top conferences and to fill in some older improvement mechanisms. If this article helped you, subscribe to the column and follow along for more updates.

Column link: YOLOv26 effective improvements column (covers Conv, attention mechanisms, backbone, loss functions, optimizers, post-processing, and other improvements)
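
As a rough sanity check on the channel and window arithmetic in the core code above, the following standard-library-only sketch recomputes the sizes that `LocalWindowAttention` and `CascadedGroupAttention` work with. The helper names `cga_channel_plan`, `window_plan`, and `bias_table_size` are hypothetical and introduced only for illustration; they are not part of the article's code or of Ultralytics. The example values assume the `n` scale (width 0.25, so the `C2PSA_CGA` layer has 256 channels and 128 hidden channels) and `imgsz=640` (so the P5 feature map is 20x20). The sketch also suggests a likely reason for the `key_dim = dim // 16` rule: with `attn_ratio=4` and `num_heads=4`, each head outputs `4 * (dim // 16)` channels, which equals the `dim // 4` channels of the next head's input chunk, so the cascade addition `feat + feats_in[i]` lines up.

```python
import itertools


def cga_channel_plan(dim, num_heads=4, attn_ratio=4):
    """Channel sizes implied by the defaults of LocalWindowAttention (hypothetical helper)."""
    key_dim = dim // 16                     # same rule as in LocalWindowAttention.__init__
    value_dim = int(attn_ratio * key_dim)   # self.d in CascadedGroupAttention
    chunk = dim // num_heads                # channels of the input split given to each head
    return {
        "key_dim": key_dim,
        "value_dim": value_dim,
        "chunk": chunk,
        "qkv_out": 2 * key_dim + value_dim,  # output channels of each per-head QKV conv
        # The cascade adds a head's output (value_dim channels) to the next
        # head's input chunk, so the two widths have to match.
        "cascade_ok": value_dim == chunk,
    }


def window_plan(h, w, window=7):
    """Padding and window count used by the window partition (hypothetical helper)."""
    pad_b = (window - h % window) % window
    pad_r = (window - w % window) % window
    n_windows = ((h + pad_b) // window) * ((w + pad_r) // window)
    return pad_b, pad_r, n_windows


def bias_table_size(resolution):
    """Number of unique (|dy|, |dx|) offsets and of index entries (hypothetical helper)."""
    points = list(itertools.product(range(resolution), range(resolution)))
    offsets = {}
    idxs = []
    for p1 in points:
        for p2 in points:
            offset = (abs(p1[0] - p2[0]), abs(p1[1] - p2[1]))
            if offset not in offsets:
                offsets[offset] = len(offsets)
            idxs.append(offsets[offset])
    return len(offsets), len(idxs)


print(cga_channel_plan(128))   # {'key_dim': 8, 'value_dim': 32, 'chunk': 32, 'qkv_out': 48, 'cascade_ok': True}
print(window_plan(20, 20))     # (1, 1, 9)
print(bias_table_size(7))      # (49, 2401)
```

With 8 heads and the same `key_dim` rule, `cga_channel_plan(128, num_heads=8)` reports `cascade_ok` as `False`, which is consistent with the shape error the code comment warns about. If you change `num_heads` or `attn_ratio`, the divisor in `key_dim` would presumably need to change with them.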