A lightweight crop pest identification method based on multi-head attention
Abstract
Objective: To address the problems that current pest and disease identification methods have many parameters and a large computational cost, and are therefore difficult to deploy on edge embedded devices, so as to realize accurate identification of crop pests and diseases and improve crop yield and quality.
Method: A lightweight convolutional network fusing multi-head attention, called multi-head attention to convolutional neural network (M2CNet), was proposed. M2CNet adopts a hierarchical pyramid structure. First, a local capture block is constructed by combining a depthwise separable residual and a cyclic fully connected residual to capture short-range information. Second, a lightweight global capture block is constructed by combining global subsampling attention and a lightweight feed-forward network to capture long-range information. Three variants, namely M2CNet-S, M2CNet-B and M2CNet-L, are provided to meet different edge deployment requirements.
Result: M2CNet-S/B/L have parameter sizes of 1.8M, 3.5M and 5.8M, and floating point operations (FLOPs) of 0.23G, 0.39G and 0.60G, respectively. M2CNet-S/B/L achieved Top-5 accuracy greater than 99.7% and Top-1 accuracy greater than 95.9% on the PlantVillage disease dataset, and Top-5 accuracy greater than 88.4% and Top-1 accuracy greater than 67.0% on the IP102 pest dataset, outperforming models of the same scale.
Conclusion: The proposed method identifies crop diseases and pests effectively and provides a valuable reference for edge-side engineering deployment.
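To relate the Method description to an implementation, the sketches in this section illustrate the main building blocks in PyTorch. They are minimal illustrations under stated assumptions, not the authors' code. First, the local capture block: a depthwise separable residual followed by a cyclic fully connected residual. The cyclic fully connected layer (the paper builds on CycleMLP here) is approximated below with 3×1 and 1×3 depthwise convolutions plus a pointwise projection, which is an assumption on our part.

```python
import torch
import torch.nn as nn

class LocalCaptureBlock(nn.Module):
    """Sketch of an LCB: depthwise separable residual + cyclic-FC-style residual.

    The cyclic fully connected layer is approximated by 3x1 and 1x3 depthwise
    convolutions and a 1x1 projection; the paper's CycleMLP-based operator
    may differ in detail.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Depthwise separable residual: 3x3 depthwise + 1x1 pointwise.
        self.dw_sep = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, 1),
            nn.GELU(),
        )
        # Cyclic-FC-style residual: directional 3x1 / 1x3 mixing + projection.
        self.cyclic_fc = nn.Sequential(
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0), groups=channels),
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1), groups=channels),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.dw_sep(x)      # short-range spatial mixing
        x = x + self.cyclic_fc(x)   # directional spatial/channel mixing
        return x

# Example: LocalCaptureBlock(36)(torch.randn(1, 36, 56, 56)) keeps the shape.
```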
Figure 1. Overall structure of the M2CNet network
LCB: local capture block; LGCB: lightweight global capture block; H and W denote the height and width of the input image, respectively; Ci is the number of channels used in stage i; Li is the number of local capture blocks and lightweight global capture blocks in stage i.
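To make the pyramid layout of Figure 1 concrete, here is a hedged sketch of the four-stage skeleton: each stage downsamples with a strided convolution, then stacks its capture blocks. The channel widths and depths default to the M2CNet-S column of Table 2; `block_fn` is a placeholder that stands in for one (LCB, LGCB) pair per repeat.

```python
import torch
import torch.nn as nn

def make_stage(in_ch, out_ch, depth, kernel, block_fn):
    """One pyramid stage: strided-conv downsampling, then `depth` capture blocks."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=kernel, stride=kernel)]
    layers += [block_fn(out_ch) for _ in range(depth)]
    return nn.Sequential(*layers)

class M2CNetSkeleton(nn.Module):
    """Four-stage hierarchical pyramid as in Figure 1 (M2CNet-S defaults)."""

    def __init__(self, block_fn, num_classes=100,
                 channels=(36, 72, 144, 288), depths=(1, 2, 3, 2)):
        super().__init__()
        in_chs = (3,) + channels[:-1]
        kernels = (4, 2, 2, 2)  # 4x4 stride-4 stem, then 2x2 stride-2 downsampling
        self.stages = nn.Sequential(*[
            make_stage(in_chs[i], channels[i], depths[i], kernels[i], block_fn)
            for i in range(4)
        ])
        self.head = nn.Linear(channels[-1], num_classes)

    def forward(self, x):
        x = self.stages(x)                    # (B, 3, 224, 224) -> (B, C4, 7, 7)
        return self.head(x.mean(dim=(2, 3)))  # global average pool + FC head

# Example (LocalCaptureBlock from the sketch above stands in for an LCB+LGCB pair):
# logits = M2CNetSkeleton(LocalCaptureBlock)(torch.randn(1, 3, 224, 224))  # (1, 100)
```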
Figure 3. Comparison of standard multi-head attention (a) and global subsampling attention (b)
Q, K and V denote the query, key and value, respectively; H and W denote the height and width of the input image, respectively; s is the sub-window size; C is the number of channels.
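Global subsampling attention cuts the cost of standard multi-head attention by shortening the key/value sequence: K and V are computed from an s×s-subsampled feature map, so attention costs roughly O((HW)²C/s²) instead of O((HW)²C). A minimal sketch under that reading follows; using a strided convolution as the subsampling operator is an assumption (pooling would be equally plausible).

```python
import torch
import torch.nn as nn

class GlobalSubsamplingAttention(nn.Module):
    """Sketch of GSA: queries come from every pixel, while keys/values come
    from an s x s subsampled map, shrinking the attended sequence by s^2."""

    def __init__(self, dim: int, heads: int, s: int):
        super().__init__()
        # Strided convolution as the subsampling operator (an assumption).
        self.sub = nn.Conv2d(dim, dim, kernel_size=s, stride=s) if s > 1 else nn.Identity()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)             # (B, H*W, C) full-resolution queries
        kv = self.sub(x).flatten(2).transpose(1, 2)  # (B, H*W/s^2, C) subsampled keys/values
        out, _ = self.attn(q, kv, kv, need_weights=False)
        return out.transpose(1, 2).reshape(b, c, h, w)

# Example: GlobalSubsamplingAttention(36, 1, 4)(torch.randn(1, 36, 56, 56))
```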
Figure 4. Lightweight global capture block
a: Conditional positional encoding; b: Global subsampling attention; c: Lightweight feed-forward network. di: channel dimension; s: sub-window size; h: number of attention heads; H and W denote the height and width of the input features, respectively.
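Putting Figure 4 together: a hedged sketch of the lightweight global capture block, combining conditional positional encoding (rendered here as a residual 3×3 depthwise convolution, CPVT-style), the GlobalSubsamplingAttention sketch above, and a lightweight feed-forward network with expansion ratio R. The pre-norm residual wiring is an assumption, not taken from the paper.

```python
import torch
import torch.nn as nn

class LightweightGlobalCaptureBlock(nn.Module):
    """Sketch of an LGCB: (a) conditional positional encoding, (b) global
    subsampling attention, (c) lightweight feed-forward network."""

    def __init__(self, dim: int, heads: int, s: int, ratio: int = 4):
        super().__init__()
        # (a) CPE as a residual 3x3 depthwise convolution (CPVT-style assumption).
        self.cpe = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.norm1 = nn.GroupNorm(1, dim)  # LayerNorm-like norm for 2D maps
        # (b) GSA, reusing the GlobalSubsamplingAttention sketch above.
        self.gsa = GlobalSubsamplingAttention(dim, heads, s)
        self.norm2 = nn.GroupNorm(1, dim)
        # (c) LFFN: 1x1 expand by `ratio`, GELU, 1x1 project back.
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, dim * ratio, 1),
            nn.GELU(),
            nn.Conv2d(dim * ratio, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.cpe(x)               # inject position information
        x = x + self.gsa(self.norm1(x))   # long-range token mixing
        x = x + self.ffn(self.norm2(x))   # per-location channel mixing
        return x
```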
Figure 6. Identification results on the pest and disease datasets
Bar width scales linearly with the number of model parameters: the more parameters, the wider the bar. Bars in the same color family belong to the same comparison group, and the darkest bar in each family corresponds to an M2CNet variant.
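The bar-width convention of Figure 6 can be reproduced with matplotlib's `width` argument; a minimal sketch using three rows taken from Table 3:

```python
import matplotlib.pyplot as plt

# Three (model, parameters in M, Top-1 accuracy %) rows taken from Table 3.
rows = [("MobileNet-V2", 2.4, 69.16), ("M2CNet-B", 3.5, 75.32), ("ResNet 18", 11.2, 76.85)]
names, params, acc = zip(*rows)

max_width = 0.8
widths = [max_width * p / max(params) for p in params]  # width linear in parameter count

fig, ax = plt.subplots()
ax.bar(range(len(rows)), acc, width=widths)
ax.set_xticks(range(len(rows)))
ax.set_xticklabels(names)
ax.set_ylabel("Top-1 accuracy / %")
plt.show()
```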
Table 1  Taxonomy of the IP102 dataset at different class levels

Crop        Pest classes   Training set   Test set
Rice        14             6734           1683
Corn        13             11212          2803
Wheat       9              2734           684
Sugarbeet   8              3536           884
Alfalfa     13             8312           2078
Grape       16             14041          3510
Orange      19             5818           1455
Mango       10             7790           1948
Total       102            60177          15045
Table 2  M2CNet-S/B/L network architecture 1)

Stage 1, output 56×56
  Conv. downsampling: 4×4, stride 4, channels 36 (S) / 48 (B) / 48 (L)
  [DW-sep conv 3×3, 1×1; cyclic FC 3×1, 1×3; GSA H1 = 1, s1 = 4; LFFN R1 = 4]: ×1 (S) / ×1 (B) / ×1 (L)
Stage 2, output 28×28
  Conv. downsampling: 2×2, stride 2, channels 72 (S) / 96 (B) / 96 (L)
  [DW-sep conv 3×3, 1×1; cyclic FC 3×1, 1×3; GSA H2 = 2, s2 = 2; LFFN R2 = 4]: ×2 (S) / ×1 (B) / ×2 (L)
Stage 3, output 14×14
  Conv. downsampling: 2×2, stride 2, channels 144 (S) / 192 (B) / 192 (L)
  [DW-sep conv 3×3, 1×1; cyclic FC 3×1, 1×3; GSA H3 = 4, s3 = 2; LFFN R3 = 4]: ×3 (S) / ×4 (B) / ×6 (L)
Stage 4, output 7×7
  Conv. downsampling: 2×2, stride 2, channels 288 (S) / 384 (B) / 384 (L)
  [DW-sep conv 3×3, 1×1; cyclic FC 3×1, 1×3; GSA H4 = 8, s4 = 1; LFFN R4 = 4]: ×2 (S) / ×2 (B) / ×4 (L)
Output, 1×1: fully connected, 100 classes
No. of parameters (M): 1.83 (S) / 3.52 (B) / 5.76 (L)
Floating point operations (G): 0.23 (S) / 0.39 (B) / 0.60 (L)

1) The input image size defaults to 224 pixels × 224 pixels; Conv. denotes a convolution operation and stride its step. Each bracketed block stacks a depthwise separable convolution, a multi-layer cyclic fully connected layer, a global subsampling attention (GSA) and a lightweight feed-forward network (LFFN); Hi and si are the number of heads and the subsampling size of the i-th global subsampling attention, and Ri is the feature-dimension scaling ratio of the i-th lightweight feed-forward network.
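The three variants in Table 2 differ only in channel widths and block counts per stage; one hedged way to encode them, reusing the classes from the earlier sketches (the stage-2 count of ×1 for M2CNet-B is as printed in the table):

```python
import torch.nn as nn

# Stage-wise widths and block counts read off Table 2; heads, subsampling
# sizes and the LFFN ratio R are shared across variants.
M2CNET_CONFIGS = {
    "S": dict(channels=(36, 72, 144, 288), depths=(1, 2, 3, 2)),
    "B": dict(channels=(48, 96, 192, 384), depths=(1, 1, 4, 2)),
    "L": dict(channels=(48, 96, 192, 384), depths=(1, 2, 6, 4)),
}
HEADS = (1, 2, 4, 8)      # H1..H4
SUBSAMPLE = (4, 2, 2, 1)  # s1..s4
FFN_RATIO = 4             # R1..R4 = 4 in all stages

def capture_pair(stage: int):
    """Block factory pairing one LCB with one LGCB for a given stage, usable
    with the skeleton sketched under Figure 1 (LocalCaptureBlock and
    LightweightGlobalCaptureBlock come from the earlier sketches); a
    per-stage skeleton would pass capture_pair(i) when building stage i."""
    def block_fn(dim: int) -> nn.Module:
        return nn.Sequential(
            LocalCaptureBlock(dim),
            LightweightGlobalCaptureBlock(dim, HEADS[stage], SUBSAMPLE[stage], FFN_RATIO),
        )
    return block_fn
```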
Table 3  Model comparison results on the CIFAR100 dataset

Model                Parameters/M   FLOPs/G   Top-5 accuracy/%   Top-1 accuracy/%
ShuffleNet-V2 0.5    0.4            0.04      72.74              41.83
ShuffleNet-V2 1.0    1.4            0.15      86.21              59.65
ShuffleNet-V2 1.5    2.6            0.30      90.08              66.56
ShuffleNet-V2 2.0    5.6            0.56      93.06              72.79
SqueezeNet 1.0       0.8            0.75      78.48              49.68
SqueezeNet 1.1       0.8            0.30      78.12              50.14
MobileNet-V3-Small   1.6            0.06      87.90              61.74
MobileNet-V2         2.4            0.31      91.69              69.16
MobileNet-V3-Large   4.3            0.23      93.57              73.27
MnasNet 0.5          1.1            0.11      88.13              62.60
MnasNet 0.75         2.0            0.22      91.44              69.20
MnasNet 1.0          3.2            0.32      92.81              72.70
MnasNet 1.3          5.1            0.54      94.41              76.64
EfficientNet B0      4.1            0.40      94.63              76.00
EfficientNet B1      6.6            0.60      94.95              77.96
ResNet 18            11.2           1.80      94.66              76.85
VGG 11               129.2          7.60      94.25              75.82
VGG 13               129.4          11.30     94.38              76.46
VGG 16               134.7          15.50     94.63              78.19
VGG 19               140.0          19.60     95.25              78.19
MobileViT-XXS        1.0            0.33      84.98              55.96
MobileViT-XS         2.0            0.90      89.55              64.34
MobileViT-S          5.1            1.75      93.64              72.93
M2CNet-S             1.8            0.23      92.46              71.09
M2CNet-B             3.5            0.39      94.16              75.32
M2CNet-L             5.8            0.60      95.31              78.39