A lightweight crop pest identification method based on multi-head attention
Abstract
Objective: To address the problems that current pest and disease identification methods have many parameters and a large computational cost, and are therefore difficult to deploy on edge embedded devices, so as to realize accurate identification of crop pests and diseases and improve crop yield and quality.
Method: A lightweight convolutional network fusing multi-head attention, called multi-head attention to convolutional neural network (M2CNet), was proposed. M2CNet adopts a hierarchical pyramid structure. First, a local capture block is constructed by combining a depthwise separable residual and a cyclic fully connected residual to capture short-range information. Second, a lightweight global capture block is constructed by combining global subsampling attention and a lightweight feed-forward network to capture long-range information. Three variants, namely M2CNet-S, M2CNet-B and M2CNet-L, are provided to meet different edge deployment requirements.
Result: M2CNet-S/B/L have parameter sizes of 1.8M, 3.5M and 5.8M, and floating point operations (FLOPs) of 0.23G, 0.39G and 0.60G, respectively. M2CNet-S/B/L achieved Top-5 accuracy greater than 99.7% and Top-1 accuracy greater than 95.9% on the PlantVillage disease dataset, and Top-5 accuracy greater than 88.4% and Top-1 accuracy greater than 67.0% on the IP102 pest dataset, outperforming models of the same scale.
Conclusion: The proposed method identifies crop diseases and pests effectively and provides a valuable reference for edge-side engineering deployment.
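To relate the Method description to an implementation, the sketches in this section illustrate the main building blocks in PyTorch. They are minimal illustrations under stated assumptions, not the authors' code. First, the local capture block: a depthwise separable residual followed by a cyclic fully connected residual. The cyclic fully connected layer (the paper builds on CycleMLP here) is approximated below with 3×1 and 1×3 depthwise convolutions plus a pointwise projection, which is an assumption on our part.

```python
import torch
import torch.nn as nn

class LocalCaptureBlock(nn.Module):
    """Sketch of an LCB: depthwise separable residual + cyclic-FC-style residual.

    The cyclic fully connected layer is approximated by 3x1 and 1x3 depthwise
    convolutions and a 1x1 projection; the paper's CycleMLP-based operator
    may differ in detail.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Depthwise separable residual: 3x3 depthwise + 1x1 pointwise.
        self.dw_sep = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, 1),
            nn.GELU(),
        )
        # Cyclic-FC-style residual: directional 3x1 / 1x3 mixing + projection.
        self.cyclic_fc = nn.Sequential(
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0), groups=channels),
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1), groups=channels),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.dw_sep(x)      # short-range spatial mixing
        x = x + self.cyclic_fc(x)   # directional spatial/channel mixing
        return x

# Example: LocalCaptureBlock(36)(torch.randn(1, 36, 56, 56)) keeps the shape.
```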
Figure 1. Overall structure of the M2CNet network
LCB: local capture block; LGCB: lightweight global capture block; H and W denote the height and width of the input image, respectively; Ci is the number of channels used in stage i; Li is the number of local capture blocks and lightweight global capture blocks in stage i.
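To make the pyramid layout of Figure 1 concrete, here is a hedged sketch of the four-stage skeleton: each stage downsamples with a strided convolution, then stacks its capture blocks. The channel widths and depths default to the M2CNet-S column of Table 2; `block_fn` is a placeholder that stands in for one (LCB, LGCB) pair per repeat.

```python
import torch
import torch.nn as nn

def make_stage(in_ch, out_ch, depth, kernel, block_fn):
    """One pyramid stage: strided-conv downsampling, then `depth` capture blocks."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=kernel, stride=kernel)]
    layers += [block_fn(out_ch) for _ in range(depth)]
    return nn.Sequential(*layers)

class M2CNetSkeleton(nn.Module):
    """Four-stage hierarchical pyramid as in Figure 1 (M2CNet-S defaults)."""

    def __init__(self, block_fn, num_classes=100,
                 channels=(36, 72, 144, 288), depths=(1, 2, 3, 2)):
        super().__init__()
        in_chs = (3,) + channels[:-1]
        kernels = (4, 2, 2, 2)  # 4x4 stride-4 stem, then 2x2 stride-2 downsampling
        self.stages = nn.Sequential(*[
            make_stage(in_chs[i], channels[i], depths[i], kernels[i], block_fn)
            for i in range(4)
        ])
        self.head = nn.Linear(channels[-1], num_classes)

    def forward(self, x):
        x = self.stages(x)                    # (B, 3, 224, 224) -> (B, C4, 7, 7)
        return self.head(x.mean(dim=(2, 3)))  # global average pool + FC head

# Example (LocalCaptureBlock from the sketch above stands in for an LCB+LGCB pair):
# logits = M2CNetSkeleton(LocalCaptureBlock)(torch.randn(1, 3, 224, 224))  # (1, 100)
```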
Figure 3. Comparison of standard multi-head attention (a) and global subsampling attention (b)
Q, K and V denote the query, key and value, respectively; H and W denote the height and width of the input image, respectively; s is the sub-window size; C is the number of channels.
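Global subsampling attention cuts the cost of standard multi-head attention by shortening the key/value sequence: K and V are computed from an s×s-subsampled feature map, so attention costs roughly O((HW)²C/s²) instead of O((HW)²C). A minimal sketch under that reading follows; using a strided convolution as the subsampling operator is an assumption (pooling would be equally plausible).

```python
import torch
import torch.nn as nn

class GlobalSubsamplingAttention(nn.Module):
    """Sketch of GSA: queries come from every pixel, while keys/values come
    from an s x s subsampled map, shrinking the attended sequence by s^2."""

    def __init__(self, dim: int, heads: int, s: int):
        super().__init__()
        # Strided convolution as the subsampling operator (an assumption).
        self.sub = nn.Conv2d(dim, dim, kernel_size=s, stride=s) if s > 1 else nn.Identity()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)             # (B, H*W, C) full-resolution queries
        kv = self.sub(x).flatten(2).transpose(1, 2)  # (B, H*W/s^2, C) subsampled keys/values
        out, _ = self.attn(q, kv, kv, need_weights=False)
        return out.transpose(1, 2).reshape(b, c, h, w)

# Example: GlobalSubsamplingAttention(36, 1, 4)(torch.randn(1, 36, 56, 56))
```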
Figure 4. Lightweight global capture block
a: Conditional positional encoding; b: Global subsampling attention; c: Lightweight feed-forward network. di: channel dimension; s: sub-window size; h: number of attention heads; H and W denote the height and width of the input features, respectively.
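Putting Figure 4 together: a hedged sketch of the lightweight global capture block, combining conditional positional encoding (rendered here as a residual 3×3 depthwise convolution, CPVT-style), the GlobalSubsamplingAttention sketch above, and a lightweight feed-forward network with expansion ratio R. The pre-norm residual wiring is an assumption, not taken from the paper.

```python
import torch
import torch.nn as nn

class LightweightGlobalCaptureBlock(nn.Module):
    """Sketch of an LGCB: (a) conditional positional encoding, (b) global
    subsampling attention, (c) lightweight feed-forward network."""

    def __init__(self, dim: int, heads: int, s: int, ratio: int = 4):
        super().__init__()
        # (a) CPE as a residual 3x3 depthwise convolution (CPVT-style assumption).
        self.cpe = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.norm1 = nn.GroupNorm(1, dim)  # LayerNorm-like norm for 2D maps
        # (b) GSA, reusing the GlobalSubsamplingAttention sketch above.
        self.gsa = GlobalSubsamplingAttention(dim, heads, s)
        self.norm2 = nn.GroupNorm(1, dim)
        # (c) LFFN: 1x1 expand by `ratio`, GELU, 1x1 project back.
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, dim * ratio, 1),
            nn.GELU(),
            nn.Conv2d(dim * ratio, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.cpe(x)               # inject position information
        x = x + self.gsa(self.norm1(x))   # long-range token mixing
        x = x + self.ffn(self.norm2(x))   # per-location channel mixing
        return x
```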
Figure 6. Identification results on the pest and disease datasets
Bar width scales linearly with the number of model parameters: the more parameters, the wider the bar. Bars in the same color family belong to the same comparison group, and the darkest bar in each family corresponds to an M2CNet variant.
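The bar-width convention of Figure 6 can be reproduced with matplotlib's `width` argument; a minimal sketch using three rows taken from Table 3:

```python
import matplotlib.pyplot as plt

# Three (model, parameters in M, Top-1 accuracy %) rows taken from Table 3.
rows = [("MobileNet-V2", 2.4, 69.16), ("M2CNet-B", 3.5, 75.32), ("ResNet 18", 11.2, 76.85)]
names, params, acc = zip(*rows)

max_width = 0.8
widths = [max_width * p / max(params) for p in params]  # width linear in parameter count

fig, ax = plt.subplots()
ax.bar(range(len(rows)), acc, width=widths)
ax.set_xticks(range(len(rows)))
ax.set_xticklabels(names)
ax.set_ylabel("Top-1 accuracy / %")
plt.show()
```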
Table 1  Taxonomy of the IP102 dataset at different class levels

Crop        Pest classes   Training set   Test set
Rice        14             6734           1683
Corn        13             11212          2803
Wheat       9              2734           684
Sugarbeet   8              3536           884
Alfalfa     13             8312           2078
Grape       16             14041          3510
Orange      19             5818           1455
Mango       10             7790           1948
Total       102            60177          15045
Table 2  M2CNet-S/B/L network architecture 1)

Stage 1, output 56×56
  Conv. downsampling: 4×4, stride 4, channels 36 (S) / 48 (B) / 48 (L)
  [DW-sep conv 3×3, 1×1; cyclic FC 3×1, 1×3; GSA H1 = 1, s1 = 4; LFFN R1 = 4]: ×1 (S) / ×1 (B) / ×1 (L)
Stage 2, output 28×28
  Conv. downsampling: 2×2, stride 2, channels 72 (S) / 96 (B) / 96 (L)
  [DW-sep conv 3×3, 1×1; cyclic FC 3×1, 1×3; GSA H2 = 2, s2 = 2; LFFN R2 = 4]: ×2 (S) / ×1 (B) / ×2 (L)
Stage 3, output 14×14
  Conv. downsampling: 2×2, stride 2, channels 144 (S) / 192 (B) / 192 (L)
  [DW-sep conv 3×3, 1×1; cyclic FC 3×1, 1×3; GSA H3 = 4, s3 = 2; LFFN R3 = 4]: ×3 (S) / ×4 (B) / ×6 (L)
Stage 4, output 7×7
  Conv. downsampling: 2×2, stride 2, channels 288 (S) / 384 (B) / 384 (L)
  [DW-sep conv 3×3, 1×1; cyclic FC 3×1, 1×3; GSA H4 = 8, s4 = 1; LFFN R4 = 4]: ×2 (S) / ×2 (B) / ×4 (L)
Output, 1×1: fully connected, 100 classes
No. of parameters (M): 1.83 (S) / 3.52 (B) / 5.76 (L)
Floating point operations (G): 0.23 (S) / 0.39 (B) / 0.60 (L)

1) The input image size defaults to 224 pixels × 224 pixels; Conv. denotes a convolution operation and stride its step. Each bracketed block stacks a depthwise separable convolution, a multi-layer cyclic fully connected layer, a global subsampling attention (GSA) and a lightweight feed-forward network (LFFN); Hi and si are the number of heads and the subsampling size of the i-th global subsampling attention, and Ri is the feature-dimension scaling ratio of the i-th lightweight feed-forward network.
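The three variants in Table 2 differ only in channel widths and block counts per stage; one hedged way to encode them, reusing the classes from the earlier sketches (the stage-2 count of ×1 for M2CNet-B is as printed in the table):

```python
import torch.nn as nn

# Stage-wise widths and block counts read off Table 2; heads, subsampling
# sizes and the LFFN ratio R are shared across variants.
M2CNET_CONFIGS = {
    "S": dict(channels=(36, 72, 144, 288), depths=(1, 2, 3, 2)),
    "B": dict(channels=(48, 96, 192, 384), depths=(1, 1, 4, 2)),
    "L": dict(channels=(48, 96, 192, 384), depths=(1, 2, 6, 4)),
}
HEADS = (1, 2, 4, 8)      # H1..H4
SUBSAMPLE = (4, 2, 2, 1)  # s1..s4
FFN_RATIO = 4             # R1..R4 = 4 in all stages

def capture_pair(stage: int):
    """Block factory pairing one LCB with one LGCB for a given stage, usable
    with the skeleton sketched under Figure 1 (LocalCaptureBlock and
    LightweightGlobalCaptureBlock come from the earlier sketches); a
    per-stage skeleton would pass capture_pair(i) when building stage i."""
    def block_fn(dim: int) -> nn.Module:
        return nn.Sequential(
            LocalCaptureBlock(dim),
            LightweightGlobalCaptureBlock(dim, HEADS[stage], SUBSAMPLE[stage], FFN_RATIO),
        )
    return block_fn
```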
Table 3  Model comparison results on the CIFAR100 dataset

Model                Parameters/M   FLOPs/G   Top-5 accuracy/%   Top-1 accuracy/%
ShuffleNet-V2 0.5    0.4            0.04      72.74              41.83
ShuffleNet-V2 1.0    1.4            0.15      86.21              59.65
ShuffleNet-V2 1.5    2.6            0.30      90.08              66.56
ShuffleNet-V2 2.0    5.6            0.56      93.06              72.79
SqueezeNet 1.0       0.8            0.75      78.48              49.68
SqueezeNet 1.1       0.8            0.30      78.12              50.14
MobileNet-V3-Small   1.6            0.06      87.90              61.74
MobileNet-V2         2.4            0.31      91.69              69.16
MobileNet-V3-Large   4.3            0.23      93.57              73.27
MnasNet 0.5          1.1            0.11      88.13              62.60
MnasNet 0.75         2.0            0.22      91.44              69.20
MnasNet 1.0          3.2            0.32      92.81              72.70
MnasNet 1.3          5.1            0.54      94.41              76.64
EfficientNet B0      4.1            0.40      94.63              76.00
EfficientNet B1      6.6            0.60      94.95              77.96
ResNet 18            11.2           1.80      94.66              76.85
VGG 11               129.2          7.60      94.25              75.82
VGG 13               129.4          11.30     94.38              76.46
VGG 16               134.7          15.50     94.63              78.19
VGG 19               140.0          19.60     95.25              78.19
MobileViT-XXS        1.0            0.33      84.98              55.96
MobileViT-XS         2.0            0.90      89.55              64.34
MobileViT-S          5.1            1.75      93.64              72.93
M2CNet-S             1.8            0.23      92.46              71.09
M2CNet-B             3.5            0.39      94.16              75.32
M2CNet-L             5.8            0.60      95.31              78.39