Research on Mask Wearing Detection of Natural Population Based on Improved YOLOv4
† Xuecheng Wu, Mengmeng Tian, and Lanhang Zhai are all with the School of Cyber Science and Engineering, Zhengzhou University. (Corresponding author: Xuecheng Wu.)

Xuecheng Wu
Zhengzhou University
Zhengzhou, Henan 450002, China
wuxc@stu.zzu.edu.cn

Mengmeng Tian
Zhengzhou University
Zhengzhou, Henan 450002, China
tmm@stu.zzu.edu.cn

Lanhang Zhai
Zhengzhou University
Zhengzhou, Henan 450002, China
zhailhang@163.com

Abstract

The domestic COVID-19 epidemic has recently been serious, yet in public places some people do not wear masks or wear them incorrectly, so the relevant staff must promptly remind and supervise them. Because this work is both important and labor-intensive, automated mask-wearing detection in public places is necessary. This paper proposes a mask-wearing detection method based on an improved YOLOv4. First, we add the Coordinate Attention Module to the backbone to coordinate feature fusion and representation. Second, we make a series of structural improvements to the network to enhance the model’s performance and robustness. Third, we apply the K-means clustering algorithm to make the nine anchor boxes better suited to our NPMD dataset. Experiments show that the improved YOLOv4 exceeds the baseline by 4.06% mAP at a comparable speed of 64.37 FPS.

YOLOv4, Coordinate Attention, K-means Clustering, Mask Wearing Detection.

I Introduction

The new coronavirus can survive in various droplets for 24-48 hours. Wearing a mask that covers the mouth and nose in public is an effective way to prevent infection. However, in some public places such as shopping malls and subway stations, some people do not take the initiative to wear masks or wear them incorrectly, so mask-wearing detection is needed. To detect efficiently while minimizing disruption to the public, it is necessary to apply computer vision technology to mask wearing detection. Existing detectors have improved detection effectiveness to a certain extent, but they are sensitive to the external environment and to the shape and color of the masks, and their detection accuracy for incorrect mask wearing is low.

In recent years, a steady stream of neural-network-based object detection methods has emerged, including many region-proposal-based detectors such as Faster R-CNN. Researchers later proposed one-stage object detectors, among which the YOLO series has developed rapidly. The YOLO detector [4] was first proposed by Joseph Redmon et al. in 2015 and was followed by YOLOv2 [5] and YOLOv3 [7]. In 2020, Alexey Bochkovskiy et al. proposed YOLOv4 [1], which builds on YOLOv3 and applies many emerging methods to achieve an optimal balance between speed and accuracy. In practice, applying existing mainstream detection methods to mask wearing detection of the natural population is affected by various factors, such as different styles and colors of masks, the skin color of the wearer, and weather. As a result, the accuracy of the detectors drops to some extent, their robustness is poor, and they cannot meet the real-time requirements of natural scenes. For example, Faster R-CNN achieves high accuracy, but its network structure keeps its detection speed below basic real-time requirements, while YOLOv3 [7] performs poorly on small objects and its overall detection performance is also relatively weak.

To address the abovementioned problems, this paper further optimizes YOLOv4 [1] and deploys the Coordinate Attention Module to jointly represent the inter-channel relationships and precise positional information of the intermediate feature maps. We then adjust the structure of the original network, increasing the depth and capacity of the overall network. Moreover, we utilize the K-means clustering algorithm to make the nine anchor boxes better suited to our NPMD dataset. In this way, the improved YOLOv4 proposed in this paper can better complete the task of mask wearing detection in natural scenarios.

II Methodology

The network structure of YOLOv4 is composed of a backbone, a neck network, and three YOLO Heads for different levels. In this paper, we specifically optimize the YOLOv4 for the characteristics of the mask wearing detection for the natural population. The overall illustration of our improved YOLOv4 is shown in Fig. 1.

II-A Coordinate Attention Module

Coordinate Attention [2] is a new attention mechanism proposed by Qibin Hou and others at the National University of Singapore. This attention mechanism embeds specific positional information into inter-channel attention. It addresses common problems in SE, BAM and CBAM and achieves better results without introducing significant computational overhead. In this paper, a Coordinate Attention Module is added after the convolution transformation layer denoted as DarknetConv2D-BN-Mish in the backbone and before the residual convolution transformation blocks. This strengthens the semantic representation of the shallow feature maps and gathers richer feature information over a larger region, which can further improve the performance of the backbone. Compared with the original YOLOv4, the improved YOLOv4 combined with the Coordinate Attention Module can locate and identify the objects of interest more accurately.

Coordinate Attention first performs average pooling in the horizontal and vertical directions and then applies a transformation to encode the specific positional information accurately. Finally, the positional information is fused by weighting the feature channels. The Coordinate Attention Module is divided into two steps: coordinate information embedding and coordinate attention generation.

In the step of coordinate information embedding, the Coordinate Attention Module first factorizes the global average pooling into a pair of 1D feature encoding operations so that it can capture long-range spatial interactions with precise positional information. Given the input $X$ , we deploy two pooling kernels of size $(H, 1)$ and $(1, W)$ to encode each feature channel along the horizontal and vertical directions. These two transformations aggregate features along the horizontal and vertical directions and yield a pair of direction-aware feature maps. As a result, the output of the $c$ -th channel at height $h$ and the output of the $c$ -th channel at width $w$ can be formulated as Eq. 1 and Eq. 2, respectively.

q_{c}^{h} (h) = \frac{1}{W} \sum_{0 \leq i < W} x_{c} (h, i)

(1)

q_{c}^{w} (w) = \frac{1}{H} \sum_{0 \leq j < H} x_{c} (j, w)

(2)

The $q_{c}^{h}$ and $q_{c}^{w}$ in Eq. 1 and Eq. 2 denote the outputs which are obtained after the average pooling operation along the horizontal and vertical directions, respectively.
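As a minimal sketch of Eq. 1 and Eq. 2, the two directional pooling operations can be written for a single channel stored as a nested list. The function name `directional_average_pool` and the toy feature map are illustrative assumptions, not part of our implementation.

```python
def directional_average_pool(x):
    """Pool one channel x (a list of H rows, each with W values) along each axis.

    Returns (q_h, q_w): q_h[h] is the mean of row h (Eq. 1) and
    q_w[w] is the mean of column w (Eq. 2).
    """
    height, width = len(x), len(x[0])
    q_h = [sum(row) / width for row in x]
    q_w = [sum(x[j][w] for j in range(height)) / height for w in range(width)]
    return q_h, q_w


# Toy 2 x 3 feature map (hypothetical values).
feature_map = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
q_h, q_w = directional_average_pool(feature_map)
print(q_h)  # [2.0, 5.0]
print(q_w)  # [2.5, 3.5, 4.5]
```

Each output keeps one spatial axis intact, which is what lets the later attention weights stay position-aware along that axis.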

In the coordinate attention generation step, as shown in Eq. 3, we first stack the pair of direction-aware feature maps generated in the coordinate information embedding step and compress the number of feature channels with a $1 \times 1$ convolution transformation operation. We then encode the precise positional information in the horizontal and vertical directions through a BatchNorm layer and a ReLU layer. Afterward, $f$ is split along the spatial dimension into two separate feature tensors $f^{h}$ and $f^{w}$ , and two convolution transformation operations are applied to transform $f^{h}$ and $f^{w}$ into feature tensors with the same number of channels as the input $X$ . We then obtain $g^{h}$ and $g^{w}$ by normalized weighting, as shown in Eq. 4 and Eq. 5.

f = δ (F_{1} ([q^{h}, q^{w}]))

(3)

g^{h} = σ (F_{h} (f^{h}))

(4)

g^{w} = σ (F_{w} (f^{w}))

(5)

Fig. 1: The overall illustration of the improved YOLOv4 proposed in this paper. Here, SPP refers to the layer of spatial pyramid pooling.

In Eq. 3, $F_{1} (\cdot)$ denotes first stacking the two input feature tensors, followed by a $1 \times 1$ convolution transformation and a BatchNorm layer. $δ (\cdot)$ represents the ReLU activation function and $f \in R^{C / r \times (H + W)}$ is the encoded intermediate feature map. In Eq. 4 and Eq. 5, $F_{h} (\cdot)$ and $F_{w} (\cdot)$ respectively denote a $1 \times 1$ convolution transformation operation. $σ (\cdot)$ represents a sigmoid function. The outputs $g^{h}$ and $g^{w}$ are then extended and utilized as the Coordinate Attention weights, respectively. The final output of Coordinate Attention is shown in Eq. 6.

y_{c} (i, j) = x_{c} (i, j) \times g_{c}^{h} (i) \times g_{c}^{w} (j)

(6)

In Eq. 6, $x_{c} (i, j)$ denotes the original input feature tensor, and $g_{c}^{h} (i)$ and $g_{c}^{w} (j)$ denote the horizontal attention weight and the vertical attention weight, respectively. $y_{c} (i, j)$ denotes the weighted feature maps.
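The full forward pass of Eq. 1 to Eq. 6 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the BatchNorm layer and bias terms are omitted, the weight matrices `w1`, `w_h` and `w_w` are hypothetical placeholders for the learned $1 \times 1$ convolutions, and the helper names are not taken from our code.

```python
import math


def conv1x1(weights, features):
    """1x1 convolution: weights[out][in] mixes channels at every position."""
    length = len(features[0])
    return [
        [sum(w_in * features[c][p] for c, w_in in enumerate(row))
         for p in range(length)]
        for row in weights
    ]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def coordinate_attention(x, w1, w_h, w_w):
    """x[c][i][j] is the input; BatchNorm and biases are omitted (assumption)."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    # Coordinate information embedding (Eq. 1 and Eq. 2).
    q_h = [[sum(x[c][i]) / width for i in range(height)]
           for c in range(channels)]
    q_w = [[sum(x[c][i][j] for i in range(height)) / height
            for j in range(width)] for c in range(channels)]
    # Eq. 3: concatenate along the spatial axis, reduce channels, apply ReLU.
    stacked = [q_h[c] + q_w[c] for c in range(channels)]
    f = [[max(0.0, v) for v in row] for row in conv1x1(w1, stacked)]
    # Split f into its height part f^h and its width part f^w.
    f_h = [row[:height] for row in f]
    f_w = [row[height:] for row in f]
    # Eq. 4 and Eq. 5: restore the channel count and squash to (0, 1).
    g_h = [[sigmoid(v) for v in row] for row in conv1x1(w_h, f_h)]
    g_w = [[sigmoid(v) for v in row] for row in conv1x1(w_w, f_w)]
    # Eq. 6: reweight every position by its row weight and column weight.
    return [[[x[c][i][j] * g_h[c][i] * g_w[c][j] for j in range(width)]
             for i in range(height)] for c in range(channels)]


# Toy input with C = 2, H = 2, W = 3 and reduction ratio r = 2 (hypothetical).
x = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[0.5, 0.5, 0.5], [1.5, 1.5, 1.5]],
]
w1 = [[0.5, -0.5]]       # C = 2 -> C / r = 1
w_h = [[1.0], [-1.0]]    # C / r = 1 -> C = 2
w_w = [[0.5], [0.5]]
y = coordinate_attention(x, w1, w_h, w_w)
print(len(y), len(y[0]), len(y[0][0]))  # 2 2 3
```

The output has the same shape as the input, and because both weights pass through a sigmoid, every value is scaled by a factor strictly between 0 and 1.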

II-B Network Structural Improvement

Although the neck network in the original YOLOv4 can strengthen the semantic representation capability of feature maps to a certain extent, its detection is easily degraded by complex natural scenes, mutual occlusion of people, and the less discriminative features of faces covered by masks. Therefore, we make a series of structural improvements to the original neck network. First, we increase the number of convolution transformation layers before and after the Spatial Pyramid Pooling layer from three to five. Second, inspired by the original YOLOv4, we increase the number of convolution transformation layers from one to three before $L 3$ and $L 4$ are fed into the neck network, respectively. The improved neck network increases the depth and capacity of the overall network, yielding larger receptive fields and richer semantic feature information, which improves the model performance.

II-C K-means Clustering

K-means clustering [3] is an iterative clustering algorithm. The nine anchor boxes in the original YOLOv4 were computed by applying K-means clustering to the MS-COCO dataset. MS-COCO contains 80 object categories whose sizes vary greatly, so if its anchors are used directly for mask wearing detection, the sizes of some anchor boxes are unreasonable. Therefore, we run K-means clustering again on our self-collected NPMD dataset to obtain nine anchor box sizes suited to it. In detail, we first initialize the number of categories and the cluster centers. We then calculate the distance, 1-IoU, between each bounding box and every cluster center. After that, we assign each bounding box to its nearest cluster center and use the average of each cluster as its center for the next iteration. We repeat these two steps until the cluster centers no longer change, which yields reasonable sizes for the nine anchor boxes on our NPMD dataset.
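The clustering steps above can be sketched as follows. This is a minimal illustration rather than our exact implementation: boxes are reduced to (width, height) pairs aligned at a common corner, the initial centers are picked deterministically from the boxes sorted by area (the initialization scheme is an assumption), and the toy example uses $k = 3$ instead of nine to stay short.

```python
def iou_wh(box, center):
    """IoU of two (width, height) boxes that share the same top-left corner."""
    inter = min(box[0], center[0]) * min(box[1], center[1])
    union = box[0] * box[1] + center[0] * center[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, max_iters=100):
    """Cluster (width, height) pairs with distance 1 - IoU; anchors sorted by area."""
    # Assumption: deterministic initialization from evenly spaced boxes
    # in area order; any other initialization could be substituted.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    step = len(ordered) / k
    centers = [ordered[int(step * (i + 0.5))] for i in range(k)]
    for _ in range(max_iters):
        # Assignment step: nearest center under the 1 - IoU distance.
        clusters = [[] for _ in range(k)]
        for box in boxes:
            distances = [1.0 - iou_wh(box, c) for c in centers]
            clusters[distances.index(min(distances))].append(box)
        # Update step: mean width and height of each cluster.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                mean_w = sum(b[0] for b in members) / len(members)
                mean_h = sum(b[1] for b in members) / len(members)
                new_centers.append((mean_w, mean_h))
            else:
                new_centers.append(center)  # keep the center of an empty cluster
        if new_centers == centers:
            break  # centers no longer change
        centers = new_centers
    return sorted(centers, key=lambda c: c[0] * c[1])


# Hypothetical box sizes in pixels, forming three obvious size groups.
boxes = [(10, 12), (12, 10), (11, 11),
         (50, 60), (60, 50), (55, 55),
         (200, 180), (180, 200), (190, 190)]
anchors = kmeans_anchors(boxes, k=3)
print(anchors)  # [(11.0, 11.0), (55.0, 55.0), (190.0, 190.0)]
```

Using 1-IoU instead of Euclidean distance keeps large boxes from dominating the clustering, since the distance depends on overlap rather than absolute size.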

III Experiments

The experiments in this paper use the PyTorch 1.8.1+cu101 framework with Python 3.7 on Ubuntu 18.04, four NVIDIA RTX 2080 Ti GPUs, and PyCharm 2020.3 as the integrated development environment. The input resolution of the overall network is $416 \times 416$ . We employ Adam as the optimizer for training. In the freezing training phase, we train for 50 epochs with an initial learning rate of 0.001. In the global training phase, we train for 70 epochs with an initial learning rate of 0.0001, and the learning rate is adjusted with the cosine annealing decay strategy.
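For illustration, the standard cosine annealing schedule can be written as below. The minimum learning rate of 0, the single cycle spanning the 70 epochs of the global training phase, and the per-epoch update are assumptions made for this sketch; the function name `cosine_annealing_lr` is hypothetical.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, initial_lr, min_lr=0.0):
    """Learning rate at `epoch` when decaying from initial_lr to min_lr along a cosine."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Global training phase: 70 epochs, initial learning rate 0.0001.
for epoch in (0, 35, 70):
    print(epoch, cosine_annealing_lr(epoch, total_epochs=70, initial_lr=1e-4))
```

The schedule starts at the initial learning rate, passes through half of it at the midpoint, and reaches the minimum at the final epoch.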

III-A NPMD Dataset Introduction

Currently, there are very few datasets for mask wearing detection in natural scenes; the scenes they cover lack diversity, and images of incorrect mask wearing are generally scarce. Therefore, we propose the NPMD (Natural Population Mask Detection) dataset. It mainly consists of images gathered by a web crawler, images selected from public datasets such as MAFA and RMFD, and frames extracted from online videos, all filtered to meet our requirements. In addition, we produced some images ourselves to expand the NPMD dataset further. The original dataset contains 7,854 images in three categories (Mask Wearing Correctly, Without Mask Wearing, and Mask Wearing Incorrectly), covering multiple public natural scenes such as stations, subway stations, and supermarkets. Three common examples of incorrect mask wearing are leaving the nose uncovered, leaving both the mouth and nose uncovered, and wearing the mask on the chin. The dataset is in the Pascal VOC format and was annotated with the LabelImg tool. Because the original NPMD dataset is relatively small and contains relatively few instances of incorrect mask wearing, we apply a variety of random data augmentations, such as affine rotation transformation, Gaussian filtering, random color enhancement, and median blur, to expand it. After augmentation, NPMD contains 11,447 images. The number of labels per category after random data augmentation is shown in Tab. I. In this way, we reduce the risk of overfitting during model training and improve the robustness and generalization of our improved YOLOv4.

Category	Number
Mask Wearing Correctly	14185
Without Mask Wearing	7857
Mask Wearing Incorrectly	6703
Sum	28745

TABLE I: The number of labels by three categories in our proposed NPMD (Natural Population Mask Detection) dataset.

III-B Evaluation Indicators

The evaluation indicators chosen in our experiments are Average Precision ( $A P$ ) and Mean Average Precision ( $m A P$ ). $A P$ measures model performance in terms of the precision $P$ and the recall $R$ . Precision is the proportion of samples that are actually positive among all samples predicted to be positive. Recall is the proportion of samples predicted to be positive among all samples that are actually positive. The general formulas of $P$ and $R$ are shown in Eq. 7 and Eq. 8, respectively.

P = \frac{T P}{T P + F P}

(7)

R = \frac{T P}{T P + F N}

(8)

In the abovementioned formulas, $T P$ is the number of positive samples correctly detected as positive, $F P$ is the number of negative samples detected as positive, and $F N$ is the number of positive samples detected as negative.
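Eq. 7 and Eq. 8 translate directly into code. The counts in the example are hypothetical and chosen only to make the arithmetic easy to follow; returning 0 when a denominator is zero is a convention assumed for this sketch.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) computed from TP, FP and FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts: 90 correct detections, 10 false alarms, 30 missed objects.
p, r = precision_recall(tp=90, fp=10, fn=30)
print(p, r)  # 0.9 0.75
```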

$A P$ is calculated as the integral of the precision-recall curve. The higher the $A P$ , the better the model performance, and its general formula is shown in Eq. 9.

A P = \int_{0}^{1} P (R) \, d R

(9)
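In practice the integral in Eq. 9 is approximated from a finite set of points on the precision-recall curve. The sketch below uses the trapezoidal rule, which is one possible choice and an assumption here; evaluation toolkits often use interpolated precision instead. The curve points are hypothetical.

```python
def average_precision(recalls, precisions):
    """Approximate the area under a precision-recall curve with the trapezoidal rule.

    recalls must be sorted in increasing order; precisions[i] is P at recalls[i].
    """
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2.0
    return area


# Hypothetical curve points, not measurements from our experiments.
recalls = [0.0, 0.5, 1.0]
precisions = [1.0, 1.0, 0.5]
ap = average_precision(recalls, precisions)
print(ap)  # 0.875
```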

$m A P$ is the average of the $A P$ values over all categories, which measures the average detection accuracy across multiple object categories and reflects the comprehensive performance of a detector. The general formula of $m A P$ is shown in Eq. 10, where $c$ is the number of categories and $A P_{i}$ is the $A P$ of the $i$ -th category.

m A P = \frac{\sum_{i = 1}^{c} A P_{i}}{c}

(10)
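As a quick worked check of Eq. 10, the per-category APs of the improved YOLOv4 reported in Tab. III average to the reported mAP. The function name `mean_average_precision` is an illustrative choice.

```python
def mean_average_precision(per_class_ap):
    """Average the per-category AP values (Eq. 10)."""
    return sum(per_class_ap) / len(per_class_ap)


# APs (%) of the improved YOLOv4 for Cat.1, Cat.2 and Cat.3, taken from Tab. III.
improved_yolov4_ap = [96.12, 93.99, 95.84]
m_ap = round(mean_average_precision(improved_yolov4_ap), 2)
print(m_ap)  # 95.32
```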

III-C Ablation Study

We conduct a series of ablation experiments to verify the effectiveness of the improved YOLOv4. There are four groups: the first group is the original YOLOv4, which serves as a control for judging whether each improvement component is effective; the second group evaluates the Coordinate Attention Module; the third group evaluates the neck network structural improvement; and the last group evaluates K-means clustering. The detailed results of the ablation experiments are shown in Tab. II.

The ablation experiments show that each improvement component benefits the improved YOLOv4. The Coordinate Attention Module increases the mAP by 2.67%. We attribute this gain to applying the Coordinate Attention Module at the front end of the backbone for guidance, which yields a stronger semantic representation of the shallow-level feature maps. The structural improvement of the network in turn enlarges the receptive fields of the deep-level feature maps, so the network can extract deeper and richer feature information. K-means clustering makes the sizes of the nine anchor boxes better aligned with our NPMD dataset, further improving the model performance.
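The per-component gains can be read off Tab. II with simple arithmetic, sketched below. The mAP and FPS values are copied from Tab. II; the variable names are illustrative.

```python
# (method, mAP in %, FPS) rows copied from Tab. II.
ablation = [
    ("Original YOLOv4", 91.26, 69.26),
    ("+ CA Module", 93.93, 66.82),
    ("+ Structural Improvement", 94.71, 64.37),
    ("+ K-means Clustering", 95.32, 64.37),
]

# mAP gain of each row over the row above it, in percentage points.
gains = [round(ablation[i][1] - ablation[i - 1][1], 2)
         for i in range(1, len(ablation))]
total_gain = round(ablation[-1][1] - ablation[0][1], 2)
# Relative FPS reduction of the final model versus the baseline, in percent.
fps_drop_percent = 100.0 * (ablation[0][2] - ablation[-1][2]) / ablation[0][2]

print(gains)       # [2.67, 0.78, 0.61]
print(total_gain)  # 4.06
print(round(fps_drop_percent, 2))
```

The three increments sum to the overall 4.06-point improvement, while the speed falls from 69.26 to 64.37 FPS.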

III-D Comparison with other SOTA Detectors

First, the APs of the improved YOLOv4 for the three categories are 96.12%, 93.99%, and 95.84%, respectively, and its mAP reaches 95.32%. Compared with the original YOLOv4, the mAP increases by 4.06% at the cost of a small computational overhead.

To comprehensively evaluate the performance of our improved YOLOv4, we then compare it with other CNN-based state-of-the-art object detectors on the NPMD dataset. The detailed results are shown in Tab. III, where “Cat.1”, “Cat.2” and “Cat.3” represent Mask Wearing Correctly, Without Mask Wearing, and Mask Wearing Incorrectly, respectively. Compared with Faster R-CNN, SSD, YOLOv3 and the original YOLOv4, the improved YOLOv4 achieves higher APs in all three categories, and its mAP is also better than that of these four SOTA detectors. Our improved YOLOv4 further improves the detection of small objects and the processing of detailed features. This comparison with current mainstream object detectors further confirms the effectiveness of the improved YOLOv4.

IV Conclusions

The current situation of COVID-19 is severe and complex, and mask wearing is one of the most effective prevention methods. This paper proposes a new mask wearing detection method based on improved YOLOv4. The extensive experimental results on our NPMD dataset show that the improved YOLOv4 has good accuracy, can better meet the actual needs of epidemic prevention and control, and realize comprehensive and accurate mask wearing detection in natural scenes.

Method	$m A P$ (%)	Parameters	FPS
Original YOLOv4	91.26	64.43M	69.26
+ CA Module	93.93	64.72M	66.82
+ Structural Improvement	94.71	65.48M	64.37
+ K-means Clustering	95.32	65.48M	64.37

TABLE II: The ablation experiment on the effectiveness of each improvement component in our improved YOLOv4. Moreover, we evaluate the model performances in terms of mAP(%) on the NPMD dataset.

Method	Cat.1 AP(%)	Cat.2 AP(%)	Cat.3 AP(%)	mAP(%)
Faster R-CNN	88.13	89.14	92.11	89.79
SSD [6]	84.72	81.69	94.09	86.84
YOLOv3 [7]	87.07	89.27	93.74	90.02
YOLOv4 [1]	90.29	89.65	93.85	91.26
Ours-YOLOv4	96.12	93.99	95.84	95.32

TABLE III: Performance comparison of different mainstream object detectors on NPMD dataset. Moreover, we train all the models with the same learning schedule and optimizing parameters for a fair comparison.

References

[1] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. 2020. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020).
[2] Qibin Hou, Daquan Zhou, and Jiashi Feng. 2021. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13713–13722.
[3] Anil K. Jain. 2010. Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31, 8 (2010), 651–666.
[4] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 779–788.
[5] Joseph Redmon and Ali Farhadi. 2017. YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7263–7271.
[6] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. 2016. Ssd: Single shot multibox detector. In European conference on computer vision. Springer, 21–37.
[7] Joseph Redmon and Ali Farhadi. 2018. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
