Integrate Memory Mechanism in Multi-granularity Deep Framework for Driver Drowsiness Detection

Driver drowsiness detection is a critical task for early warning in safe driving, yet existing spatial feature-based methods are challenged by large variations in head pose. This paper proposes a novel approach that integrates a memory mechanism into a multi-granularity deep framework to detect driver drowsiness, so that temporal dependencies over sequential frames are combined with a spatial deep learning framework on frontal faces. The proposed approach includes two steps. First, the spatial Multi-granularity Convolutional Neural Network (MCNN) applies a group of parallel CNN extractors to well-aligned facial patches of different granularities and extracts facial representations that remain effective under large variations in head pose. Furthermore, it can flexibly fuse both detailed appearance clues of the main parts and local-to-global spatial constraints. Second, the memory mechanism is set up by applying a deep long short-term memory network to the facial representations to explore long-term relationships of variable length over sequential frames, which makes it capable of distinguishing states with temporal dependencies, such as blinking and closing eyes. The proposed approach achieves 90.05% accuracy and a speed of about 37 FPS on the evaluation set of the National Tsing Hua University Driver Drowsiness Detection (NTHU-DDD) dataset, and it is applied on an intelligent vehicle for driver drowsiness detection. A dataset named Forward Instant Driver Drowsiness Detection (FI-DDD) is also built and will be made publicly accessible to speed up the study of driver drowsiness detection.


INTRODUCTION
Driver drowsiness is a critical problem that causes 6% of serious road accidents each year [1]. Driver drowsiness indicates that the driver lacks sleep, and it can be detected from variations in physiological signals [2][3][4][5], vehicle trajectory [6,7], and facial expressions [8]. Drowsiness detection using vehicle-based, physiological, and behavioral measurement systems is possible, and each has its own pros and cons [9]. Subjective techniques cannot be used in a real driving situation but are helpful in simulations for determining drowsiness. Physiological signals such as the electrocardiogram, electroencephalogram (EEG), and electrooculogram can be utilized for drowsiness detection. Vehicle-movement-based detection is another technique. Here, information is obtained from sensors attached to the steering wheel, the acceleration pedal, or the body of the vehicle. Signals collected from the sensors are continuously monitored to identify noticeable variations that indicate driver drowsiness. Drowsiness never comes instantly but appears with visually noticeable symptoms. These symptoms generally appear well before drowsiness sets in, for every driver. Moreover, drowsiness is reflected in facial expressions and behaviors, such as nodding, yawning, and closing eyes. We therefore aim to develop a drowsiness detection method based on video. A video-based method can give warning prompts and receive the driver's feedback in time, which is of great value in practice.
Video-based drowsiness detection is still full of challenges, mainly stemming from changes in illumination, head pose variation, and temporal dependencies. In particular, large variations in head pose cause serious deformations of the facial shape, which makes it difficult to extract effective spatial representations. Conventionally, an approach based on aligned facial points [8] is a better way to represent drowsy features; however, because it ignores temporal relationships, it cannot distinguish blinking from closing eyes. A spatial-temporal descriptor [10] has been proposed to collect spatial and temporal features, but it is not good at distinguishing states with long-term dependencies, such as yawning and speaking. Besides, these handcrafted descriptors are not powerful enough to describe large variations in head pose or to classify confusing states; e.g., looking aside and lowering the head lead to large pose variation, while yawning and laughing look similar but belong to different states.
Recently, deep learning methods have been widely used to learn facial spatial representations automatically from the global face [11][12][13]. A global face without good alignment provides only weak representations under large pose variation. Moreover, it is not flexible enough to fuse the configurations of local regions and to concentrate the representations on the most important parts, such as the eyes, nose, and mouth, where the majority of drowsy information is found. Another challenge is to distinguish easily confused states, such as blinking and closing eyes. A 3D-CNN with fixed time windows [11] was used to describe spatial and temporal features, but it does not have enough capability to model long-term relationships of variable time length. We propose a Long-term Multi-granularity Deep Framework (LMDF) to detect driver drowsiness from well-aligned facial patches. Our method applies alignment technology to obtain well-aligned facial patches over frames, and these patches are mainly located in the informative regions that supply critical drowsy information. A group of parallel convolutional layers is applied to the multi-granularity facial patches, and the outputs of these layers are fused by a fully connected layer to generate spatial representations; this network is named the Multi-granularity Convolutional Neural Network (MCNN). MCNN is able to fuse the appearance of those well-aligned patches and capture local-to-global constraints. To explore temporal dynamical characteristics, we add a memory mechanism to the MCNN: a Long Short-Term Memory (LSTM) network is applied to the spatial representations over sequential frames, which can distinguish confusing states with temporal relationships, such as yawning, laughing, blinking, and closing eyes. The proposed method can thus not only extract effective facial representations from single-frame images, but also mine temporal clues from videos.
As shown in Fig. 1, the spatial and temporal features are extracted and combined to detect driver drowsiness. The contributions of our approach are mainly in three aspects: (1) We propose MCNN to learn facial representations from the most important parts, which makes the detector robust to large pose variation. (2) We propose the LMDF to learn facial spatial features and their long-term temporal dependencies. (3) We build the FI-DDD dataset with higher precision of drowsy locations in the temporal dimension, which is a good test bed for evaluating practical systems that are required to detect drowsiness in time.

Traditional driver drowsiness detection methods
Driver drowsiness detection is becoming a hot topic in Advanced Driver Assistance Systems. Many traditional methods have been applied to this problem. The change of pupil diameter was utilized by Shirakata et al. [14] to detect imperceptible drowsiness, which is effective, but the equipment is inconvenient for a driver to wear. Nakamura et al. [8] utilized face alignment to estimate the degree of drowsiness via k-NN, which cannot run online. Spatial-temporal features for driver drowsiness detection were proposed by Mahdi et al. [10]; being based on the Hough transform, they do not work well in practical driving environments. The method in [15] detects driver drowsiness based on a time-series analysis of the steering wheel angular velocity. It uses a temporal detection window over the time series of the steering wheel angular velocity and looks for specified indicators of driver drowsiness within that window. Besides, the representations of those methods are hand-crafted, which may not be flexible enough to adapt to the complex situations faced in driving, while our method learns facial representations automatically, which is more effective for the practical task.
In earlier research, the level of drowsiness was estimated from a single measure: vehicle-based, physiological, behavioral, or subjective. Researchers have reported good results using numerous less intrusive techniques to detect the drowsiness of drivers, including monitoring of eyelid movement, gaze, or head movement. Rumagit et al. [16] investigated the relationship between drowsiness and physiological condition by utilizing an eye gaze tracker and the Japanese version of the Karolinska sleepiness scale within a driving simulator environment. Amirudin et al. [17] analyzed two single measures, physiological and behavioral, using EEG signals and video sequences. Jakubowski et al. [18] presented an approach using convolutional neural networks and transfer learning, and reported the results of investigations aimed at developing detectors of selected driver fatigue symptoms based on face images.

CNN and RNN based driver drowsiness detection methods
Deep learning approaches such as CNNs have achieved success in representing information in images [19][20][21] and have been widely used in the field of machine learning [22]. Using CNN models for image classification avoids the high complexity and difficulty of feature extraction in traditional classification methods, and CNNs are therefore increasingly applied in facial recognition [23]. Compared to traditional image classification methods, deep learning methods can use large datasets for training and learn the best features to represent these data, responding better to changes in the real world [24]. Recently, many researchers have also applied CNNs to driver drowsiness detection. Park et al. [25] combined the results of three existing networks with an SVM to predict the categories of videos. Later, [26][27][28][29] also introduced various improved convolutional neural networks that capture facial regions under complex driving conditions to classify videos. However, those models can only classify videos into different categories; they cannot detect driver drowsiness online. A 3D-CNN was applied by Yu et al. [11] to extract spatial and temporal information, but the method can only capture features within a fixed temporal window. The methods of Park et al. [25] and Yu et al. [11] utilize the global face image, so they cannot flexibly configure the patches that contain the majority of drowsy information. Moreover, they can hardly capture dependencies of variable temporal length.
Owing to the strong performance of Long Short-Term Memory networks (LSTMs) on sequential data [30][31][32], more and more researchers propose combinations of CNNs and LSTMs to learn spatial and temporal representations of sequential frames. Liang M. et al. [33] proposed convolutional layers with intra-layer recurrent connections to integrate context information for object recognition. Jeff D. et al. [34] provided a method that extracts visual features from images with a CNN and learns long-term dependencies from sequential data with LSTMs. In particular, the approaches of Jiang W. et al. [35] and Jeong J. H. et al. [36] process images with a CNN and model sequential labels with LSTMs concurrently, and then combine the two representations via projection layers. However, none of the above methods applies a multi-granularity method to concentrate the representations on important parts and flexibly fuse the configurations of different regions.

Multi-granularity methods
Fine-grained methods mostly rely on object detection [37][38][39], classifying all regions after identifying areas that may contain objects. Coarse-grained methods extract and encode overall image features through convolutional networks [40] or vision transformers [41], which can eliminate the interference of fine details, but their performance is often inferior to that of fine-grained methods. Multi-granularity methods combine coarse and fine granularity to capture discriminative spatial and temporal information at different semantic levels [42]. Recently, multi-granularity methods have achieved excellent results in several computer vision applications. Qing Li et al. [43] proposed a temporal multi-granularity approach to action recognition. Their method achieved state-of-the-art performance on action benchmarks, but cannot capture detailed appearance clues and local-to-global spatial information. Dong C. et al. [44] applied multi-scale patches based on face alignment to face recognition. Dequan W. et al. [45] utilized multi-granularity regions, detected by a three-granularity convolutional neural network, to generate a multi-granularity descriptor for fine-grained categorization, but this method cannot process sequential frames. Rui H. et al. [46] proposed a multi-granularity extraction sub-network that extracts more efficient multi-granularity features while compressing the parameters of the network, together with a feature rectification sub-network and a feature fusion sub-network that adaptively recalibrate and fuse the multi-granularity features.
Finally, an LSTM network is applied to distinguish actions with similar appearances. However, this method cannot focus on the most significant regions to obtain the most precise result and speed up inference. Different from the above, our method can capture both spatial multi-granularity information and long-term temporal dependencies. In particular, our MCNN learns representations of the most significant regions from well-aligned multi-granularity patches, and the proposed method achieves state-of-the-art accuracy on the NTHU-DDD [47] dataset for driver drowsiness detection.

METHODS
The proposed method utilizes MCNN to learn facial representations from a single-frame image. The representations, extracted from well-aligned multi-granularity facial patches, contain both detailed appearance information of the main parts and local-to-global constraints. Furthermore, our approach takes advantage of a deep LSTM network to explore the dynamical characteristics of the facial representations over sequential frames. The detailed structure of our LMDF, which combines MCNN and LSTMs, is shown in Fig. 2.

Well-aligned Multi-granularity Patches
It is well known that drowsy information is focused on several main facial parts such as eyes, nose, and mouth.
Alignment provides an excellent way to extract well-aligned features over frames, which effectively represent facial drowsy states. Besides, the global patch provides rough information for estimating the states of a driver's head and full face, which assists the decision on the driver's drowsy state when the locations of the parts are not precise. Our method takes advantage of local regions and the global face at the same time.
We utilize face alignment technology to locate facial shape points. Given an image $I_t$ with a face in the $t$-th frame, we detect the landmark points of the facial shape $S_t$ by regressing local binary features, as proposed by Ren et al. [48]. From those points, it is convenient to obtain the locations of the main parts and important local regions. According to the center points and specific sizes of all regions, we crop those patches from the original image and resize them to the same size; these are the well-aligned multi-granularity patches used as the input of the CNN.
Those patches, including local regions, main parts, and the global face, are produced by three different mappings. As shown in Fig. 3, a mapping $\Phi_p$ selects the center points of the eyes, nose, and mouth from the facial shape $S_t$ and crops patches of those parts from the input image $I_t$ with given sizes $s_p$. The mapping also converts the patches into a unified size $s_u$. Thus the single-granularity patches of the main parts, $\mathbf{I}_t^p$, are generated. The operations of the mappings $\Phi_l$ and $\Phi_g$ are similar to those of $\Phi_p$; the differences lie in the locations and sizes of the regions. The mapping $\Phi_l$ selects the corners of the eyes and mouth and the sides of the nose as regions of interest with size $s_l$ and outputs local patches $\mathbf{I}_t^l$. A global facial region with size $s_g$ is chosen by the mapping $\Phi_g$, which finally produces a global facial patch $\mathbf{I}_t^g$. Formally, the mappings are represented as
$$\mathbf{I}_t^k = \Phi_k(I_t, S_t; s_k, s_u), \quad k \in \{p, l, g\}.$$
Processing the input image $I_t$ with the three mappings, we obtain a set of well-aligned patches $\mathbf{I}_t$ consisting of the main parts, local regions, and global face, which is presented as
$$\mathbf{I}_t = \{\mathbf{I}_t^{p,:}, \mathbf{I}_t^{l,:}, \mathbf{I}_t^{g,:}\},$$
where $\mathbf{I}_t^{k,:}$, $k \in \{p, l, g\}$, represents all elements of the patch set $\mathbf{I}_t^k$.
Compared with the original image, the patch set $\mathbf{I}_t$, which includes both detailed appearance clues of the parts and rough information about the full face, is better suited to describing facial states. Meanwhile, the relations between local and global regions are implied, which are the basis for mining useful features. Therefore, we take the set of patches $\mathbf{I}_t$ as the input of the CNN to learn effective representations.
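The cropping step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the landmark coordinates, region names, crop sizes, and the 320 × 320 frame size are all hypothetical, and the final resize to a unified size is left to an image library.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def center_of(points: List[Point]) -> Point:
    """Mean position of a group of landmark points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def crop_box(center: Point, size: int, width: int, height: int) -> Box:
    """Square box of side `size` centred on `center`, shifted to stay inside
    a width x height image (assumes size <= width and size <= height)."""
    left = int(round(center[0])) - size // 2
    top = int(round(center[1])) - size // 2
    left = max(0, min(left, width - size))
    top = max(0, min(top, height - size))
    return (left, top, left + size, top + size)


# Hypothetical landmark groups and crop sizes for one 320x320 frame,
# one entry per granularity.
regions = [
    ("left_eye", [(100, 120), (120, 118), (140, 121)], 48),        # main part
    ("mouth_corner", [(130, 200)], 24),                            # local region
    ("face", [(60, 60), (240, 60), (60, 260), (240, 260)], 200),   # global face
]

boxes = {name: crop_box(center_of(pts), size, 320, 320)
         for name, pts, size in regions}
for name, box in boxes.items():
    print(name, box)
```

Each box would then be cropped from the frame and resized to the common input size before being passed to its CNN branch.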

Learning Facial Representations
Our approach learns representations with a convolutional neural network rather than hand-crafted descriptors, because of its strong performance in learning spatial features. We apply several convolutional layers to process each patch in the set $\mathbf{I}_t$ independently. To fuse the information of all patches, a fully connected layer is placed after all convolutional operations, which generates $d$-dimensional descriptors combining local and global clues.
Every patch is first processed by convolutional operations. For a patch $I_t^i$, the $i$-th one of the patch set $\mathbf{I}_t$ with length $N$, three convolutional layers are utilized to capture the spatial features, as shown in Fig. 4. The first layer consists of convolution and Rectified Linear Unit (ReLU) activation followed by a max-pooling operation, which projects a normalized 3-channel image to a higher-dimensional representation. Only convolution and ReLU activation are used in the second layer, to further enlarge the dimension of the representation. The structure of the third convolutional layer is similar to that of the first layer but with different parameters, in order to decrease the dimension. A representation $\mathbf{x}_t^i$ of the patch $I_t^i$ is generated by a mapping $\Phi_c$ consisting of those convolutional layers with parameters $\theta_c^i$, which is presented as
$$\mathbf{x}_t^i = \Phi_c(I_t^i; \theta_c^i),$$
where $\theta_c^i$ is the $i$-th element of the convolutional parameter set $\theta_c$. A fully connected layer is utilized to combine the representations extracted by the mapping $\Phi_c$ from the set of patches. Before the combining operation, we concatenate those representations into a long vector $\mathbf{x}_t^c$, formed as
$$\mathbf{x}_t^c = [\mathbf{x}_t^1; \mathbf{x}_t^2; \dots; \mathbf{x}_t^N].$$
With a weight matrix $\mathbf{W}_f$ and a bias vector $\mathbf{b}_f$, the combined $d$-dimensional representation $\mathbf{x}_t$ is given by the fully connected layer as
$$\mathbf{x}_t = \max(\mathbf{0}, \mathbf{W}_f \mathbf{x}_t^c + \mathbf{b}_f),$$
in which $\mathbf{0}$ is a zero vector.
The descriptor $\mathbf{x}_t$ contains not only the detailed appearance information implied in every part, but also the constraint relations between local regions and the global face. The effectiveness of the descriptor can be improved by appropriate objective functions and proper training methods.
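The fusion step, in which per-patch representations are concatenated and passed through a fully connected layer with a ReLU, can be illustrated with a toy example. The sketch below uses plain Python lists and made-up weights purely to show the arithmetic; the function names are hypothetical and the dimensions are far smaller than in the actual network.

```python
from typing import List


def concat(features: List[List[float]]) -> List[float]:
    """Concatenate per-patch feature vectors into one long vector."""
    return [value for feature in features for value in feature]


def fully_connected_relu(x: List[float],
                         weights: List[List[float]],
                         bias: List[float]) -> List[float]:
    """Compute max(0, W x + b), one output unit per weight row."""
    out = []
    for row, b in zip(weights, bias):
        z = sum(w * v for w, v in zip(row, x)) + b
        out.append(max(0.0, z))
    return out


# Two toy patch representations (2 values each) fused into a 2-d descriptor.
patch_features = [[1.0, -2.0], [0.5, 3.0]]
long_vector = concat(patch_features)          # [1.0, -2.0, 0.5, 3.0]
weights = [[1.0, 0.0, 0.0, 1.0],
           [0.0, 1.0, -1.0, 0.0]]
bias = [0.5, -1.0]

descriptor = fully_connected_relu(long_vector, weights, bias)
print(descriptor)  # [4.5, 0.0]
```

The second unit is clipped to zero by the ReLU, which is the role of the zero vector in the max operation.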
Driver drowsiness detection is a binary classification problem: the state of an input frame is either drowsy or not. We label drowsiness with 1 as the positive sample and the normal state with 0 as the negative sample. A label $y_t$ is expressed as a one-hot vector $\mathbf{y}_t$; for example, the vector [0, 1] denotes the positive label.
To train the parameters of the convolutional neural network, we project the representation $\mathbf{x}_t$ into the probabilities of each category $c \in \{0, 1\}$ by another fully connected layer with weights $\mathbf{W}_s$ and a bias vector $\mathbf{b}_s$, and the probability vector $p(c \mid \mathbf{x}_t, \mathbf{W}_s, \mathbf{b}_s)$ is normalized via a softmax layer. The cross entropy, which measures how well the predicted probabilities match the labels, is selected as the objective function, and we utilize the Adam optimizer to train the whole CNN. The visual representations are then generated by the convolutional layers and the first fully connected layer.
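As a small worked example of the classification head, the following sketch computes a softmax over two logits, the cross-entropy loss against a one-hot label, and the predicted class. The logit values are arbitrary and the helper names are illustrative only.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], one_hot: List[int]) -> float:
    """Cross entropy between predicted probabilities and a one-hot label."""
    return -sum(y * math.log(p) for p, y in zip(probs, one_hot) if y > 0)


def predict(probs: List[float]) -> int:
    """Index of the class with the maximum probability."""
    return max(range(len(probs)), key=probs.__getitem__)


logits = [0.2, 1.5]          # arbitrary scores for [normal, drowsy]
label = [0, 1]               # one-hot vector for the drowsy (positive) class
probs = softmax(logits)
print(probs, cross_entropy(probs, label), predict(probs))
```

With equal logits the two probabilities are 0.5 each and the loss equals ln 2, which is a convenient sanity check for an untrained binary classifier.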

Exploring Dynamical Characteristics
The representation $\mathbf{x}_t$ is extracted from a single frame, whereas whether a driver is drowsy is judged over a certain period. We apply LSTMs to model the temporal dynamical characteristics of the spatial representations for driver drowsiness detection. An LSTM block consists of an input gate, a forget gate, an output gate, and a memory cell. Because of the three gates, the LSTM block can learn long-term dependencies in sequential data, and its parameters are easier to train. The memory cell stores long-term information in its vector, which can be overwritten or otherwise updated for the next time step. Besides, the number of hidden units should be chosen according to the dimension of the input representation $\mathbf{x}_t$.
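For reference, the gating just described is commonly written as below. This is the standard LSTM formulation, given here as an assumption because the text does not state which variant is used; $\mathbf{x}_t$ is the input, $\mathbf{h}_t$ the hidden state, $\mathbf{c}_t$ the memory cell, $\sigma$ the logistic sigmoid, $\odot$ the element-wise product, and the $\mathbf{W}$, $\mathbf{U}$, $\mathbf{b}$ terms are learned parameters.

```latex
\begin{aligned}
\mathbf{i}_t &= \sigma(\mathbf{W}_i \mathbf{x}_t + \mathbf{U}_i \mathbf{h}_{t-1} + \mathbf{b}_i) && \text{(input gate)} \\
\mathbf{f}_t &= \sigma(\mathbf{W}_f \mathbf{x}_t + \mathbf{U}_f \mathbf{h}_{t-1} + \mathbf{b}_f) && \text{(forget gate)} \\
\mathbf{o}_t &= \sigma(\mathbf{W}_o \mathbf{x}_t + \mathbf{U}_o \mathbf{h}_{t-1} + \mathbf{b}_o) && \text{(output gate)} \\
\tilde{\mathbf{c}}_t &= \tanh(\mathbf{W}_c \mathbf{x}_t + \mathbf{U}_c \mathbf{h}_{t-1} + \mathbf{b}_c) && \text{(candidate memory)} \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t && \text{(memory cell update)} \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t) && \text{(hidden state)}
\end{aligned}
```

The additive cell update is what lets information persist over many frames, which matters when separating a brief blink from a sustained eye closure.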
We employ multi-layer LSTMs to mine the temporal features of driver drowsiness. A mapping $\Phi_r$ containing three LSTM layers with parameters $\theta_r$ is utilized to explore the temporal clues of the representation $\mathbf{x}_t$ generated by the MCNN extractor, and it outputs the hidden state $\mathbf{h}_t^3$ of the third layer as a representation containing temporal dependencies, which is presented as
$$\mathbf{h}_t^3 = \Phi_r(\mathbf{x}_t; \theta_r, \Theta_{t-1}),$$
where $\Theta_{t-1}$ is the parameter set of these LSTM blocks at the previous step.
A fully connected layer with weights $\mathbf{W}_o$ and a bias vector $\mathbf{b}_o$ is used to project the output of the mapping $\Phi_r$ into a two-dimensional vector, which is then decoded by a softmax operation into the probabilities $p(c \mid \mathbf{h}_t^3, \mathbf{W}_o, \mathbf{b}_o)$ of the two categories. To solve for the parameters, we use the Adam optimizer to train the LSTMs with the cross-entropy objective function.
The label $\hat{y}_t$ of the current frame is predicted as the class with the maximum probability:
$$\hat{y}_t = \arg\max_{c \in \{0, 1\}} p(c \mid \mathbf{h}_t^3, \mathbf{W}_o, \mathbf{b}_o).$$
Similarly, the labels $\mathbf{y}$ of the sequential data can be obtained.

EXPERIMENTS
The NTHU-DDD dataset was provided for the driver drowsiness detection challenge of the ACCV 2016 workshop, and we compare our approach with others on it. To bring the sequential labels closer to practical driving environments, we relabel the video set according to an instant-detection principle. A new dataset, called FI-DDD, is generated from the relabeled video set; on it we learn parameters and analyze the performance of several subnetworks. The performance of our entire approach, however, is evaluated on the original NTHU-DDD dataset, so we train a separate set of parameters to match its long-term memory labels. Finally, our LMDF obtains an accuracy of 90.05% on the evaluation set of the NTHU-DDD dataset, and the proposed method achieves about 37 FPS on a Tesla M40 GPU.

NTHU-DDD Dataset:
The NTHU-DDD dataset includes five scenarios: glasses, no glasses, glasses at night, no glasses at night, and sunglasses. The training set involves 18 volunteers, 10 men and 8 women, who act as drivers with four different states in every scenario, while the evaluation set has four volunteers, two men and two women. Non-sleepy videos contain only the normal state, while sleepy videos combine normal and drowsy states. Besides, the blinking-with-nodding videos and the yawning videos record only drowsy eyes and drowsy mouth, respectively. The NTHU-DDD dataset offers four annotation files for every video, recording the states of drowsiness, eyes, head, and mouth. Table 1 gives the labels of drowsiness and the three main parts. It is worth emphasizing that the labels of the NTHU-DDD dataset have long-term memory, which means that the state of a frame may depend on the frames of the previous several seconds.

FI-DDD Dataset:
The long-term memory in NTHU-DDD causes a problem: a driver would still receive warning prompts even after having returned from the drowsy state to the normal state for a few seconds. At the same time, those labels cannot locate the drowsy states with high precision in the temporal dimension. To solve these problems, we relabel the videos according to an instant principle, which means that the latency is limited to 0.5 seconds, namely 15 frames for 30 FPS videos. Typical states, such as closing eyes, yawning, and lowering the head, are still considered as evidence for judging whether a frame is drowsy. The videos are cut into several clips that alternately contain only the drowsy or only the normal state, according to our labels. To describe the transitional states between normal and drowsy, we reserve ten normal frames at the head and the tail of every drowsy clip. We name the relabeled dataset FI-DDD; it includes 14 drivers in the training set and 4 in the test set. As shown in

Implementation Details
Face Alignment: We apply face alignment technology to locate the facial shape points in all videos. Face detection and tracking are combined to increase the detection rate and provide more accurate face positions in the videos. The face alignment algorithm operates on those face positions. The face detector is from OpenCV, and the face tracking approach is the one proposed by Danelljan et al. [49]. We implement the method of Ren et al. [48], retrain the model, and preprocess all videos to obtain 51 landmark points for every frame. Frames with no face are treated as empty, and their landmark points are filled with zero coordinates. To compare with previous methods, we evaluate the proposed method on the evaluation set of the NTHU-DDD dataset.

Experimental Analysis
As shown in Fig. 6, the spatial and temporal features are extracted to detect driver drowsiness. The detection result for a driver under normal conditions is shown in Fig. 6(a), and the detection result for a drowsy driver is shown in Fig. 6(b), with the detected features marked in red. The experiments include two kinds of videos with different camera positions, and the proposed method works effectively on both.
To further explain the effects of alignment, multi-granularity, and the CNN extractor, several groups of experiments are conducted on the static image set. We also provide experiments on the FI-DDD dataset to verify the effectiveness of LSTMs for detecting drowsiness in video.

The Importance of Alignment
It is essential to carry out experiments to explain the significance of alignment and the effects of locating precision.

Non-alignment vs. With Alignment:
We provide two additional non-alignment methods for sampling the multi-granularity patches within the facial bounding box: Uniform Sampling (US) and Specific Sampling (SS). The patch sizes used by our Aligned Sampling (AS) method and by the two non-alignment methods are the same. Fig. 7 (left) shows the comparison of AS, US, and SS. AS, which uses alignment, achieves the best accuracy of 87.4% on the test set of the static image set, which is 4.9% higher than the SS method and 6.2% higher than the US method. In conclusion, aligning the facial patches, and thus providing aligned representations, is an effective way to improve the accuracy of driver drowsiness detection.

Effects of alignment precision:
We evaluate the effects of alignment precision and study its influence quantitatively by adding random noise with Gaussian distribution $\mathcal{N}(0, \sigma)$ to the well-aligned facial points. Fig. 7 (right) shows the results on the test set of the static image set, from which we find that the accuracy decreases as the standard deviation of the noise increases, and even falls below 80% if $\sigma \ge 10$ px. Since the accuracy remains above 83% with $\sigma$ less than 5 px, we conclude that the proposed MCNN is robust to corrupted locations if $\sigma \le 5$ px.
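The perturbation used in this experiment can be sketched as follows. This is an illustrative version that assumes independent zero-mean Gaussian noise on each coordinate, with the noise parameter read as a standard deviation in pixels; the function name and the sample landmarks are hypothetical.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def jitter_landmarks(points: List[Point], sigma: float,
                     rng: random.Random) -> List[Point]:
    """Add independent zero-mean Gaussian noise (std `sigma`, in pixels)
    to the x and y coordinate of every landmark."""
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
            for x, y in points]


rng = random.Random(0)                          # fixed seed for repeatability
landmarks = [(100.0, 120.0), (140.0, 121.0), (150.0, 200.0)]

for sigma in (0.0, 5.0, 10.0):
    noisy = jitter_landmarks(landmarks, sigma, rng)
    shift = max(abs(nx - x) + abs(ny - y)
                for (x, y), (nx, ny) in zip(landmarks, noisy))
    print(f"sigma={sigma:>4} px  largest L1 shift: {shift:.2f} px")
```

The jittered points would then drive the same patch-cropping step as the clean ones, so any accuracy drop reflects sensitivity to localization error alone.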

The Effects of Multi-granularity Patches
Multi-granularity patches consist of local regions, main parts, and the global face. It is important to conduct experiments that explain the importance of those granularities for driver drowsiness detection. We apply a fully connected layer and a softmax operation to classify the representations produced by the MCNN extractor, and analyze the effects of the granularities, positions, and sizes of the patches.

Learning curve on different granularities:
We take four different granularities into account to analyze the effects of multi-granularity facial patches: local regions, main parts, the global face, and the combination of all three. Fig. 8 compares those granularities. The method with global face granularity converges the slowest, and that with local regions converges the fastest, while the multi-granularity method achieves good performance in both convergence speed and accuracy. Aligned points achieve higher precision on local regions with abundant boundary texture, which results in better-aligned representations that are easier to classify. Nevertheless, multi-granularity patches, which contain more aligned information, are more effective for driver drowsiness detection.

Effects of positions and sizes:
We change the positions and the sizes of the facial patches separately. As shown in Fig. 9 (left), the facial main parts, including the eyes, nose, and mouth, obtain the best accuracy of 83.6% among the single-granularity methods. The combination of the three granularities achieves the best accuracy overall, 87.4%.
We conclude that the most effective single-granularity representation is extracted from the three main facial parts, while the fusion of local and global clues is an excellent way to obtain better facial representations.
To study the difference between single-size and multi-granularity methods, we set all patches to the same size and vary that size, keeping the locations of the patches fixed. Fig. 9 (right) shows that regions with different sizes achieve 2.3% higher accuracy than single-size patches. This is because different physiological parts are of different sizes; e.g., the global face is bigger than a single eye. The above analysis shows that the multi-granularity method is an effective way to represent facial features.

The Parameters Selection of MCNN Extractor
The structure parameters of the convolutional layers are listed in Table 2. A patch of size 64 × 64 processed by those convolutional layers is projected to a tensor of size 16 × 16 × 4. A representation of the patch is generated by reshaping the tensor into a 1024-dimensional vector, which is the input of a fully connected layer. The fully connected layer combines the multi-granularity clues and generates the MCNN representations. The number of its hidden units, namely the dimension of the representation, affects the combination of those patches. By changing the number of hidden units, we explore the relation between the dimension of the MCNN representations and the classification accuracy with well-aligned multi-granularity facial patches. The comparison of different dimensions is shown in Fig. 10, which indicates that the number of dimensions has almost no influence on the convergence speed. However, 256-dimensional representations achieve the highest accuracy. Therefore, it is reasonable for us to choose 256 hidden units.
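The 64 × 64 to 16 × 16 × 4 reduction can be checked with a few lines of arithmetic. The sketch below assumes that the convolutions preserve the spatial size and that the first and third layers each apply a 2 × 2 max pooling with stride 2; these settings are assumptions chosen to reproduce the stated sizes, since the Table 2 values are not reproduced in the text.

```python
from typing import Tuple


def mcnn_patch_feature_shape(side: int = 64,
                             pool_factors: Tuple[int, ...] = (2, 1, 2),
                             out_channels: int = 4):
    """Track the spatial side length through three conv layers.

    Each entry of `pool_factors` is the down-sampling factor of one layer
    (1 means no pooling). Returns the output tensor shape and the length
    of the flattened vector.
    """
    for factor in pool_factors:
        side //= factor
    return (side, side, out_channels), side * side * out_channels


shape, length = mcnn_patch_feature_shape()
print(shape, length)  # (16, 16, 4) 1024
```

With N patches per frame, the concatenated input to the fusion layer would therefore have N × 1024 entries.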

The Significance of LSTMs
We first apply MCNN alone to detect driver drowsiness in video, but it has no capacity to capture temporal clues.
MCNN+LSTMs is designed to address this drawback. To understand the effects of the LSTMs, it is necessary to compare the setting with LSTMs [50] against that without LSTMs. All experiments in this part are carried out on the FI-DDD dataset in day-time scenarios. The parameter settings and adjustments follow those in [51].

Parameters setting:
The representations given by the MCNN extractor are 256-dimensional, and the number of hidden units in each LSTM block is also 256. The forget gate is enabled and the max memory step is set to 60 frames. We randomly select batches of 1000 samples to train the LSTM parameters with a learning rate of 3e-4. The fully connected layer projects the states of the last LSTM block to a 2-dimensional vector, which is decoded into the probability of drowsiness by a softmax operation.
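One way to read the "max memory step" is as a cap on how many consecutive frames are unrolled through the LSTMs at once. Under that assumption (the text does not spell out how sequences are cut), a clip of per-frame descriptors could be split as in this sketch; the function name is hypothetical.

```python
from typing import List, Sequence


def split_into_windows(frames: Sequence, max_steps: int = 60) -> List[Sequence]:
    """Cut a sequence of per-frame descriptors into consecutive windows
    of at most `max_steps` frames (the last window may be shorter)."""
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    return [frames[i:i + max_steps] for i in range(0, len(frames), max_steps)]


clip = list(range(150))            # stand-in for 150 per-frame descriptors
windows = split_into_windows(clip, max_steps=60)
print([len(w) for w in windows])   # [60, 60, 30]
```

At 30 FPS, a 60-frame window spans 2 seconds of video.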

MCNN-Only vs. MCNN+LSTMs:
The experiments are carried out on four different granularities to study the effects of multi-granularity and LSTMs. Fig. 11 shows the accuracy of MCNN-Only and MCNN+LSTMs for detecting drowsiness in the test-set videos under different granularities. The MCNN-Only method obtains 72.7% accuracy, while the accuracy achieved by MCNN+LSTMs is 15.6% higher. The reason is that the LSTMs are able to mine clues in the temporal dimension, which is important for recognizing many ambiguous states, such as closing eyes and blinking. Comparing the accuracies of different granularities, we find that the well-aligned multi-granularity facial patches still achieve the best performance. The accuracy of the main parts ranks second, which means that, of the three single granularities, the main parts play the most important role in improving effectiveness.

Comparisons with the Previous Methods
We evaluate the whole method on the evaluation set and compare it with the previous methods [11,13,25,52,53] evaluated on the same dataset. Due to the long-term memory characteristics of the NTHU-DDD dataset, the maximum memory length is set to 120 frames and the other parameters are kept the same as in the above experiments. For night scenarios in particular, we retrain a model with the night data of NTHU-DDD to detect driver drowsiness on near-infrared videos.
Accuracy: Table 3 compares our method with the previous work [11,13,25,52,53]. The proposed method achieves 90.05% accuracy, a significant improvement over the other existing methods, making it the state-of-the-art method for driver drowsiness detection.
Table 3. Comparison of different methods on the evaluation set of the NTHU-DDD dataset, with detailed information on the environments.

| Method | Platform | Spatial Features | Sequential Features | Speed | Accuracy |
|---|---|---|---|---|---|
| Yu et al. [11] | GPU | 3D-DCNN | feature fusion | 24∼32 fps | 72.60% |
| Park et al. [25] | - | DDD Network | SVM | - | 73.06% |
| Yu et al. [52] | GPU | 3D-DCNN | feature fusion | 38.1 fps | 76.2% |
| Wang et al. [53] | - | CNN | LSTMs | 40.64 fps | 82.8% |
| MSTN [13] | - | CNN | LSTMs | 60 fps | 85.52% |
| Ours | GPU-M40 | MCNN | LSTMs | 37 fps | 90.05% |

Speed: Table 3 also compares the speed of our method with that of other existing methods. The proposed method achieves 37 FPS on the GPU platform, which satisfies real-time requirements. According to Table 3, this is faster than Yu et al. [11] and close to Yu et al. [52] (38.1 fps), while Wang et al. [53] and MSTN [13] report higher speeds. However, our method is significantly more accurate than all of these methods. We also measure the time consumption of each module of the proposed method. As shown in Table 4, the CNN costs the most time, and the approach achieves about 3 FPS on the CPU platform.
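To relate the reported frame rates to a per-frame time budget, the following sketch converts FPS to latency. The helper name is hypothetical; the two rates are the GPU and CPU figures quoted in the text.

```python
def frame_latency_ms(fps):
    """Average processing time per frame, in milliseconds, for a given frame rate."""
    return 1000.0 / fps


# Frame rates reported in the text for the proposed method.
for platform, fps in (("GPU", 37.0), ("CPU", 3.0)):
    print(f"{platform}: {fps:g} FPS -> {frame_latency_ms(fps):.1f} ms per frame")
```

At 37 FPS the whole pipeline has roughly 27 ms per frame, whereas 3 FPS on CPU corresponds to roughly a third of a second per frame.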

CONCLUSIONS
This paper proposes a novel approach that integrates a memory mechanism into a multi-granularity deep framework to detect driver drowsiness, so that the temporal dependencies over sequential frames are integrated with the spatial deep learning framework on frontal faces. First, the spatial MCNN is designed to apply a group of parallel CNN extractors to well-aligned facial patches of different granularities, and to extract facial representations that remain effective under large variation of head pose. Second, the memory mechanism is set up with a deep LSTM network on the facial representations to explore long-term relationships of variable length over sequential frames, which is capable of distinguishing states with temporal dependencies. The proposed method is evaluated on the NTHU-DDD dataset and achieves 90.05% accuracy at about 37 FPS, making it the state-of-the-art method for driver drowsiness detection. Moreover, a new dataset named FI-DDD is built with higher precision of drowsy locations in the temporal dimension. This dataset performs well for training model parameters and analyzing the effects of several factors, and will be made publicly available to speed up research.
Although our method has achieved good performance on existing datasets, there are still complex conditions and uncertain factors in real-world scenarios, such as significant changes in lighting and occlusion of the driver's face.
In future research, we will consider applying the method proposed in this paper to real-world scenarios and continue to explore how to improve the generalization of the method under complex conditions.

DECLARATIONS
Authors' contributions
Made substantial contributions to conception and design of the study and performed data analysis and interpretation, then wrote the paper: Handan Zhang, Tie Liu, Jie Lyu, Dapeng Chen, Zejian Yuan.

Figure 1. The examples of driver drowsiness. (a) The normal status of driver. (b) The driver drowsiness status. The spatial and temporal features are extracted and concentrated to detect the driver drowsiness.

Figure 2. The long-term multi-granularity deep framework for driver drowsiness detection. The first stage extracts well-aligned multi-granularity patches, which consist of local regions, main parts, and the global face. In the second stage, parallel convolutional layers process the patches separately, and a fully connected layer fuses the local and global clues and generates a representation. The first two stages constitute the MCNN. A Recurrent Neural Network (RNN) with multiple LSTM blocks, which mines the clues in the temporal dimension, forms the third stage together with a fully connected layer.

Figure 3. The procedure of extracting multi-granularity facial patches at three granularities: main parts, local regions, and the global face.

Figure 4. The three layers that capture the spatial features. The first layer projects a normalized 3-channel image onto a higher-dimensional representation; the second layer enlarges the dimension of the representation; the third layer decreases the dimension, with parameters different from those of the first layer.

In daytime scenarios, the train set of FI-DDD has 157 clips and the test set has 92 clips, while in night scenarios, the train set has 126 clips and the test set has 75 clips, with about 530 frames on average.

Figure 5. FI-DDD Dataset. (a) A sample of the normal state of the participant. (b) The sample of different individuals; every participant sits in a static vehicle when they record their videos. (c) The different states of a participant's facial performance: eye closing, nodding, speaking, and laughing.
Multi-granularity: We obtain multi-granularity patches considering two factors: different positions and different sizes. We choose 15 positions from the facial shape points, which are divided into three granularities: 1 global face of size 160 × 160, 4 main parts of size 64 × 64, and 10 local regions of size 32 × 32. The specific locations of all patches are shown in Fig. 3. Before being sent to the CNN, those patches are resized to 64 × 64, normalized to [-0.5, 0.5], and converted to 3 channels to ensure that our framework can process RGB data.

Dataset Usage: A static image set, required for training the CNN parameters, is sampled from the videos of FI-DDD with a specific frame interval. Since the result of the CNN is directly related to the multi-granularity patches and the CNN parameters, we analyze the effects of those factors on the static image set. All experiments for analyzing the effects of the LSTM parameters, in contrast, are carried out on the FI-DDD dataset.
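The patch preparation described above (resize to 64 × 64, normalize to [-0.5, 0.5], convert to 3 channels) can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes square single-channel patches with 8-bit pixel values, nearest-neighbor resizing, and the mapping p / 255 - 0.5, none of which is specified in the text, and all names are hypothetical.

```python
# (count, side length in pixels) for each granularity, as given in the text.
PATCH_SPEC = {
    "global face": (1, 160),
    "main parts": (4, 64),
    "local regions": (10, 32),
}
TARGET_SIZE = 64


def resize_nearest(patch, size):
    """Nearest-neighbor resize of a square 2-D list to size x size (assumed method)."""
    src = len(patch)
    return [[patch[r * src // size][c * src // size] for c in range(size)]
            for r in range(size)]


def normalize(patch):
    """Map 8-bit pixel values in [0, 255] to [-0.5, 0.5] (assumed mapping)."""
    return [[p / 255.0 - 0.5 for p in row] for row in patch]


def to_three_channels(patch):
    """Replicate a single-channel patch into 3 identical channels."""
    return [[[p, p, p] for p in row] for row in patch]


def preprocess(patch):
    return to_three_channels(normalize(resize_nearest(patch, TARGET_SIZE)))


total_patches = sum(count for count, _ in PATCH_SPEC.values())
print("patches per face:", total_patches)

# Synthetic 32 x 32 local-region patch with a simple gradient pattern.
local_patch = [[(r * 8 + c) % 256 for c in range(32)] for r in range(32)]
out = preprocess(local_patch)
print("output shape:", len(out), len(out[0]), len(out[0][0]))
```

Each of the 15 patches then has the same 64 × 64 × 3 shape, which is what allows the parallel convolutional extractors to share one input size.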

Figure 6. The examples of driver drowsiness detection. (a) The normal status of driver. (b) The detection result of driver drowsiness, where the detected features are marked in red.

Figure 8. The comparison of different granularities: global face, main parts, local regions, and multi-granularity patches. The curves of accuracy over training times are obtained by the CNN with different granularities on the test set of the static image set.

Figure 9. Left: The comparison of patches with different positions: GF, the global face; MP, main parts (eyes, nose, and mouth); and LR, local regions (the corners of the eyes, the sides of the nose, and the boundary of the mouth). Right: The comparison of patches with different sizes at all locations. Mg represents multi-granularity patches.