Weather recognition plays an important role in our daily lives and many computer vision applications. However, recognizing weather conditions from a single image remains challenging and has not been studied thoroughly. Generally, most previous works treat weather recognition as a single-label classification task, namely, determining whether an image belongs to a specific weather class or not. This treatment is not always appropriate, since more than one weather condition may appear simultaneously in a single image. To address this problem, we make the first attempt to view weather recognition as a multi-label classification task, i.e., assigning an image more than one label according to the displayed weather conditions. Specifically, a CNN–RNN based multi-label classification approach is proposed in this paper. The convolutional neural network (CNN) is extended with a channel-wise attention model to extract the most correlated visual features. The recurrent neural network (RNN) further processes the features and excavates the dependencies among weather classes. Finally, the weather labels are predicted step by step. Besides, we construct two datasets for the weather recognition task and explore the relationships among different weather conditions. Experimental results demonstrate the superiority and effectiveness of the proposed approach. The newly constructed datasets will be made available on the project website.
1. Introduction
Weather conditions influence our daily lives and production in many ways [1], such as dressing, traveling, solar technologies and so on. Therefore, acquiring weather conditions automatically is important to a variety of human activities. A possible solution to weather recognition is utilizing various kinds of hardware. However, such equipment is usually expensive and needs professionals to maintain. An alternative scheme is to recognize weather conditions from color images using computer vision techniques [2,3]. Nowadays, surveillance cameras are ubiquitous, which makes the computer vision solution feasible. Apart from the guiding significance to our daily lives, weather recognition is also an important function for many other computer vision applications [4–7], such as image retrieval [8], image restoration [9], and the reliability improvement of outdoor surveillance systems [3]. Robotic vision [10,11] and vehicle assistant driving systems [12,13] can also benefit from the results of weather recognition. Thus, weather recognition from outdoor images has great research significance.
1.1. Motivation and overview
Although weather recognition is of remarkable value, only a few studies have been published to tackle this problem. Several previous works [12,14–16] concentrated on recognizing weather conditions from images captured by in-vehicle cameras. Several other papers [1,17,18] exploited weather recognition from single outdoor images. All of these works treated weather recognition as a single-label classification task (a weather label means a weather category in this paper), namely, determining whether an image belongs to a specific weather category or not.
However, it is not always appropriate to view weather recognition as a single-label classification problem, for mainly two reasons. The first can be summarized as uncertainty, i.e., the class boundaries among some weather categories are essentially ambiguous. As can be seen from Fig. 1, the changes from Fig. 1(a) to (f) demonstrate that there is a series of states between pure sunny weather (like Fig. 1(a)) and obviously cloudy weather (as illustrated in Fig. 1(f)). It is hard to determine whether the category is sunny or cloudy when referring to an intermediate weather state like Fig. 1(c), (d) and (e) [2]. Thus, the uncertainty of such boundary samples makes it difficult to determine ground-truth labels even for human beings, and few previous works present solutions to this problem. The second drawback of treating weather recognition as a single-label classification task can be summarized as incompleteness, namely, a single weather label may not describe the weather conditions of a given image comprehensively. For example, the visual effect of haze is obvious in Fig. 1(g), (h) and (i). Nevertheless, the comparison among these three images shows that Fig. 1(g) seems more sunny, Fig. 1(h) seems more overcast, and Fig. 1(i) seems snowy. Therefore, a haze label alone cannot reveal the differences among these three images.
Motivated by the aforementioned two reasons, we propose to view weather recognition as a multi-label classification problem, i.e., assigning multiple labels to an image according to the displayed weather conditions. Specifically, it is achieved by a CNN–RNN architecture. The intuition lies in two aspects. On one hand, most previous works focused on exploiting hand-crafted weather features [1,20], while these features did not achieve the desired results in the weather recognition task. Inspired by the great success of Convolutional Neural Networks (CNNs) in recent years, we utilize a CNN as the weather feature extractor. On the other hand, labels exhibit strong co-occurrence dependencies in the weather domain. For example, snowy and cloudy usually occur together, while rainy and sunny almost never co-occur. Inspired by the success of Recurrent Neural Networks (RNNs) in dependency modeling [21,22], we propose to use an RNN to model the dependencies among labels and predict the weather labels step by step. In such a way, when predicting subsequent labels, the network can refer to the previous hidden states that implicitly incorporate the historical information.
For weather recognition, different image regions have different importance when predicting labels. As shown in Fig. 2, the blue sky is crucial for judging a sunny day, and snow on the ground is significant for estimating snowy weather. Lu et al. [2] also emphasized that such weather cues are critical. Therefore, it is necessary to make the weather cues discriminative and preserve the spatial information of the image. To achieve this goal, a channel-wise attention model is designed to exploit more discriminative features for the weather recognition task. Besides, we use a convolutional Long Short-Term Memory (LSTM) [23] instead of a vanilla RNN in our CNN–RNN architecture to preserve the spatial information. The convolutional LSTM uses convolution operations in both state-to-state and input-to-state transformations, which captures spatio-temporal information better than the fully connected LSTM (FC-LSTM) [23].
In addition, considering the lack of datasets for the weather recognition task, two new datasets are constructed in this paper. The first consists of about 8K images from seven weather categories and is transformed from an existing transient attribute dataset [19]. The second is built from scratch and contains 10K images from five weather categories.
1.2. Contributions
In summary, there are three main contributions of this work:
(1) We propose to treat weather recognition as a multi-label classification task by analyzing the drawbacks of classifying images with a single weather label and the co-occurrence relationships among different weather conditions.
(2) We present a CNN–RNN architecture to tackle the multi-label weather classification task. It is composed of a CNN to extract features, a channel-wise attention model to recalibrate feature responses, and a convolutional LSTM to model the relationships among different weather labels.
(3) We build a new multi-label weather classification dataset and transform an existing transient attribute dataset [19] for the weather recognition task. The datasets will be available on the project website.
1.3. Organization
The remainder of this paper is organized as follows: In Section 2, related works on weather recognition are reviewed. In Section 3, we describe the proposed approach in detail. In Section 4, we first present the construction of the new multi-label weather image dataset and the modification of the transient attribute dataset [19], and then analyze the experimental results on these two datasets. In Section 5, the conclusion of this paper is drawn.
2. Related work
We roughly classify weather recognition works into two subcategories in this paper. One category focuses on designing hand-crafted weather features, and the other attempts to use CNNs to solve the weather recognition task.
2.1. Weather recognition with hand-crafted features
Many vehicle assistant driving systems use weather recognition to improve road safety. For example, they can set speed limits in extreme weather conditions, automatically activate the wipers on a rainy day and so forth. Hand-crafted features are popular in these works. Kurihata et al. [12,24] proposed that raindrops are strong cues for the presence of rainy weather and developed a rain feature to detect raindrops on the windshield. Roser et al. [15] defined several regions of interest (ROI) and developed various types of histogram features for rainy weather recognition. Yan et al. [13] utilized a gradient amplitude histogram, an HSV color histogram as well as road information for the classification task among sunny, cloudy and rainy categories. Besides, several methods were proposed specially for fog detection. Hautière et al. [14] used Koschmieder's Law [25] to detect the presence of fog and estimate the visibility distance. Bronte et al. [26] utilized many techniques, including a Sobel based sunny-foggy detector, edge binarization, Hough line detection, vanishing point detection and road/sky segmentation. Gallen et al. [27] focused on night fog detection by detecting the backscattered veil caused by the vehicle ego lights or the halos around street lights. Pavlić et al. [16,28] transformed images into the frequency domain and detected the presence of fog by training Gabor filters of different scales and orientations in the power spectrum. Although the aforementioned approaches have shown good performance, they are usually limited to the in-vehicle perspective and cannot be applied to a wider range of applications.
There are also several studies devoted to weather recognition from common outdoor images. Li et al. [29] proposed a photometric stereo-based approach to estimate the weather condition of a given site. Zhao et al. [9] pointed out that pixel-wise intensities of dynamic weather conditions (rainy, snowy, etc.) fluctuate over time while those of static weather conditions (sunny, foggy, etc.) stay almost unchanged. They proposed a two-stage classification scheme which first distinguishes between the two conditions and then utilizes several spatio-temporal and chromatic features to further estimate the weather category. In [17], several global features were extracted for weather classification, such as inflection point information, power spectral slope, edge gradient energy, saturation, contrast and image noise. Li et al. [18] also utilized several features from [17], and constructed a decision tree according to the distance between features. Beyond regular global features, [1] proposed multiple weather cues including reflection, shadow and a sky descriptor for two-class weather recognition. They also exploited a collaborative learning strategy in which voters closer to the test image have higher weights. Zhang et al. [20,30] proposed a sunny feature, a rainy feature, a snowy feature and a haze feature individually for each weather class as well as two global features. Furthermore, a multiple kernel learning approach was proposed in [30] to fuse these features. In [31], both spatial appearance and temporal dynamics were investigated on short video clips to recognize several weather types.
Although researchers have elaborately designed many features for weather recognition, these features are usually limited to specific perspectives or weather classes, and cannot be applied to a wider range of applications.
2.2. Weather recognition with CNNs
In recent years, convolutional neural networks have shown overwhelming performance in a variety of computer vision tasks, such as image classification [32], object detection [33], semantic segmentation [34], etc. Several excellent CNN architectures have been proposed, including AlexNet [32], VGGNet [35] and ResNet [36], which outperform traditional approaches to a large extent. Inspired by the great success of CNNs, a few works have attempted to apply CNNs to the weather recognition task. Elhoseiny et al. [3] directly fine-tuned AlexNet [32] on a two-class weather classification dataset released by Lu et al. [1], and achieved a better result. Lu et al. [2] combined hand-crafted weather features with CNN-extracted features, and further improved the classification performance. However, as discussed in [2], there are no closed boundaries among weather classes, and multiple weather conditions may appear simultaneously. Therefore, all the above approaches suffer from information loss when they treat weather recognition as a single-label classification problem. Li et al. [37] proposed to use auxiliary semantic segmentation of weather cues to comprehensively describe the weather conditions. This strategy can alleviate the problem of information loss, but the segmentation mask is not intuitive for humans.
3. Our approach
In this paper, to comprehensively describe the weather conditions, we propose to treat weather recognition as a multi-label classification problem. Furthermore, a CNN–RNN model is developed for this task, which formulates the multi-label classification as a step-wise prediction. Fig. 3 shows the architecture of the proposed approach. It is mainly composed of three parts, i.e., the basic CNN, a channel-wise attention model and a convolutional LSTM. The CNN extracts the preliminary features of a given outdoor image. Specifically, the first five groups of convolutional/pooling layers of VGGNet [35] are utilized in this paper. The channel-wise attention model adaptively calculates the channel-wise attention weights and recalibrates the feature responses. The convolutional LSTM uses the visual features and the hidden state to predict the weather labels one by one, which implicitly models the co-occurrence dependencies among labels by maintaining context information in its internal memory states.
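For orientation, the following is a hedged end-to-end sketch of this three-part pipeline in Keras. The backbone choice, feature shapes, kernel size and number of label steps are illustrative assumptions, and the per-step channel-wise attention of Section 3.2 is omitted for brevity:

```python
import tensorflow as tf

# Sketch of: basic CNN -> feature sequence -> ConvLSTM -> step-wise labels.
image = tf.keras.Input(shape=(224, 224, 3))
backbone = tf.keras.applications.VGG16(include_top=False, weights='imagenet')
features = backbone(image)                 # preliminary feature maps, e.g. 7x7x512
num_steps = 5                              # one prediction step per weather label
# Repeat the feature maps over the label steps; the real model would instead
# recalibrate them with channel-wise attention at every step (Section 3.2).
sequence = tf.keras.layers.Lambda(
    lambda f: tf.repeat(tf.expand_dims(f, 1), num_steps, axis=1))(features)
hidden = tf.keras.layers.ConvLSTM2D(512, 3, padding='same',
                                    return_sequences=True)(sequence)
labels = tf.keras.layers.TimeDistributed(tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid')]))(hidden)  # one label per step
model = tf.keras.Model(image, labels)
```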
3.1. The convolutional LSTM in the CNN–RNN architecture
Recurrent Neural Networks, especially LSTM, have recently achieved overwhelming success in sequence modeling tasks, such as image/video captioning [38] and neural machine translation [39]. Without loss of generality, the LSTM can be formulated as follows [40],
i_t = \sigma(W_{iw} x_t + U_{ih} h_{t-1} + b_i),
f_t = \sigma(W_{fw} x_t + U_{fh} h_{t-1} + b_f),
o_t = \sigma(W_{ow} x_t + U_{oh} h_{t-1} + b_o),
g_t = \tanh(W_{gw} x_t + U_{gh} h_{t-1} + b_g),
c_t = f_t \circ c_{t-1} + i_t \circ g_t,
h_t = o_t \circ \tanh(c_t), \qquad (1)
where the subscript t indicates the tth step of the LSTM, x_t denotes the input data, h_t stands for the hidden state, and c_t is the cell state. i_t, f_t and o_t are the input gate, forget gate and output gate of the LSTM, respectively. The Ws, Us and bs are weights and biases to be learned. \sigma, \tanh and \circ represent the sigmoid function, the hyperbolic tangent function and element-wise multiplication, respectively. As shown in Eq. (1), at each step, the data x_t and the previous hidden state h_{t-1} are taken as the input of the current LSTM unit, and the historical information is recorded in the hidden state h_t, such that the LSTM can exploit the temporal dependency.
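To make Eq. (1) concrete, the following NumPy sketch performs a single LSTM step; the dictionary-based gate parameters are an illustrative convenience, not the paper's notation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One step of Eq. (1); W, U, b map each gate name to its weights/bias."""
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])  # input gate
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])  # forget gate
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])  # output gate
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])  # candidate cell state
    c = f * c_prev + i * g          # element-wise cell update
    h = o * np.tanh(c)              # new hidden state
    return h, c
```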
Although the standard LSTM has demonstrated its powerful capability in sequence modeling tasks, the spatial information is ignored when processing images [23]. As can be seen from Eq. (1), fully connected transformations are used in the state-to-state and input-to-state transitions. Generally, if the input image data x_t \in R^{W \times H \times C}, it will be flattened to a 1D vector before being input to the LSTM. However, this process suffers from the loss of spatial information. To overcome this drawback, the convolutional LSTM is employed in our approach [23], which can be formulated as follows,
i_t = \sigma(W_{iw} \ast x_t + U_{ih} \ast h_{t-1} + b_i),
f_t = \sigma(W_{fw} \ast x_t + U_{fh} \ast h_{t-1} + b_f),
o_t = \sigma(W_{ow} \ast x_t + U_{oh} \ast h_{t-1} + b_o),
g_t = \tanh(W_{gw} \ast x_t + U_{gh} \ast h_{t-1} + b_g),
c_t = f_t \circ c_{t-1} + i_t \circ g_t,
h_t = o_t \circ \tanh(c_t), \qquad (2)
where \ast denotes the convolution operator and the other symbols are the same as in Eq. (1). It should be noted that the input feature x_t, cell state c_t, hidden state h_t and gates i_t, f_t, o_t of the convolutional LSTM are all 3D tensors, and convolution operations are used in the state-to-state and input-to-state transformations. Therefore, the spatial information of the features is preserved. Furthermore, the convolution operation actually has an implicit spatial attention mechanism, since regions corresponding to the target label usually have higher activation responses. In the experiments, we also find that the convolutional LSTM attends to several critical regions for weather label prediction, and achieves better results than a common LSTM with or without a spatial attention model.
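As a point of reference, Keras provides a convolutional LSTM layer that realizes the convolutional state-to-state and input-to-state transformations of Eq. (2); a minimal sketch, assuming a sequence of 14 × 14 × 512 CNN feature maps (the shape and kernel size are our assumptions):

```python
import tensorflow as tf

# ConvLSTM over a sequence of CNN feature maps, preserving spatial layout.
feature_seq = tf.keras.Input(shape=(None, 14, 14, 512))   # (steps, W, H, C)
hidden_seq = tf.keras.layers.ConvLSTM2D(
    filters=512, kernel_size=3, padding='same',
    return_sequences=True)(feature_seq)                   # 3D hidden state per step
```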
3.2. Channel-wise attention model in the CNN–RNN architecture
Usually, different regions will be activated in disparate channels of the feature map, and different image regions have different importance when estimating various weather conditions. In our CNN–RNN architecture, each step of the convolutional LSTM predicts one weather label. Inspired by Hu et al. [41], we propose a channel-wise attention model for the CNN–RNN architecture to adaptively recalibrate the feature responses when predicting different weather labels. The proposed channel-wise attention model is illustrated in Fig. 4.
As discussed in [41], exploiting global information is a popular method in feature engineering. To calculate the attention weight of each feature map channel, we adopt a similar strategy, i.e., global average pooling is used to generate channel-wise statistics, which can be viewed as a descriptor of the channel-wise global spatial information. Different from [41], in our multi-label weather classification task, we want to adaptively obtain the channel-wise attention weights according to the previously predicted weather label. Therefore, we also take into account the channel-wise statistics information encoded in the hidden state of the convolutional LSTM. The two kinds of statistics information are formulated as follows,
a_k = f_a(x_k) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} x_k(i, j), \qquad (3)

d_k = f_a(h_{t-1,k}) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} h_{t-1,k}(i, j), \qquad (4)

where x_k and h_{t-1,k} denote the visual feature and the previous hidden state of the convolutional LSTM at the kth channel (k = 1, 2, ..., C), respectively. f_a represents the global average pooling function, and a_k and d_k denote the statistics information of the visual feature and the hidden state at the kth channel. W and H stand for the width and height of the visual features. It should be noted that, in our approach, the visual features and hidden states have the same dimensions.

After the statistics information of the visual features and hidden states is obtained, the channel-wise attention weights are calculated by

z_k = \sigma(w_2 \, \delta(w_1 [a_k, d_k] + b_1) + b_2), \qquad (5)

where the ws and bs are weights and biases to be learned, \delta represents the ReLU [42] function that is utilized to learn the non-linear mapping, [\cdot, \cdot] is the concatenation operation, and \sigma indicates the sigmoid function, which normalizes the attention weights to the range 0–1. Finally, the recalibrated features are obtained by rescaling the original features with the attention weights,

\tilde{x} = \sum_{k=1}^{C} z_k x_k. \qquad (6)
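A minimal NumPy sketch of Eqs. (3)–(6); the layer shapes, and the choice to concatenate the two channel descriptors into a single vector before the two fully connected layers, are our assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, h_prev, w1, b1, w2, b2):
    """Recalibrate (W, H, C) features per Eqs. (3)-(6).
    Assumed shapes: w1 (r, 2C), w2 (C, r) for some reduction size r."""
    a = x.mean(axis=(0, 1))        # Eq. (3): per-channel statistics of the feature
    d = h_prev.mean(axis=(0, 1))   # Eq. (4): per-channel statistics of the hidden state
    s = np.concatenate([a, d])     # concatenation [a_k, d_k]
    z = sigmoid(w2 @ np.maximum(w1 @ s + b1, 0.0) + b2)  # Eq. (5): attention weights
    return x * z                   # Eq. (6): rescale each channel by its weight
```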
3.3. Inference

In this paper, the weather labels are predicted in a fixed order. Practically, the order of the weather labels is set according to their co-occurrence relationships; details are given in Section 4.2.

In each step of the convolutional LSTM, the 3D hidden state is flattened to a 1D vector, which is then used to predict the weather label,

p_t = \sigma(w_p h_t + b_p), \qquad (7)

where p_t \in [0, 1] is the output probability of the tth weather label, h_t is the flattened hidden state, and w_p and b_p are the learned weight and bias.

The loss of each prediction step is determined by the following function,

loss_t = -\frac{1}{N} \sum_{i=1}^{N} \left[ p_{i,t} \log \tilde{p}_{i,t} + (1 - p_{i,t}) \log(1 - \tilde{p}_{i,t}) \right], \qquad (8)

where N denotes the number of training samples, p_{i,t} indicates the ground-truth label of the ith sample on the tth weather class, and \tilde{p}_{i,t} is the corresponding predicted label. Finally, the total loss is formulated as follows,

Loss = \sum_{t=1}^{T} loss_t, \qquad (9)

where T represents the number of all weather classes.

3.4. Training details

The open source library TensorFlow is used to implement the proposed approach. To accelerate convergence, we adopt a two-stage training strategy. In the first stage, the basic CNN of our approach (i.e., the first five groups of convolutional/pooling layers of VGGNet [35]) is trained. Specifically, we transform VGGNet into a multi-label classification framework by replacing the output layer with a T-neuron layer (T represents the number of weather classes), and train it with the multi-label sigmoid cross-entropy loss function. The VGGNet model pre-trained on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset is used for fine-tuning. In the second stage, we remove the fully connected layers of VGGNet and fix the other parameters. Then, the convolutional LSTM and the channel-wise attention model are trained from scratch based on the CNN-extracted features. The Xavier initialization method is employed in this stage. The Adam [43] optimization approach is used to minimize the loss functions in both stages, where the first and second moment coefficients are set to 0.9 and 0.999, respectively. To avoid overfitting, dropout [44] is applied after the fully connected layers in both stages, and L2 regularization is also employed for all weight parameters. We set the dropout ratio and the weight of L2 regularization to 0.5 and 0.0005 during the entire training process. The learning rate is initialized as 0.0001 and drops by a factor of 10 after the loss becomes stable. Besides, we also attempted to fine-tune all parameters after the second training stage, i.e., unfixing the parameters of the basic CNN, but experiments show that this strategy does not bring performance improvements.

Before training, each sample is resized into a 256 × 256 image. Random flip, random crop and random noise are used for data augmentation. We adopt a stochastic mini-batch training strategy: images are randomly shuffled and grouped into mini-batches of size 50 before each training epoch. Table 1 shows the detailed shapes of several critical components of the proposed CNN–RNN architecture; the shapes of all biases can be easily inferred.
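A brief sketch of the per-step prediction head (Eq. (7)) together with the loss and optimizer settings described above, using illustrative Keras components; in a real model the head layers would be created once and reused across steps:

```python
import tensorflow as tf

def make_label_head():
    """Flatten the 3D hidden state and predict one label probability (Eq. (7)),
    with the dropout and L2 settings of Section 3.4."""
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation='sigmoid',
                              kernel_regularizer=tf.keras.regularizers.l2(5e-4)),
    ])

step_loss = tf.keras.losses.BinaryCrossentropy()  # Eq. (8); summed over steps for Eq. (9)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999)
```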
4. Experiments

Since this is the first work to treat weather recognition as a multi-label classification problem, there are no existing datasets for this task. Therefore, to evaluate the proposed approach, we construct two datasets, where one is a modification of the transient attribute dataset [19] and the other is created from scratch. In this section, we first introduce the construction procedure and details of the two datasets. Then, the co-occurrence relationships among weather labels are explored. Finally, the evaluation metrics, comparison approaches and experimental results are presented.
4.1. Dataset description
4.1.1. The transient attribute dataset
The first dataset is transformed from an existing transient attribute dataset [19] which was originally built for outdoor scene understanding and editing. Although the transient attribute dataset is not specially designed for weather recognition, it presents many appealing properties. First, images are captured across many outdoor scenes including mountains, cities, towns and urban sceneries. Images in this dataset are of different scales and views, which enhances the diversity across scenes. Second, images are selected elaborately to ensure they exhibit various appearances of the same scene. Moreover, the authors of [19] defined 40 transient attributes for this dataset, including weather-related attributes (e.g., 'sunny', 'rain', 'fog', etc.). For each image, the weather-related attributes are annotated non-exclusively, which is important for our multi-label weather recognition experiments. Several examples from the transient attribute dataset are illustrated in Fig. 5.
For weather recognition, six weather-related attributes among all 40 transient attributes are selected, i.e., 'sunny', 'cloudy', 'fog', 'snow', 'moist' and 'rain'; the others are ignored in our experiments. Besides, we find that there exist a few examples in which all weather attribute strengths are very low. Some of them are captured at dawn or dusk, and others do not show obvious features corresponding to any weather category. Therefore, we add an 'other' class to represent those examples in which every attribute strength is lower than 0.5. It is noteworthy that a strength lower than 0.5 indicates that the annotation workers do not think the image exhibits the corresponding attribute. In this paper, for the weather recognition task, weather attribute strengths greater than and lower than 0.5 are set to 1 and 0, respectively. Finally, the dataset contains seven weather classes and 8571 images in total. The detailed statistics of the dataset are displayed in Table 2.
4.1.2. The multi-label weather classification dataset
To further evaluate the proposed approach, we construct a new dataset from scratch, which contains 10,000 images from five weather classes, i.e., 'sunny', 'cloudy', 'foggy', 'rainy' and 'snowy'. All images are elaborately selected from the Internet. Compared to other weather recognition datasets, our dataset has the following advantages. First, most existing datasets focus on only two or three weather classes, while our dataset covers all common weather conditions in daily life. Second, the newly constructed dataset contains many different scenes including cities, villages, urban areas and so on, as depicted in Fig. 6. In addition, this dataset also exhibits different scales and views. Third, in our dataset, the weather labels are not mutually exclusive, which provides more weather information.
The annotation of the multi-label weather classification dataset was completed through a crowd-sourced task. The annotation workers were asked to determine the weather attribute strengths non-exclusively for a given image. The range of strengths is from 0 to 1, in which 0.5 is the demarcation point: a weather attribute strength lower than 0.5 indicates that the image cannot be judged as the corresponding weather condition (even if the image contains the corresponding attribute). In this dataset, each image is annotated by at least five workers, and the average value of each attribute strength is taken as the result. To ensure the effectiveness of the annotation task, we also calculate the variance of each attribute strength for a given image. If the variance is larger than a threshold, the result is re-determined by discussion. Finally, to generate the weather labels, all attribute strengths greater than or equal to 0.5 are set to 1, and the others are set to 0.
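A small sketch of this aggregation rule; the variance threshold value is an illustrative assumption (the paper does not state it):

```python
import numpy as np

def strengths_to_labels(worker_strengths, var_threshold=0.1):
    """Aggregate per-worker strengths of shape (num_workers, num_classes)
    into binary weather labels; flag high-variance attributes for discussion."""
    mean = worker_strengths.mean(axis=0)
    disputed = worker_strengths.var(axis=0) > var_threshold  # re-annotate by discussion
    labels = (mean >= 0.5).astype(int)  # threshold at the 0.5 demarcation point
    return labels, disputed
```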
Fig. 7 shows the weather label distribution on the two experimental datasets. The detailed statistics can also be found in Table 2. In both datasets, cloudy is the class with the largest number of samples. This is because cloudy usually co-occurs with other weather conditions. Apart from cloudy, the newly constructed dataset is more balanced than the transient attribute dataset. Besides, it can be observed from Table 2 that over half of the samples in both datasets have multiple weather labels, which also verifies the validity of treating weather recognition as a multi-label classification task.
4.2. Co-occurrence relationships

We have qualitatively argued that more than one weather condition may occur simultaneously in one image. A quantitative analysis of the co-occurrence relationships among different weather conditions is also conducted, according to the following equation,

R(i, j) = \frac{\sum_{Q} conc(i, j)}{\sum_{Q} I(i)}, \qquad (10)

where both i and j denote a kind of weather condition, R(i, j) indicates the measurement of the co-occurrence relationship between i and j, and Q represents all the samples in the dataset. conc(i, j) and I(i) are indicator functions, defined as follows,

conc(i, j) = \begin{cases} 1, & Arr(i) \geq 0.5 \wedge Arr(j) \geq 0.5 \\ 0, & \text{otherwise} \end{cases} \qquad (11)

I(i) = \begin{cases} 1, & Arr(i) \geq 0.5 \\ 0, & \text{otherwise} \end{cases} \qquad (12)

where Arr(i) denotes the attribute strength of weather condition i, and \wedge represents the conjunction symbol. In summary, Eq. (10) indicates the ratio between the co-occurrence number of the two weather conditions and the occurrence number of weather condition i over all images. Therefore, \sum_j R(i, j) and \sum_j R(j, i) represent the influence and the dependence of label i with respect to the other labels, respectively. To exploit the dependencies when predicting the weather labels, it is natural to predict the most influential label first and the most dependent label last. Based on this, the following equation is utilized to rank the weather labels,

r = \frac{\sum_j R(i, j)}{\sum_j R(j, i)}. \qquad (13)

Obviously, the label with a higher score of r should rank first.

The analytical result is depicted in Fig. 8, from which we can draw the following conclusions. First, in accordance with our intuition, there are strong co-occurrence relationships among some weather conditions, such as rainy and cloudy, snowy and foggy, etc. The corresponding samples are usually near the category boundary, and in this paper we propose to use combinations of labels to represent them. Second, there are indeed label dependencies in the weather recognition task, and it is necessary to consider them when predicting multiple weather labels. In this paper, the convolutional LSTM is employed to capture the dependencies among different weather labels, and the labels are predicted step by step. According to Eq. (13), the order of the weather labels is fixed as moist → cloudy → other → sunny → snowy → foggy → rainy on the transient attribute dataset, and cloudy → sunny → foggy → rainy → snowy on our multi-label weather classification dataset. Practically, we have also tried several other label orders; they achieve comparable performance, and the above two perform best in most cases.
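A compact NumPy sketch of Eqs. (10)–(13), computing R(i, j) from the annotated attribute strengths and ranking the labels by r; the array shapes are illustrative:

```python
import numpy as np

def cooccurrence_matrix(strengths):
    """R(i, j) of Eq. (10) from a (num_images, num_classes) array of
    attribute strengths in [0, 1]."""
    present = strengths >= 0.5                 # I(i) per image, Eq. (12)
    counts = present.T.astype(float) @ present.astype(float)  # sums of conc(i, j), Eq. (11)
    occurrences = present.sum(axis=0)          # sums of I(i) over all images
    return counts / occurrences[:, None]       # Eq. (10)

def label_order(R):
    """Rank labels by r = sum_j R(i, j) / sum_j R(j, i), Eq. (13); highest first."""
    r = R.sum(axis=1) / R.sum(axis=0)
    return np.argsort(-r)
```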
4.3. Evaluation metrics and comparison approaches

Per-class precision and recall are first computed as evaluation metrics. Per-class means that for a given weather label, the prediction result is true as long as the current label is correctly predicted. Then, the average precision (AP) and average recall (AR) are calculated, which are defined as the average values of per-class precision and recall, respectively.

Besides, sample-wise evaluation metrics are also adopted, namely the overall precision (OP) and overall recall (OR),

OP = \frac{\sum_{n=1}^{N} \sum_{i=1}^{K} f(p_{n,i}, \tilde{p}_{n,i})}{N \cdot K}, \qquad (14)

OR = \frac{\sum_{n=1}^{N} \sum_{i=1}^{K} f(p_{n,i}, \tilde{p}_{n,i})}{\sum_{n=1}^{N} \sum_{i=1}^{K} p_{n,i}}, \qquad (15)

where N denotes the number of samples in the dataset, K represents the number of weather classes, and p_{n,i} and \tilde{p}_{n,i} indicate the ground-truth label and the predicted label of the nth sample on the ith weather class, respectively. f(\cdot, \cdot) is an indicator function, defined as follows,

f(p, \tilde{p}) = \begin{cases} 1, & p = \tilde{p} \\ 0, & \text{otherwise} \end{cases} \qquad (16)

Finally, the F1 scores (including AF1 and OF1) are computed, which are the harmonic means of precision and recall.

Since there are no other multi-label weather recognition approaches, we compare with the multi-label versions of AlexNet [32] and VGGNet [35]. To verify the effectiveness of the convolutional LSTM and the channel-wise attention model proposed in this paper, we also compare with several other CNN–RNN frameworks, including CNN–LSTM, CNN–LSTM with a spatial attention model (CLA), CNN–GRU with a spatial attention model (CGA), and CNN-ConvLSTM without the channel-wise attention model. Besides, two widely used general multi-label approaches are also employed as comparison methods, i.e., ML-KNN [45] and ML-ARAM [46]. ML-KNN is a multi-label lazy learning method that adapts the traditional K-nearest neighbor (KNN) algorithm to the multi-label setting. ML-ARAM extends the Adaptive Resonance Associative Map neural network to multi-label classification tasks. In our experiments, we test these two approaches using the implementations in the popular scikit-multilearn library. For fair comparison, all CNN–RNN frameworks use the same CNN (i.e., VGGNet) as our approach. The features input to ML-KNN and ML-ARAM are also extracted by VGGNet (the last fully connected layer) pre-trained on the two experimental datasets. The proposed approach is referred to as CNN-Att-ConvLSTM.
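The overall metrics follow directly from Eqs. (14)–(16) as reconstructed above; a minimal sketch for binary label matrices (AP and AR would be computed per class analogously):

```python
import numpy as np

def overall_metrics(y_true, y_pred):
    """OP, OR and OF1 per Eqs. (14)-(16) for (num_samples, num_classes)
    binary label matrices."""
    matches = (y_true == y_pred).sum()     # sum of f(p, p~) over all entries
    op = matches / y_true.size             # Eq. (14): normalized by N * K
    orr = matches / y_true.sum()           # Eq. (15): normalized by positive labels
    of1 = 2 * op * orr / (op + orr)        # harmonic mean of OP and OR
    return op, orr, of1
```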
4.4. Results on the transient attribute dataset

For the transient attribute dataset, 1000 images are randomly selected for testing, another 1000 images are selected for validation, and the remaining images are used for training. The experimental results are shown in Table 3, from which we can see that the proposed approach CNN-Att-ConvLSTM achieves the best results on OP, OR and OF1, and comparable results with the state-of-the-art on AP, AR and AF1. CNN–LSTM with the spatial attention model (CLA) also achieves good results, while without the spatial attention model, CNN–LSTM suffers a serious performance degradation. This indicates the importance of key regions in the weather recognition task. To evaluate the influence of the LSTM in the CNN–RNN framework, we also test CNN–GRU with the spatial attention model (CGA), and find that CGA achieves almost the same results as CLA. CNN-ConvLSTM also achieves results similar to CLA, which denotes the effectiveness of the convolutional LSTM in extracting information from key regions. Overall, the proposed approach performs better than the multi-label versions of AlexNet and VGGNet, the general multi-label approaches ML-KNN and ML-ARAM, and the other CNN–RNN methods, which proves the superiority of our approach.
For the per-class results, all these methods perform worse on the 'rainy' and 'other' classes. This is because most images in the transient attribute dataset present distant views, and it is difficult to recognize rainy weather from such distant views. In addition, samples of the 'other' class are very rare, and they can easily be misclassified as sunny or cloudy in this dataset.
4.5. Results on the multi-label weather classification dataset
For the multi-label weather classification dataset, 2000 images are randomly selected for testing, 1000 images for validation, and the remainder for training. As presented in Table 4, CNN-Att-ConvLSTM performs the best on almost all the evaluation metrics, which demonstrates the effectiveness of the proposed approach again.
To analyze the effectiveness of our approach, some weather recognition examples are presented in Fig. 9, including the classification results, activation maps and attention weights from our approach. The results of VGGNet are utilized for comparison, since our approach also uses it as the deep feature extractor.
Specifically, our approach works well on the first three images. From the selected activation maps and their attention weights, we can see that our approach attends to the most correlated weather cues when predicting different weather labels, while the results of VGGNet are not as satisfactory. For example, the first image is annotated as sunny and foggy; correspondingly, the blue sky, the bright area and the region of haze have stronger responses in our activation maps, and the attention weights of the corresponding activation maps are relatively high when predicting the respective labels. However, the ground is activated by VGGNet mistakenly, which leads to the wrong label, i.e., rainy. Our approach fails on the remaining two images. The fourth image is annotated as sunny and cloudy, which means an intermediate state between sunny and cloudy. However, only the cloud regions are activated, and the sunny label is lost in our approach, mainly because the sunny label is somewhat ambiguous. The fifth image is annotated as cloudy and rainy. However, because the wet ground is not obvious, it is misclassified as cloudy and foggy by our approach. Overall, the results in Fig. 9 indicate that our approach performs well in most cases, but sometimes fails when the annotation is ambiguous and the weather cues are not obvious. This is reasonable since our approach is based only on visual features; better performance may be achieved with other modality information, such as humidity, which will be taken into consideration in our future work.
5. Conclusion
Considering that more than one weather condition may occur simultaneously in one image, we first analyze the drawbacks of treating weather recognition as a single-label classification task, and then propose a multi-label classification framework for the weather recognition task. It allows one image to belong to multiple weather categories, which provides a more comprehensive description of the weather conditions. Specifically, it is a CNN–RNN architecture, where the CNN is extended with a channel-wise attention model to extract the most correlated visual features, and a convolutional LSTM is utilized to predict the weather labels step by step while maintaining the spatial information of the visual features. Besides, we build two datasets for the weather recognition task to alleviate the problem of lacking training data. The experimental results have verified the effectiveness of the proposed approach.
In future work, we plan to introduce the distribution prediction task for weather recognition [47–50], which can not only classify an image with multiple labels, but also predict the strengths of the different weather classes, so as to describe the weather conditions more comprehensively. Besides, other modality information, such as humidity and temperature, can also be utilized in the future.