Interpreting traffic signs is an essential part of driving for every human driver, and an autonomous vehicle inevitably has to understand traffic signs just as well to react appropriately. The second Udacity Self-Driving Car Engineer Nanodegree project focused on one part of this interpretation: the classification of traffic signs.
To do this, a convolutional neural network had to be built and trained to decide, for each provided image, which of 43 different German traffic signs the image fits best.
A neural network is a collection of connected neurons. There can be, and often are, several interconnected layers which learn from training data to reproduce, or get as close as possible to, the correct solution provided with the input. For example, if a neural network is shown an image of a car together with the label “car”, then over many training iterations the neurons learn to respond to certain features in the input and collectively produce the label “car”. The network's success is evaluated by a scoring function, and the neurons' sensitivities are adjusted to optimize that score.
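As a toy illustration of this scoring-and-adjusting loop, here is a single artificial neuron trained with plain NumPy; the squared error serves as the scoring function, and all values are made up for the example:

```python
import numpy as np

# A single "neuron": a weighted sum of its inputs plus a bias.
w, b = np.random.randn(3), 0.0

def predict(x):
    return np.dot(w, x) + b

# The scoring function: squared error between prediction and correct label.
def loss(x, y):
    return (predict(x) - y) ** 2

# One training step: nudge the weights in the direction that lowers the loss.
def train_step(x, y, lr=0.01):
    global w, b
    err = predict(x) - y   # how far off the prediction is
    w -= lr * 2 * err * x  # gradient of the loss w.r.t. the weights
    b -= lr * 2 * err      # gradient of the loss w.r.t. the bias

x, y = np.array([0.5, -1.0, 2.0]), 1.0
for _ in range(100):
    train_step(x, y)
print(loss(x, y))  # approaches 0 as the neuron adapts to the example
```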
In a convolutional layer of a neural network we exploit the fact that in images the location of a feature does not change its meaning, i.e. a car is a car no matter where it appears in the image. To achieve this, a convolutional layer looks at only a small area of the input at a time. Applied directly to the input image, this picks up features like edges and lines of different orientations or spots of a certain colour. Convolutional layers deeper in the network combine these small subunits into larger patterns. This builds up until, in the end, concepts like an entire car, animals or, in this case, traffic signs are recognized. An in-depth and highly commendable resource on the topic is the Stanford course on convolutional neural networks (http://cs231n.github.io/convolutional-networks/).
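As a small illustration, this is what a single convolutional layer looks like in Keras; the filter count and kernel size are arbitrary choices for the example, not the ones used in the project:

```python
import tensorflow as tf

# One convolutional layer: 16 filters, each looking at a 3x3 patch of the
# input at a time. The same filters slide over the whole image, so a
# feature is detected regardless of where it appears.
conv = tf.keras.layers.Conv2D(filters=16, kernel_size=(3, 3),
                              activation='relu')

images = tf.random.uniform((1, 32, 32, 3))  # one 32x32 RGB image
feature_maps = conv(images)
print(feature_maps.shape)                   # (1, 30, 30, 16)
```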
Back to the problem at hand. Udacity provided a training dataset of about 35,000 labelled images of traffic signs, as well as validation and test sets. The dataset is split this way to avoid evaluating the model on data it has already seen. The validation set is used to evaluate the model while iterating on it. Over many iterations the characteristics of the validation set bleed into the model, so another independent dataset is required to confirm the effectiveness of the final model. That dataset is the test set.
All the dataset images have a resolution of 32 by 32 pixels and 3 (RGB) colour channels. The labels are numbers which correspond to the traffic sign classes.
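Loading the data might look roughly like this; the file names and dictionary keys shown are assumptions about the pickled format of the Udacity download and may differ:

```python
import pickle

# The dataset comes as pickled dictionaries, one file per split
# (file names assumed here).
with open('train.p', 'rb') as f:
    train = pickle.load(f)
with open('valid.p', 'rb') as f:
    valid = pickle.load(f)
with open('test.p', 'rb') as f:
    test = pickle.load(f)

X_train, y_train = train['features'], train['labels']
X_valid, y_valid = valid['features'], valid['labels']
X_test, y_test = test['features'], test['labels']

print(X_train.shape)  # roughly (35000, 32, 32, 3): 32x32 px, RGB
print(y_train[:5])    # class labels, integers in 0..42
```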
The plan for this problem was to first build a simple neural network and train it on the input images. The first attempt used the simple LeNet-5 architecture, a small neural network which works well and does not take long to train.
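A LeNet-5-style network for this task could be sketched in Keras along these lines, adapted to the 32x32 RGB input and the 43 sign classes; this is an illustrative sketch, not the exact code from the project:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# LeNet-5-style architecture: two conv/pool stages, then fully
# connected layers down to one output per traffic sign class.
model = models.Sequential([
    layers.Conv2D(6, (5, 5), activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(16, (5, 5), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(120, activation='relu'),
    layers.Dense(84, activation='relu'),
    layers.Dense(43, activation='softmax'),  # 43 traffic sign classes
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```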
This led to a decent accuracy straight away, predicting the right traffic sign ~89% of the time on the validation set. To improve this further the training data was normalized in two ways. The first is histogram normalization, which aims to fully utilize the range of possible values in the colour channels. In a standard RGB image each pixel channel has a value from 0 to 255. Images often do not make full use of this range; if an image mainly uses, say, 50-150, histogram normalization stretches the existing values out over the full range, improving the visible contrast.
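One common way to implement this with OpenCV, shown here as a sketch, is to equalize only the luminance channel of the image so the colours themselves stay intact; the exact variant used in the project may differ:

```python
import cv2
import numpy as np

def equalize_histogram(img):
    # Convert to YUV and equalize only the luminance (Y) channel,
    # leaving the colour information untouched.
    yuv = cv2.cvtColor(img, cv2.COLOR_RGB2YUV)
    yuv[:, :, 0] = cv2.equalizeHist(yuv[:, :, 0])
    return cv2.cvtColor(yuv, cv2.COLOR_YUV2RGB)

# Applied to the whole training set:
# X_train = np.array([equalize_histogram(img) for img in X_train])
```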
The second normalization is aimed specifically at the maths of training the neural network: if the nodes of the network can operate on values close to 0, training converges more quickly (http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf).
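A simple sketch of such a normalization, rescaling the 0-255 pixel values so they are centred around 0; the exact constants are one common choice, not necessarily the ones used here:

```python
import numpy as np

def normalize(images):
    # Shift and scale uint8 pixel values (0-255) to roughly [-1, 1].
    return (images.astype(np.float32) - 128.0) / 128.0
```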
Those two measures improved the accuracy to 95%, above the 93% required for the project to be accepted. To take it one step further I looked into data augmentation. Data augmentation extends the available training dataset by adding images generated from existing ones with small alterations. Generally, augmentations are chosen so that they resemble realistic variations. Traffic sign images could be rotated or tilted, as signs can stand on a slope or be photographed from different angles; different lighting conditions could be simulated by randomly changing the brightness of the image, or fake shadows could be drawn onto it. These variations help the model generalize better, as it learns to cope with varying conditions.
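As a sketch, a rotation-based augmentation could look like the following; the maximum angle of 15 degrees is an illustrative assumption:

```python
import numpy as np
from scipy.ndimage import rotate

def augment_with_rotations(images, labels, max_angle=15.0):
    # Add one randomly rotated copy of every training image; the labels
    # stay the same since a rotated sign is still the same sign.
    rotated = np.array([
        rotate(img, np.random.uniform(-max_angle, max_angle),
               reshape=False, mode='nearest')
        for img in images
    ])
    return np.concatenate([images, rotated]), np.concatenate([labels, labels])
```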
In this scenario I rotated images slightly and randomly. This was combined with an improved network architecture featuring an inception module, which runs several convolutional elements with different receptive field sizes in parallel. Instead of choosing a receptive field size manually, the network learns on its own which ones work best.
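A simplified inception module can be sketched with the Keras functional API like this; full inception modules (as in GoogLeNet) additionally use 1x1 convolutions to reduce dimensionality, and the filter counts here are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, filters=16):
    # Convolutions with different receptive field sizes run in parallel;
    # training decides how much weight each branch gets.
    branch_1x1 = layers.Conv2D(filters, (1, 1), padding='same',
                               activation='relu')(x)
    branch_3x3 = layers.Conv2D(filters, (3, 3), padding='same',
                               activation='relu')(x)
    branch_5x5 = layers.Conv2D(filters, (5, 5), padding='same',
                               activation='relu')(x)
    branch_pool = layers.MaxPooling2D((3, 3), strides=(1, 1),
                                      padding='same')(x)
    # Stack all branch outputs along the channel dimension.
    return layers.Concatenate(axis=-1)(
        [branch_1x1, branch_3x3, branch_5x5, branch_pool])

inputs = tf.keras.Input(shape=(32, 32, 3))
x = inception_module(inputs)
print(x.shape)  # (None, 32, 32, 51): 16 + 16 + 16 + 3 channels
```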
All these efforts combined resulted in a total accuracy of 96% against the test set, and the model was submitted in this form. It could likely be improved further through more data augmentation (e.g. brightness changes, tilting and shadows) and more complex neural networks. Techniques I learned in later projects, like saving the model whenever it improves during training and terminating early when no progress is being made anymore, shorten iterations and help to avoid overfitting the model.
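In Keras, as a sketch continuing the illustrative model from above, those two techniques boil down to a pair of callbacks; the file name and patience value are arbitrary choices:

```python
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

# `model`, `X_train` etc. are the objects from the sketches above.
callbacks = [
    # Save the weights every time the validation accuracy improves.
    ModelCheckpoint('best_model.h5', monitor='val_accuracy',
                    save_best_only=True),
    # Stop training once the validation accuracy has not improved for
    # 5 epochs, and keep the best weights seen so far.
    EarlyStopping(monitor='val_accuracy', patience=5,
                  restore_best_weights=True),
]

model.fit(X_train, y_train, epochs=50,
          validation_data=(X_valid, y_valid),
          callbacks=callbacks)
```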