
Teaching a robot Guess Who? From Deep Learning to Brain Spikes


Thursday 15th


10:45 | 11:25


Theatre 20


Keywords defining the session:

- Spatial Transformers

- Triplet loss

- Brain Spikes

Takeaway points of the session:

- CNN vulnerabilities: using Spatial Transformer Networks to alleviate their shortcomings, and building a robust dataset for identity-recognition tasks (applicable to other computer vision problems) by applying different types of noise and kernel-based affine transformations.

- Using embedding representations for multiclass image classification, and understanding how to apply triplet loss to your own dataset.


Traditional Convolutional Neural Networks (CNNs) are a widespread architecture among Deep Learning practitioners. As a reminder, they are based on small matrices (usually 3×3 or 5×5) called kernels, whose element values are learnt by the network itself. Kernels are applied to the input images through an operation called convolution and, as a result, different features of the original image are highlighted. The shallowest layers extract the most basic features, while deeper layers retrieve increasingly complex features built on top of the previous ones.

When tackling the problem of making our Pepper robot able to recognize our colleagues, we noticed that the different CNN-based architectures we tried were not complex enough (or perhaps too linear) to extract the features needed to distinguish one face from another (something tricky, since most of us look very similar: ears, eyes, eyebrows, lips, mouth, nose…). Limited data is also a problem, but one of the most overlooked issues is CNNs' vulnerability: it is extremely easy to deceive them by making changes to an image that are imperceptible to the human eye.
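To make the convolution operation concrete, here is a minimal numpy sketch (a hand-written edge-detection kernel on a toy image, not the robot's actual network; in a real CNN the kernel values are learnt):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (valid padding), summing element-wise products."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge kernel; a trained CNN would learn values like these by itself.
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])

# Toy 5x5 image: dark left half, bright right half.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

fmap = convolve2d(image, kernel)  # strong responses where the edge sits
```

The resulting feature map responds only where the dark-to-bright transition occurs, which is exactly the "feature highlighting" role the text describes.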
The first serious change we attempted was to drop the typical Softmax layer used for most classification problems in Deep Learning and let the network output an embedding representation, that is, to represent each image as an n-dimensional vector. By letting the network learn how to embed each image, we can represent every image of each class (each colleague) as a point in an n-dimensional space, in such a way that images of the same person lie close to each other while images of different-looking persons lie far apart.
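As a toy sketch of this idea (a random linear projection standing in for the trained CNN, purely for illustration), embeddings are L2-normalised vectors whose distances can be compared:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 64))  # stand-in for the network: project 8x8 images to 8 dims

def embed(image):
    """Map an image to a unit-length embedding vector."""
    v = W @ image.ravel()
    return v / np.linalg.norm(v)

anchor = rng.normal(size=(8, 8))
same   = anchor + 0.05 * rng.normal(size=(8, 8))  # slight variation of the same face
other  = rng.normal(size=(8, 8))                  # a different face entirely

d_same  = np.linalg.norm(embed(anchor) - embed(same))
d_other = np.linalg.norm(embed(anchor) - embed(other))
# After training, d_same should be much smaller than d_other.
```

With a trained embedding function this distance gap is what makes identity comparisons possible; here the gap comes only from the small perturbation.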
So we need our network to learn a robust embedding function, and to achieve this we must present the training set in a different way. Traditionally, a training set consists of image-label pairs; with this approach, we instead feed the network triplets of images. Each triplet consists of an anchor (an image of the person associated with the label), a positive example (a different image of the same person) and a negative example (an image of a person associated with a different label).
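The triplet loss built on these triplets can be sketched in a few lines (numpy version of the standard formulation, with an assumed margin of 0.2):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull the anchor toward the positive and push it at least `margin`
    further away from the negative: max(d(a,p) - d(a,n) + margin, 0)."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(d_pos - d_neg + margin, 0.0)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # same person, already close: d_pos = 0.01
n = np.array([1.0, 1.0])   # different person, far away: d_neg = 2.0

easy_loss = triplet_loss(a, p, n)                      # satisfied triplet -> 0
hard_loss = triplet_loss(a, np.array([1.0, 0.0]), p)   # violating triplet -> positive
```

A triplet that already satisfies the margin contributes zero loss, so training is driven by the "hard" triplets that still violate it.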
An important advantage of this triplet-loss-plus-embedding approach is that, if the learnt embedding function is robust enough, adding a new person for our robot to identify is as easy as providing a single image and label of our new colleague, avoiding all the hassle of gathering thousands of pictures and retraining on the whole dataset.
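One way to picture this enrolment step (a hypothetical nearest-neighbour gallery over precomputed embeddings; the names and 2-D vectors below are invented for illustration):

```python
import numpy as np

# Gallery: one reference embedding per known colleague,
# as produced by the trained embedding network.
gallery = {
    "alice": np.array([1.0, 0.0]),
    "bob":   np.array([0.0, 1.0]),
}

def identify(query, gallery):
    """Return the colleague whose reference embedding is nearest to the query."""
    return min(gallery, key=lambda name: np.linalg.norm(query - gallery[name]))

# Enrolling a new colleague is just one more dictionary entry: no retraining.
gallery["carol"] = np.array([0.7, 0.7])

match = identify(np.array([0.65, 0.75]), gallery)  # nearest to carol's embedding
```

The network itself never changes; only the gallery of reference vectors grows.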

So far we’ve put all the stress on the architecture, but what about the data? Proper data augmentation will help us make our network more robust. Here we’ve worked in two different directions:

– Kernel rotational invariance, which aims to improve the network’s robustness to rotations of the patterns present in the input images.
– Noisy images, which distort the original images by applying different kinds of noise to them.
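A minimal sketch of these two augmentation directions (90° rotations and additive Gaussian noise here; the actual pipeline may use finer-grained rotations and other noise types):

```python
import numpy as np

rng = np.random.default_rng(1)

def rotate90(image, k=1):
    """Rotate the image by k * 90 degrees: a cheap rotational augmentation."""
    return np.rot90(image, k)

def add_gaussian_noise(image, sigma=0.1):
    """Distort the image with additive Gaussian noise of standard deviation sigma."""
    return image + rng.normal(scale=sigma, size=image.shape)

image = np.arange(16.0).reshape(4, 4)

# One original image becomes five training samples.
augmented = [rotate90(image, k) for k in range(4)] + [add_gaussian_noise(image)]
```

Each augmented copy keeps the same label, effectively multiplying the dataset while forcing the network to tolerate rotations and noise.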

Additionally, to mitigate the vulnerability of this kind of architecture, we also train it using Spatial Transformer Networks, which learn automatically, during the training of the main convolutional network, to apply different types of transformations to their feature maps.
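The core operation a Spatial Transformer performs can be sketched as warping a feature map with a 2×3 affine matrix (nearest-neighbour sampling for brevity; real STNs use differentiable bilinear sampling and *learn* the matrix from the input):

```python
import numpy as np

def affine_warp(fmap, theta):
    """Warp a 2-D feature map with a 2x3 affine matrix theta, mapping each
    output pixel back to its source location in the input (STN-style sampling)."""
    h, w = fmap.shape
    out = np.zeros_like(fmap)
    for i in range(h):
        for j in range(w):
            x = theta[0, 0] * j + theta[0, 1] * i + theta[0, 2]
            y = theta[1, 0] * j + theta[1, 1] * i + theta[1, 2]
            xi, yi = int(round(x)), int(round(y))
            if 0 <= yi < h and 0 <= xi < w:   # out-of-bounds samples stay zero
                out[i, j] = fmap[yi, xi]
    return out

fmap = np.arange(9.0).reshape(3, 3)
identity = np.array([[1., 0., 0.],
                     [0., 1., 0.]])   # leaves the feature map unchanged
shift    = np.array([[1., 0., 1.],
                     [0., 1., 0.]])   # samples one column to the right
```

In an STN, a small side network predicts `theta` from the input itself, so the main network can undo nuisance transformations before classifying.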

Finally, we want to introduce the third generation of Neural Networks, called Spiking Neural Networks. The future of artificial intelligence must advance by studying different types of biological brains more deeply and trying to emulate them more faithfully. The idea behind Spiking Neural Networks is to follow that path, which may allow us to move towards Artificial General Intelligence.
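The basic unit of such networks can be illustrated with a leaky integrate-and-fire neuron (a common simplified spiking model; the threshold and decay values below are arbitrary choices for the sketch):

```python
def lif_neuron(inputs, threshold=1.0, decay=0.9):
    """Leaky integrate-and-fire neuron: the membrane potential leaks each step,
    accumulates the input current, and emits a spike (then resets) on crossing
    the threshold."""
    potential, spikes = 0.0, []
    for current in inputs:
        potential = decay * potential + current
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0   # reset after firing
        else:
            spikes.append(0)
    return spikes

# A steady sub-threshold input still fires once charge has accumulated.
spikes = lif_neuron([0.4, 0.4, 0.4, 0.4, 0.4])
```

Unlike a conventional artificial neuron, information here is carried by the *timing* of discrete spikes rather than by a continuous activation value, which is what makes this generation closer to biological brains.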