What is a Capsule?

Mar 2, 2021

Hinton’s work to mimic the biological design of the human detection system

I learned something new today. Here is an easier-to-understand write-up about capsules:

Source: https://theverybesttop10.com/cats-sleeping-in-strange-and-uncomfortable-positions/

Why capsules?

Capsules address a limitation of CNNs and pooling. A CNN extracts features regardless of their order*. Pooling discards information instead of ‘disentangling’ it. A human sees the part-whole structure of an object, not individual pixels. To solve this, Hinton proposed capsules: a capsule is a group of neurons representing the same object part, and capsules are the building blocks of a Capsule Neural Network (CapsNet). Capsules allow CapsNet to “arrange” low-level features in a certain order. This mimics how a human sees: we don’t simply recognize a nose in a face, we see the nose in relation to the two eyes. This principle applies to both images and NLP.

*Note: A CNN can learn to handle affine transformations, but only if the training dataset grows exponentially to cover them, and that explosion does not scale to large problems. CapsNet assumes that one location hosts only one object instance, so a capsule can represent that object with an activation vector.
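
To make the “pooling discards information” point concrete, here is a tiny sketch in plain Python. The two 4x4 feature maps and the `max_pool_2x2` helper are made up for illustration (they are not from Hinton’s work). The same four activations sit at different positions in each map, yet 2x2 max-pooling returns the same result for both, so the next layer cannot tell the two arrangements apart.

```python
def max_pool_2x2(image):
    """Non-overlapping 2x2 max-pooling over a list-of-lists image."""
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r][c], image[r][c + 1],
                image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


# Hypothetical feature maps: identical activation values, but each value
# sits at a different position inside its 2x2 window.
map_a = [
    [9, 0, 0, 0],
    [0, 0, 0, 7],
    [0, 0, 0, 0],
    [0, 5, 3, 0],
]
map_b = [
    [0, 0, 0, 7],
    [0, 9, 0, 0],
    [0, 5, 3, 0],
    [0, 0, 0, 0],
]

pooled_a = max_pool_2x2(map_a)
pooled_b = max_pool_2x2(map_b)

print(pooled_a)              # [[9, 7], [5, 3]]
print(pooled_a == pooled_b)  # True: the positional difference is gone
```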

Biological inspiration

In our brain, there are structures called cortical minicolumns (yeah, they are vertical, column-like structures). Each minicolumn contains 80–120 neurons with the same receptive field. Cortical minicolumns inspired capsules. ==> having a neuroscience degree would be helpful

Math intuition

An activation in a neural net indicates how likely it is that a feature has been detected. In a CNN, the activation output is a scalar. In a CapsNet, we replace the `scalar, real-valued activation` with `vector-output capsules` and replace `max-pooling` with `routing-by-agreement` (basically, a child capsule sends its output to the parent capsule in the next layer that agrees with its prediction ==> the presence of the child indicates the presence of the parent in the scene). Since capsules make their predictions independently, agreement among them means a higher chance of correct detection.
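
Here is a rough sketch of those two replacements in plain Python. I am assuming the squashing function and the dynamic routing loop from Sabour, Frosst and Hinton’s 2017 paper “Dynamic Routing Between Capsules”, which this post does not link, so treat that choice as my assumption. The function names, the three routing iterations, and the toy prediction vectors are all hypothetical; a real CapsNet learns the predictions through weight matrices.

```python
import math


def length(v):
    """Euclidean length of a vector; for a capsule this acts as a probability."""
    return math.sqrt(sum(x * x for x in v))


def squash(s):
    """Shrink s so its length lies in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0 for _ in s]
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(predictions, num_iterations=3):
    """predictions[i][j] is child i's predicted output vector for parent j.

    Returns the parent output vectors and the coupling weights that were
    used to compute them in the last iteration.
    """
    num_children = len(predictions)
    num_parents = len(predictions[0])
    dim = len(predictions[0][0])
    logits = [[0.0] * num_parents for _ in range(num_children)]
    for _ in range(num_iterations):
        # Each child spreads its output over the parents (rows sum to 1).
        coupling = [softmax(row) for row in logits]
        outputs = []
        for j in range(num_parents):
            weighted_sum = [
                sum(coupling[i][j] * predictions[i][j][d]
                    for i in range(num_children))
                for d in range(dim)
            ]
            outputs.append(squash(weighted_sum))
        # Agreement = dot product between a prediction and the parent output.
        for i in range(num_children):
            for j in range(num_parents):
                logits[i][j] += sum(
                    p * v for p, v in zip(predictions[i][j], outputs[j]))
    return outputs, coupling


# Toy data: 3 children, 2 parents, 2-D vectors. All children roughly agree
# on parent 0 and point in conflicting directions for parent 1.
predictions = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.0, -1.0]],
    [[1.0, 0.1], [1.0, 0.0]],
]
outputs, coupling = route(predictions)
print([round(length(v), 3) for v in outputs])
print([[round(c, 3) for c in row] for row in coupling])
```

In this toy setup the children agree about parent 0 and disagree about parent 1, so parent 0 ends up with the longer output vector and every child routes more than half of its output to it.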

One thing to remember

If I could choose one thing to remember, it’d be: max-pooling is a mistake (yeah, Hinton said that) because of information loss, violation of biological shape detection, etc. The fact that max-pooling works so well is a disaster (Hinton said that as well).

Papers:

How to represent part-whole hierarchies in a neural network

This paper does not describe a working system. Instead, it presents a single idea about representation which allows…

arxiv.org