Transformers evolved from attention-based RNNs when recurrence was removed from them. In layman’s terms, each layer of a Transformer compares each atom of information (words for text, patches for images) with every other atom. CNNs, on the other hand, combine convolution, which can only compare spatially adjacent atoms, with pooling, which aggregates the responses of the convolution layers.
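To make this contrast concrete, here is a minimal numpy sketch (the function names `toy_self_attention` and `toy_conv1d` are my own, and learned projections are omitted): one self-attention layer scores every atom against every other atom, producing an \(n \times n\) table of comparisons, while one convolution only mixes each atom with its immediate neighbours.

```python
import numpy as np

def toy_self_attention(x):
    """One attention layer: every atom is scored against every other atom (n x n)."""
    q, k, v = x, x, x                        # learned projections omitted for clarity
    scores = q @ k.T / np.sqrt(x.shape[1])   # (n, n) pairwise comparisons
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v                       # each output mixes all n atoms

def toy_conv1d(x, kernel_size=3):
    """One conv layer: each output only sees kernel_size spatially adjacent atoms
    (a simple averaging kernel stands in for learned weights)."""
    n, _ = x.shape
    pad = kernel_size // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([xp[i:i + kernel_size].mean(axis=0) for i in range(n)])

x = np.random.randn(8, 4)   # 8 atoms (tokens/patches), 4 features each
print(toy_self_attention(x).shape, toy_conv1d(x).shape)  # (8, 4) (8, 4)
```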
I cannot miss the resemblance of these two architectures to sorting algorithms. In this analogy, a Transformer layer is like the \(O(n^2)\) (non-parallelised) comparison sorts such as selection sort or bubble sort, while a CNN is like the \(O(n\log n)\) algorithms such as merge sort¹. This also accounts for the higher complexity of Transformers². Further, the comparisons at different receptive fields and spatial resolutions in a CNN are quite similar in structure to the recursive, hierarchical comparisons in merge sort. For example, once merge sort has established the order at one level of recursion, it never needs to compare those atoms among themselves again. Similarly, in a CNN, pooling a patch aggregates the information of all the atoms in that region, and they need not be compared amongst themselves again.
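To put rough numbers on the analogy, here is a small back-of-the-envelope sketch (purely illustrative, assuming a stride-2 pooling hierarchy): bubble sort makes about \(n^2/2\) comparisons, merge sort about \(n\log_2 n\), one attention layer computes \(n^2\) pairwise scores, and a conv-pool hierarchy needs roughly \(\log_2 n\) levels before every atom can influence every other.

```python
import math

def bubble_sort_comparisons(n):
    # selection/bubble sort compare every pair once: n*(n-1)/2
    return n * (n - 1) // 2

def merge_sort_comparisons(n):
    # at most n comparisons per level, about log2(n) levels of merging
    return int(n * math.log2(n))

def attention_comparisons(n):
    # one self-attention layer scores every atom against every atom
    return n * n

def conv_pool_levels(n):
    # halving the spatial resolution each time, ~log2(n) levels
    # until everything has been aggregated together
    return math.ceil(math.log2(n))

for n in (16, 1024):
    print(n, bubble_sort_comparisons(n), merge_sort_comparisons(n),
          attention_comparisons(n), conv_pool_levels(n))
```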
The only difference between sorting and these deep neural network architectures is that in CNNs, pooling loses information, so we cannot guarantee that every atom gets compared with every other atom³. Sorting assumes monotonicity among the atoms (a way to order the results of comparisons), which may not hold for the tasks that deep neural networks solve. In CNNs, only the pooled aggregations get “compared” with their neighbours, whereas, thanks to monotonicity, the aggregations of sorted subparts still retain all the information about their order. This problem does not occur in Transformers, which compare by brute force.
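A tiny example of this information loss (the values are hypothetical, just to illustrate the point): max-pooling maps two different patches to the same summary, so the discarded atoms can never be compared again, whereas the merge step in merge sort keeps every element and hence every comparison that might still be needed.

```python
def max_pool(patch):
    # pooling keeps only one number per patch; the rest is gone for good
    return max(patch)

def merge(left, right):
    # merging sorted lists keeps all elements; only their order is resolved
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

print(max_pool([1, 9, 2]), max_pool([9, 9, 9]))  # both give 9: lossy
print(merge([1, 4, 7], [2, 3, 9]))               # [1, 2, 3, 4, 7, 9]: lossless
```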
I feel this may be why Transformers are conceptually better than CNNs even though they may be far less efficient. Ideally, the hierarchical nature of CNNs should extract all the essence of the input, but in practice it lacks the rigour that Transformers have and, at the same time, does not enjoy the simplified assumptions about input structure that make sorting efficient.
Thanks to Dr. Kamal Gupta for the useful feedback that made this article more readable and explicit.
1. The \(\log n\) in the complexity comes from the number of layers required to make the comparison of all atoms in a CNN’s input possible. Some may say that one Transformer layer is not a fair comparison for \(O(\log n)\) convolution-pooling layers, but here I was trying to look at how much complexity is required to compare all the atoms (or at least have an illusion of it), which a Transformer can do in a single layer. ↩
2. A multi-layer Transformer will actually have an \(O(n^2 \log n)\) time complexity when \(O(\log n)\) layers are used. This is required so that higher-level concepts, and not just low-level atoms, can be compared. ↩
3. Attention pooling seems to be an \(O(n)\)-complexity compromise between standard pooling and a Transformer layer. If we were to use a fully functional Transformer layer for pooling instead, one could argue: why do we even need convolutions any more, because Attention is all you need! ↩
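As a rough sketch of the compromise mentioned in this last footnote (the function name and the random query are my own illustration, not a reference implementation): attention pooling attends over the \(n\) atoms with a single query, so it computes \(n\) scores instead of the \(n^2\) a full self-attention layer would.

```python
import numpy as np

def attention_pool(x, query):
    """Pool n atoms into one vector with a single query: O(n) scores."""
    scores = x @ query / np.sqrt(x.shape[1])   # (n,) one score per atom
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ x                         # weighted average of the atoms

x = np.random.randn(8, 4)    # 8 atoms, 4 features each
q = np.random.randn(4)       # a single query (random here, learned in practice)
print(attention_pool(x, q).shape)  # (4,)
```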