I am an ML research engineer at Ford Motor Company where I work on computer vision and machine learning for perception features in the context of automated driving. Most of my work is on camera images and LiDAR point clouds.
In my free time I enjoy playing/watching soccer, kickboxing, hiking (waterfall hikes are the best!) and practically any outdoor sport.
Before this paper, attention had already been applied to text (neural machine translation) and images (Show, Attend and Tell).
The authors propose the Transformer, a new architecture based entirely on the attention mechanism that is parallelizable and trains quickly.
In recurrent networks, the current hidden state is a function of the previous hidden state and the current input. This sequential dependency makes parallelization across time steps impossible, which becomes a bottleneck for longer sequences.
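As a minimal sketch of that bottleneck (a generic vanilla RNN with toy dimensions, not anything from the paper): each hidden state must wait for the previous one, so the loop over time steps cannot be parallelized.

```python
import numpy as np

# Toy dimensions for illustration only
hidden_dim, input_dim, seq_len = 8, 4, 16
rng = np.random.default_rng(0)
W_h = rng.normal(size=(hidden_dim, hidden_dim))
W_x = rng.normal(size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)
sequence = rng.normal(size=(seq_len, input_dim))

h = np.zeros(hidden_dim)
for x_t in sequence:  # strictly sequential: step t needs the h produced at step t-1
    h = np.tanh(W_h @ h + W_x @ x_t + b)
```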
Attention mechanisms, on the other hand, model dependencies without regard to their distance in the input or output sequences.
The Transformer architecture moves away from recurrence and relies entirely on attention to learn global dependencies between input and output.
In the Transformer, relating any two positions takes a constant number of operations regardless of their distance, with the tradeoff of reduced effective resolution since attention averages over positions. This is alleviated by multi-head attention.
Self-attention relates different positions of a single sequence in order to compute a representation of that sequence.
An autoregressive (AR) encoder-decoder model is proposed. The encoder maps the input sequence to a sequence of continuous representations. The decoder uses the encoder output to generate the output sequence one element at a time. Since the model is autoregressive, it consumes the previously generated symbols as additional input when generating the next symbol.
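A rough sketch of that autoregressive generation loop; the `encode` and `decode_step` callables here are hypothetical placeholders standing in for the encoder and decoder, not the paper's API, and the decoding is simple greedy search.

```python
def generate(encode, decode_step, src_tokens, bos_id, eos_id, max_len=50):
    """Greedy autoregressive decoding sketch.

    encode(src_tokens)          -> memory (continuous encoder representations)
    decode_step(memory, prefix) -> scores over the vocabulary for the next symbol
    Both callables are hypothetical stand-ins for a trained encoder/decoder.
    """
    memory = encode(src_tokens)        # encode the whole input sequence once
    output = [bos_id]                  # start-of-sequence symbol
    for _ in range(max_len):
        scores = decode_step(memory, output)  # condition on everything generated so far
        next_id = max(range(len(scores)), key=scores.__getitem__)  # greedy pick
        output.append(next_id)
        if next_id == eos_id:
            break
    return output
```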
The attention function is described in the paper as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
You can think of the query as the current word the model is trying to translate or decode; given that query, attention tells you where in the input to attend to find the most relevant information. The keys and values come from the input. In self-attention, Q, K, and V all come from the same sequence.
Queries and keys have the same dimension dk, while values have dimension dv.
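Here is a minimal NumPy sketch of the scaled dot-product attention the paper uses, where the compatibility function is the dot product of query and key scaled by 1/sqrt(dk); the shapes and toy sizes are illustrative, not the paper's settings.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output: (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # compatibility of each query with each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of the values

# Self-attention: queries, keys and values all come from the same sequence
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                          # 5 positions, toy model dimension 8
out = scaled_dot_product_attention(x, x, x)          # shape (5, 8)
```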
Encoder-decoder attention layers - Queries come from the previous decoder layer while keys and values come from the encoder output. Therefore, every position in the decoder can attend over all positions in the input sequence.
Self-attention layers - Queries, keys and values all come from the output of the previous layer of the encoder or decoder, depending on where the layer sits. Each position in the encoder (or decoder) can attend over all positions of the previous encoder (or decoder) layer. In the decoder, each position should only attend to positions up to and including itself. This is achieved by setting all illegal connections to minus infinity before the softmax, which maps their weights to zero.
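A sketch of that masking step, written in the same NumPy style as the attention function above (my own helper, not the paper's code): future positions get an additive score of -inf, so the softmax assigns them a weight of exactly zero.

```python
import numpy as np

def causal_mask(n):
    """(n, n) additive mask: 0 where key position <= query position, -inf for future positions."""
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_self_attention(X):
    """X: (seq_len, d_model). Decoder-style self-attention that cannot look ahead."""
    d_k = X.shape[-1]
    scores = X @ X.T / np.sqrt(d_k) + causal_mask(X.shape[0])  # -inf blocks attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # future positions end up with weight 0
    return weights @ X
```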
The feed-forward layers are applied to each position separately and identically. The same parameters are used across positions, but they differ from layer to layer. The input and output dimensions are dmodel = 512.
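A sketch of this position-wise feed-forward network, following the paper's FFN(x) = max(0, xW1 + b1)W2 + b2 with dmodel = 512 and inner dimension dff = 2048 (the random initialization here is just for illustration):

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

def position_wise_ffn(X):
    """X: (seq_len, d_model). The same W1, W2 are applied at every position."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2   # ReLU between two linear transformations

X = rng.normal(size=(10, d_model))                 # 10 positions
out = position_wise_ffn(X)                         # (10, d_model)
# Each position is transformed independently of the others:
assert np.allclose(out[3], position_wise_ffn(X[3:4])[0])
```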