Intuition behind LSTM, GRU and Attention

sagar agrawal

LSTM

LSTM stands for Long Short-Term Memory: it can take care of long-term dependencies as well as short-term dependencies by carrying information forward through a cell state vector.

Cell State

In the image above, C_t-1 denotes the cell state coming from the previous cell. We perform an element-wise multiplication on this vector and then add new values to it element-wise to create the new cell state (context) vector, denoted C_t.

Say that at time step t the cell state C_t-1 has consolidated information from all previous time steps and arrives as input to our LSTM cell. Suppose that at the element-wise multiplication junction we multiply the cell state by a vector of all 1’s, and at the element-wise addition junction we add a vector of all 0’s. In that case we neither eliminate any information nor add any new information to the cell state vector, so the new cell state at time step t is equal to the previous cell state at time step t-1. Recalling ResNets, the cell state in this case acts like a skip connection: information can jump over layers unchanged.
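A tiny numerical sketch (NumPy, with made-up values) makes this concrete: multiplying the previous cell state by all 1’s and adding all 0’s leaves it untouched, which is exactly the skip-connection behaviour described above.

```python
import numpy as np

C_prev = np.array([0.7, -1.2, 0.3])      # previous cell state C_{t-1} (illustrative values)
keep_everything = np.ones_like(C_prev)   # element-wise multiply by all 1's -> forget nothing
add_nothing = np.zeros_like(C_prev)      # element-wise add all 0's -> write nothing new

C_t = keep_everything * C_prev + add_nothing   # new cell state
assert np.allclose(C_t, C_prev)                # identical to the old one: a skip connection
```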

Let us understand this context vector using the analogy of a whiteboard. We have a fixed-size state vector, which is like a whiteboard: as time passes we keep writing things on it. But the whiteboard (state or context vector) is of fixed size, and if we just keep writing on it, after some time everything becomes so messy that it is very difficult to extract any information from it.

To avoid this situation, the LSTM follows the strategy below:

  • Selectively Write on board
  • Selectively Read the already written content
  • Selectively Forget (Erase) some content

Step-by-Step LSTM Walk Through

Forget Gate :

Forget Gate

The forget gate in an LSTM acts like a valve that regulates the flow of information from the previous cell state vector to the new cell state vector, thus implementing the selective forget operation. It answers the question: “How much of the previous information should be retained, and how much should be forgotten?”

It takes as input the concatenation of the previous hidden state and the current input, which is then passed through a fully connected layer. The output of this layer goes into a sigmoid function, which squashes every value to lie between 0 and 1, one value for each number in the cell state C_t-1.

This vector of values between 0 and 1 is then multiplied element-wise with the previous cell state to selectively forget/erase some of its information.
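As a rough sketch of the forget gate (NumPy; the toy sizes and the weight names W_f, b_f are assumptions for illustration, since in a real LSTM these are learned parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3                       # assumed toy sizes
rng = np.random.default_rng(0)
W_f = rng.standard_normal((hidden_size, hidden_size + input_size))  # forget-gate weights
b_f = np.zeros(hidden_size)                          # forget-gate bias

h_prev = rng.standard_normal(hidden_size)            # previous hidden state h_{t-1}
x_t = rng.standard_normal(input_size)                # current input x_t
C_prev = rng.standard_normal(hidden_size)            # previous cell state C_{t-1}

f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)  # values in (0, 1), one per cell-state entry
C_after_forget = f_t * C_prev                        # selectively erase parts of the old cell state
```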

Input Gate :

After the selective forget, we perform the selective write. This gate simply tells us: “How much additional information should be added to the cell state, based on the input at the current time step?”

Input Gate
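A matching sketch of the selective write (same assumed toy sizes; the names W_i and W_c for the input-gate and candidate weights are illustrative): the input gate i_t scales a tanh candidate vector before it is added to the cell state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(1)
W_i = rng.standard_normal((hidden_size, hidden_size + input_size))  # input-gate weights
W_c = rng.standard_normal((hidden_size, hidden_size + input_size))  # candidate ("new content") weights
b_i, b_c = np.zeros(hidden_size), np.zeros(hidden_size)

h_prev = rng.standard_normal(hidden_size)       # previous hidden state h_{t-1}
x_t = rng.standard_normal(input_size)           # current input x_t
C_after_forget = rng.standard_normal(hidden_size)   # cell state after the forget step above

z = np.concatenate([h_prev, x_t])
i_t = sigmoid(W_i @ z + b_i)           # how much of the candidate to write, in (0, 1)
C_tilde = np.tanh(W_c @ z + b_c)       # candidate new content, in (-1, 1)
C_t = C_after_forget + i_t * C_tilde   # selectively add new information to the cell state
```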

Output Gate :

Finally, we need to decide what we’re going to output. This output is based on our current input, the previous output and a filtered version of the current cell state. First, the concatenated vector [current input, previous output] is passed through a fully connected layer whose output feeds a sigmoid layer, which decides what parts of the cell state we’re going to output. Then we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

Output Gate
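And a sketch of the selective read (weights W_o, b_o again assumed for illustration): the output gate filters a tanh-squashed cell state to produce the new hidden state h_t.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(2)
W_o = rng.standard_normal((hidden_size, hidden_size + input_size))  # output-gate weights
b_o = np.zeros(hidden_size)

h_prev = rng.standard_normal(hidden_size)   # previous hidden state h_{t-1}
x_t = rng.standard_normal(input_size)       # current input x_t
C_t = rng.standard_normal(hidden_size)      # updated cell state from the previous step

o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)  # which parts of the cell state to expose
h_t = o_t * np.tanh(C_t)                    # new hidden state / output of the LSTM cell
```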

GRU

GRU stands for Gated Recurrent Unit. It is a simplified version inspired by the LSTM, which is faster to train and comparable in power to the LSTM.

  • It combines the forget and input gates into a single “update gate”.
  • It also merges the cell state and the hidden state into one vector.
  • It gates the memory twice: a reset gate decides how much of the old state to use (together with the new input) when forming the candidate, and the update gate decides how much of that candidate versus the old state goes into the final output.
  • The old hidden state (together with the input) is therefore used both to compute its own update and to decide how much of that update to apply (see the sketch below).
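A minimal sketch of one GRU step (NumPy; the toy sizes and weight names W_z, W_r, W_h are assumptions, and biases are omitted for brevity). Note how the update gate z_t plays the role of both the forget and input gates, and how the hidden state h_t doubles as the memory:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(3)
W_z = rng.standard_normal((hidden_size, hidden_size + input_size))  # update-gate weights
W_r = rng.standard_normal((hidden_size, hidden_size + input_size))  # reset-gate weights
W_h = rng.standard_normal((hidden_size, hidden_size + input_size))  # candidate weights

h_prev = rng.standard_normal(hidden_size)   # previous hidden state (also the "memory")
x_t = rng.standard_normal(input_size)       # current input

z_t = sigmoid(W_z @ np.concatenate([h_prev, x_t]))            # update gate: forget + input in one
r_t = sigmoid(W_r @ np.concatenate([h_prev, x_t]))            # reset gate: how much old state to use
h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate, from reset old state + input
h_t = (1.0 - z_t) * h_prev + z_t * h_tilde                    # final output: blend of old state and candidate
```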

Encoder-Decoder :

Encoder-Decoder Architecture

Encoder : The encoder reads the input once and encodes it into a latent vector that consolidates the information in the input.

Decoder : The decoder takes this latent vector as input and decodes it to generate the output.

The above architecture is the basic seq2seq model, and as such it cannot be used for complex applications: a single hidden vector cannot encapsulate all the information from a sequence.
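To make the “encode once into a single latent vector, then decode” idea concrete, here is a minimal sketch under assumptions (PyTorch, GRU layers, greedy decoding, made-up sizes and a hypothetical start-token id); it is an illustration, not a production seq2seq model:

```python
import torch
import torch.nn as nn

vocab_size, emb_size, hidden_size = 100, 32, 64   # assumed toy sizes

embed = nn.Embedding(vocab_size, emb_size)
encoder = nn.GRU(emb_size, hidden_size, batch_first=True)
decoder = nn.GRU(emb_size, hidden_size, batch_first=True)
to_vocab = nn.Linear(hidden_size, vocab_size)

src = torch.randint(0, vocab_size, (1, 5))        # one source sentence of 5 token ids

# Encoder: read the whole input once; keep only the final hidden state (the latent vector).
_, latent = encoder(embed(src))                   # latent has shape (1, 1, hidden_size)

# Decoder: start from that single latent vector and generate one token at a time.
token = torch.zeros(1, 1, dtype=torch.long)       # hypothetical <start> token with id 0
state = latent
outputs = []
for _ in range(5):                                # decode a fixed number of steps for this sketch
    out, state = decoder(embed(token), state)
    token = to_vocab(out).argmax(dim=-1)          # greedy pick of the next token id
    outputs.append(token.item())
```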

Attention (Intuition) :

Attention Mechanism

To understand attention, let us first ask: how do humans translate?

Humans produce each word of the output by focusing on certain words of the input.

Consider the following example of language translation:

Output : I am going home.

Input : Mai ghar jaa raha hu.

For predicting ‘I’ as output, our model only needs to focus on the word ‘Mai’, so we can say that for predicting ‘I’, ‘Mai’ has a weight of 1 and everything else has a weight of zero.

For predicting ‘going’ as output, our model needs to focus equally on the words ‘jaa’ and ‘raha’, so we can say that for predicting ‘going’, ‘jaa’ and ‘raha’ each have a weight of 0.5 and everything else has a weight of zero.

Thus, the analogy here is that instead of encoding the sentence once, handing it to the decoder and saying “now you figure out how to generate the output”, it makes more sense to give the decoder a refined encoding at every time step and say “at this particular time step, this is the only part of the input you care about; ignore the rest”.
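To make those numbers concrete, here is a tiny sketch (NumPy, with made-up encoder output vectors) that forms the refined encoding for the word ‘going’ as a weighted sum of the encoder outputs, using the hand-set weights from the example above:

```python
import numpy as np

words = ["Mai", "ghar", "jaa", "raha", "hu"]
rng = np.random.default_rng(4)
H = rng.standard_normal((5, 4))        # one (made-up) encoder output vector h_j per input word

# Hand-set attention weights for predicting "going": all focus on "jaa" and "raha".
alpha_going = np.array([0.0, 0.0, 0.5, 0.5, 0.0])

context_going = alpha_going @ H        # weighted sum of encoder outputs = refined encoding
```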

Attention (Overview) :

Our attention model has a single RNN encoder, again with 5 time steps (one per input word). We denote the encoder’s input vectors by x_1, …, x_5 and its output vectors by h_1, …, h_5. The attention mechanism sits between the encoder and the decoder.

From the Encoder-Decoder architecture we know that the decoder starts from the single latent vector produced by the encoder; with attention, the decoder additionally receives, at every output time step, a weighted combination (a context vector) of all the encoder outputs h_1, …, h_5, as sketched below.
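In practice these weights are not hand-set; they are computed by a small scoring function followed by a softmax. A minimal sketch under assumptions (NumPy, additive/Bahdanau-style scoring, made-up sizes and weight names W_enc, W_dec, v):

```python
import numpy as np

hidden_size, n_steps = 4, 5
rng = np.random.default_rng(5)
H = rng.standard_normal((n_steps, hidden_size))   # encoder outputs h_1 ... h_5
s_prev = rng.standard_normal(hidden_size)         # current decoder state

# Assumed learnable parameters of the (additive) attention scorer.
W_enc = rng.standard_normal((hidden_size, hidden_size))
W_dec = rng.standard_normal((hidden_size, hidden_size))
v = rng.standard_normal(hidden_size)

scores = np.tanh(H @ W_enc.T + s_prev @ W_dec.T) @ v   # one relevance score per encoder step
alpha = np.exp(scores) / np.exp(scores).sum()          # softmax -> attention weights summing to 1
context = alpha @ H                                    # context vector fed to the decoder at this step
```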
