This is part two of a three-part series on getting started with RNNs using PyTorch. Part one can be accessed at Building your first RNN - Part 1. Part three is available at Building your first RNN - Part 3.

Having described the problem and built the dataset in Part 1, let’s finally start building our model. It’s a good idea to first have a general overview of what we aim to achieve. One might think of something along the following lines.

On a very high level, the first step in a general workflow will be to feed in inputs to an LSTM to get the predictions. Next, we pass on the predictions along with the targets to the loss function to calculate the loss. Finally, we backpropagate through the loss to update our model’s parameters.

Hmm, that sounds easy, right? But how do you actually make it work? Let’s dissect this step by step. We’ll first identify the components needed to build our model, and finally put them together as a single piece to make it work.

### The PyTorch paradigm

… before diving in, it’s important to know a couple of things. PyTorch provides implementations for most of the commonly used entities, from layers such as LSTMs, CNNs and GRUs to optimizers like SGD and Adam (isn’t that the whole point of using PyTorch in the first place?!). The general paradigm for using any of these entities is to first create an instance of `torch.nn.entity` (or `torch.optim.entity` for optimizers) with some required parameters. As an example, here’s how we instantiate an `lstm`.

```
import torch

# Step 1
lstm = torch.nn.LSTM(input_size=5, hidden_size=10, batch_first=True)
```

Next, we call this object with the inputs as parameters when we actually want to run an LSTM over some inputs. This is shown in the last line below.

```
lstm_in = torch.rand(40, 20, 5)
hidden_in = (torch.zeros(1, 40, 10), torch.zeros(1, 40, 10))
# Step 2
lstm_out, lstm_hidden = lstm(lstm_in, hidden_in)
```
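To make the two steps concrete, here is a minimal sketch that repeats them in one runnable snippet and checks the resulting shapes. The sizes are the ones used above; the names `h_n` and `c_n` for the final hidden and cell states are only illustrative.

```python
import torch

# Step 1: create the module once (5 input features, hidden state of size 10).
lstm = torch.nn.LSTM(input_size=5, hidden_size=10, batch_first=True)

# A batch of 40 sequences, each 20 steps long, with 5 features per step.
lstm_in = torch.rand(40, 20, 5)
# Initial hidden and cell states, each of size (num_layers, batch, hidden_size).
hidden_in = (torch.zeros(1, 40, 10), torch.zeros(1, 40, 10))

# Step 2: call the module on the inputs.
lstm_out, (h_n, c_n) = lstm(lstm_in, hidden_in)

print(lstm_out.shape)  # torch.Size([40, 20, 10])
print(h_n.shape)       # torch.Size([1, 40, 10])
```

Note that `batch_first=True` only affects the layout of the input and output tensors; the hidden and cell states keep the batch in their second dimension.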

This two-step process will be seen throughout this tutorial and elsewhere. Below, we’ll go through step 1 of all the modules. We’ll connect the dots at a later stage.

Getting back to code now, let’s dissect our ‘high level’ understanding again.

## 1. Prepare inputs

… *feed in inputs* to an LSTM to get the predictions …

To feed in inputs, well, we first need to prepare the inputs. Remember the embedding matrix we described earlier? We’ll use it to convert the pair of indices we get from `dataset()` into the corresponding embedding vectors. Following the general paradigm, we create an instance of `torch.nn.Embedding`.

The docs list two required parameters: `num_embeddings` (the size of the dictionary of embeddings) and `embedding_dim` (the size of each embedding vector). In our case, these are `vocab_size` and `embedding_dim` respectively.

```
# Step 1
embed = torch.nn.Embedding(vocab_size, embedding_dim)
```

Later on, we can convert any input tensor `encrypted` containing indices of the encrypted input (like the one we get from `dataset()`) into the corresponding embedding vectors by simply calling `embed(encrypted)`.

As an example, the word `SECRET` becomes `ERPDRF` after encryption, and the letters of `ERPDRF` correspond to the indices `[4, 17, 15, 3, 17, 5]`. If `encrypted` is `torch.tensor([4, 17, 15, 3, 17, 5])`, then `embed(encrypted)` would return something similar to the following.

```
# Step 2
>>> encrypted = torch.tensor([4, 17, 15, 3, 17, 5])
>>> embedded = embed(encrypted)
>>> print(embedded)
tensor([[ 0.2666,  2.1146,  1.3225,  1.3261, -2.6993],
        [-1.5723, -2.1346,  2.6892,  2.7130,  1.7636],
        [-1.9679, -0.8601,  3.0942, -0.8810,  0.6042],
        [ 3.6624, -0.3556, -1.7088,  1.4370, -3.2903],
        [-1.5723, -2.1346,  2.6892,  2.7130,  1.7636],
        [-1.8041, -1.8606,  2.5406, -3.5191,  1.7761]])
```
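The embedding layer is essentially a lookup table, which the following sketch tries to illustrate. It assumes a `vocab_size` of 27 and an `embedding_dim` of 5, the values used in the examples of this post; the actual numbers in the vectors are randomly initialised and will differ on every run.

```python
import torch

vocab_size, embedding_dim = 27, 5  # assumed sizes, matching the examples here

embed = torch.nn.Embedding(vocab_size, embedding_dim)

# Indices of the letters in 'ERPDRF'.
encrypted = torch.tensor([4, 17, 15, 3, 17, 5])
embedded = embed(encrypted)

print(embedded.shape)  # torch.Size([6, 5])
# Both occurrences of 'R' (index 17) map to the same embedding vector.
print(torch.equal(embedded[1], embedded[4]))  # True
# Each output row is simply the matching row of the embedding matrix.
print(torch.equal(embedded[0], embed.weight[4]))  # True
```

This is why the second and fifth rows of the printed tensor above are identical: both come from the letter `R`.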

## 2. Build an LSTM

… feed in inputs *to an LSTM* to get the predictions …

Next, we need to create an LSTM. We do this in a similar fashion by creating an instance of `torch.nn.LSTM`. This time, the docs list the required parameters as `input_size` (the number of expected features in the input) and `hidden_size` (the number of features in the hidden state). Since LSTMs typically operate on variable-length sequences, `input_size` refers to the size of each entity in the input sequence, not to the length of the sequence. In our case, this means `embedding_dim`. This might sound counter-intuitive, but it makes sense once you think about it: the LSTM consumes one embedding vector per time step.

`hidden_size`, as the name suggests, is the size of the hidden state of the RNN. In the case of an LSTM, this refers to the size of both the `cell_state` and the `hidden_state`. Note that the hidden size is a hyperparameter and *can be different* from the input size. colah’s blog post doesn’t explicitly mention this, but the equations in the PyTorch docs on LSTMCell should make it clear. To summarize the discussion above, here is how we instantiate the LSTM.

```
# Step 1
lstm = torch.nn.LSTM(embedding_dim, hidden_dim)
```

### A note on dimensionality

During step 2 of the general paradigm, `torch.nn.LSTM` (with the default `batch_first=False`) expects the input to be a 3D tensor of size `(seq_len, batch, embedding_dim)`, and returns an output tensor of size `(seq_len, batch, hidden_dim)`. We’ll only feed in one input at a time, so `batch` is always `1`.

As an example, consider the input-output pair `('ERPDRF', 'SECRET')`. Using an `embedding_dim` of 5, the 6-letter-long input `ERPDRF` is transformed into an input tensor of size `6 x 1 x 5`. If `hidden_dim` is 10, the input is processed by the LSTM into an output tensor of size `6 x 1 x 10`.
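The shapes in this example can be checked with a short sketch. It assumes a `vocab_size` of 27, and it adds the batch dimension with `unsqueeze(1)`, which is just one possible way of doing it and not necessarily the one used later in this series.

```python
import torch

vocab_size, embedding_dim, hidden_dim = 27, 5, 10  # assumed sizes

embed = torch.nn.Embedding(vocab_size, embedding_dim)
lstm = torch.nn.LSTM(embedding_dim, hidden_dim)

encrypted = torch.tensor([4, 17, 15, 3, 17, 5])  # 'ERPDRF'

# embed() returns (seq_len, embedding_dim); add a batch dimension of size 1.
lstm_in = embed(encrypted).unsqueeze(1)
print(lstm_in.shape)   # torch.Size([6, 1, 5])

# When no initial state is passed, the LSTM starts from all-zero states.
lstm_out, _ = lstm(lstm_in)
print(lstm_out.shape)  # torch.Size([6, 1, 10])
```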

Generally, the LSTM is expected to run over the input sequence character by character and emit a probability distribution over all the letters in the vocabulary. So for every input character, we expect a `vocab_size`-dimensional output tensor, where `vocab_size` is 27 (the size of the vocabulary). The most probable letter is then chosen as the output at every timestep.

If you have a look at the output of the LSTM on the example pair `('ERPDRF', 'SECRET')` above, you can instantly make out that the dimensions are not right. The output dimension is `6 x 1 x 10`, which means that for every character, the output is a 10-dimensional tensor instead of the expected 27.

So how do we solve this?

## 3. Transform the outputs

… feed in inputs to an LSTM to *get the predictions* …

The general workaround is to transform the `hidden_dim`-dimensional tensor into a `vocab_size`-dimensional tensor through what is called an affine (or linear) transform. Definitions aside, the idea is to use matrix multiplication to get the desired dimensions.

Let’s say the LSTM produces an output tensor of size `seq_len x batch x hidden_dim`. Recall that we only feed in one example at a time, so `batch` is always `1`. This essentially gives us an output tensor of size `seq_len x hidden_dim`. Now if we multiply this output tensor with another tensor of size `hidden_dim x vocab_size`, the resultant tensor has a size of `seq_len x vocab_size`. Isn’t this exactly what we wanted?

To implement the linear layer, … you guessed it! We create an instance of `torch.nn.Linear`. This time, the docs list the required parameters as `in_features` (size of each input sample) and `out_features` (size of each output sample). Note that this only transforms the last dimension of the input tensor. So for example, if we pass in an input tensor of size `(d1, d2, d3, ..., dn, in_features)`, the output tensor will have the same size for all but the last dimension, and will be a tensor of size `(d1, d2, d3, ..., dn, out_features)`.

With this knowledge in mind, it’s easy to figure out that `in_features` is `hidden_dim`, and `out_features` is `vocab_size`. The linear layer is initialised below.

```
# Step 1
linear = torch.nn.Linear(hidden_dim, vocab_size)
```
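As a quick check that the linear layer only touches the last dimension, here is a small sketch. It uses a random tensor as a stand-in for the LSTM output of a 6-letter word, and assumes a `hidden_dim` of 10 and a `vocab_size` of 27 as in the examples above.

```python
import torch

hidden_dim, vocab_size = 10, 27  # assumed sizes

linear = torch.nn.Linear(hidden_dim, vocab_size)

# Stand-in for the LSTM output: (seq_len, batch, hidden_dim).
lstm_out = torch.rand(6, 1, hidden_dim)

scores = linear(lstm_out)
print(scores.shape)     # torch.Size([6, 1, 27])

# The most probable letter at each time step is the index of the largest score.
predicted = scores.argmax(dim=-1)
print(predicted.shape)  # torch.Size([6, 1])
```

Every time step now carries one score per letter of the vocabulary, which is the shape we were after.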

With this we’re pretty much done with the essentials. Time for some learning!

## 4. Calculate the loss

Next, we pass on the predictions along with the targets to the loss function to calculate the loss.

If you think about it, the LSTM is essentially performing multi-class classification at every time step by choosing one letter out of the 27 characters of the vocabulary. A common choice in such a case is the cross entropy loss function, `torch.nn.CrossEntropyLoss`. We initialize this in a similar manner.

```
loss_fn = torch.nn.CrossEntropyLoss()
```
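A rough sketch of how the loss function is called may help here. `torch.nn.CrossEntropyLoss` takes the raw scores (it applies the softmax internally) together with the target class indices. The scores below are random stand-ins, and the target indices assume the same letter-to-index mapping as the `ERPDRF` example above.

```python
import torch

vocab_size = 27  # assumed vocabulary size
loss_fn = torch.nn.CrossEntropyLoss()

# Stand-in scores for a 6-letter word: (seq_len, vocab_size).
scores = torch.rand(6, vocab_size)
# One target letter index per time step: 'SECRET' under the assumed mapping.
targets = torch.tensor([18, 4, 2, 17, 4, 19])

loss = loss_fn(scores, targets)
print(loss.shape)  # torch.Size([]): a single scalar, averaged over the 6 steps
```

Each time step is treated as an independent classification problem, and by default the per-step losses are averaged into one number.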

You can read more about cross entropy loss in the excellent blog post by Rob DiPietro.

## 5. Optimize

Finally, we backpropagate through the loss to update our model’s parameters.

A popular choice is the Adam optimizer. Here’s how we initialize it. Notice that almost all torch layers have this convenient way of getting all their parameters by calling `module.parameters()`.

```
optimizer = torch.optim.Adam(list(embed.parameters()) + list(lstm.parameters())
+ list(linear.parameters()), lr=0.001)
```
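To give a feel for what the optimizer does with those parameters, here is a rough sketch of a single update step on the `('ERPDRF', 'SECRET')` pair. The sizes, the `unsqueeze`/`squeeze` calls and the target indices are assumptions made for illustration; this is not necessarily how the series wires the pieces together.

```python
import torch

vocab_size, embedding_dim, hidden_dim = 27, 5, 10  # assumed sizes
embed = torch.nn.Embedding(vocab_size, embedding_dim)
lstm = torch.nn.LSTM(embedding_dim, hidden_dim)
linear = torch.nn.Linear(hidden_dim, vocab_size)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    list(embed.parameters()) + list(lstm.parameters()) + list(linear.parameters()),
    lr=0.001,
)

encrypted = torch.tensor([4, 17, 15, 3, 17, 5])  # 'ERPDRF'
original = torch.tensor([18, 4, 2, 17, 4, 19])   # 'SECRET'

weights_before = linear.weight.detach().clone()

optimizer.zero_grad()                  # clear gradients from any previous step
lstm_out, _ = lstm(embed(encrypted).unsqueeze(1))
scores = linear(lstm_out).squeeze(1)   # (seq_len, vocab_size)
loss = loss_fn(scores, original)
loss.backward()                        # compute gradients for all parameters
optimizer.step()                       # nudge the parameters along them

print(torch.equal(weights_before, linear.weight))  # False: the weights moved
```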

To summarize, here’s how we initialize the required layers.
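A minimal consolidated sketch of the five initialisation steps from this post is given below. The concrete sizes (27, 5 and 10) are assumptions taken from the examples above; in practice they come from the dataset and from your choice of hyperparameters.

```python
import torch

# Illustrative sizes, matching the examples in this post.
vocab_size, embedding_dim, hidden_dim = 27, 5, 10

embed = torch.nn.Embedding(vocab_size, embedding_dim)  # 1. prepare inputs
lstm = torch.nn.LSTM(embedding_dim, hidden_dim)        # 2. build an LSTM
linear = torch.nn.Linear(hidden_dim, vocab_size)       # 3. transform the outputs
loss_fn = torch.nn.CrossEntropyLoss()                  # 4. calculate the loss
optimizer = torch.optim.Adam(                          # 5. optimize
    list(embed.parameters()) + list(lstm.parameters()) + list(linear.parameters()),
    lr=0.001,
)
```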

We’ll wrap this up and consolidate the network in Part 3.