Date: 2023-04-07 10:03

CPEN400D: Deep Learning

Programming Assignment 2

Released: 2023 Mar. 24

In this assignment, you will implement a Transformer [1] architecture from scratch. A Colab

notebook is provided with this documentation, which you need to complete. Please pay attention

to the following notes:

The notebook is available at this link.

You are only allowed to modify the code inside the START/END blocks.

For each module, one test case is provided at the end of the notebook. You may use the test cases to get a sense of the input/output formats and as a sanity check for your implementation. For each test case, your code must give a response within 10 seconds.

Make sure your code is readable (leave comments and write self-documenting code).

Unless otherwise specified, you are not allowed to use any module from torch.nn.

For each subsection, there is a corresponding START/END block in the provided Colab notebook that you need to fill in.

All the code must be written by yourself, and you are not allowed to copy any code from

other sources such as your classmates or the internet.

If you use ChatGPT or other LLMs to help finish the assignment, please clearly mark the questions for which you used them. We may require you to submit your prompts, so please make sure you store them properly.

1 Transformer

In this problem, we will implement a Transformer from scratch and train it on a toy task. The setup

of the problem is as follows. We are given training/validation/test sets of size $N_{\text{train}}$, $N_{\text{val}}$, $N_{\text{test}}$, respectively. Each dataset contains a set of strings over the four English letters “c”, “p”, “e”, and “n”. The task is to decide whether a string contains “cpen” as a contiguous substring. For instance, “ecpenp” contains this substring and has label 1, while “cpeeen” does not contain the substring and has label 0. Each string $s_i$ ($1 \le i \le N$) in each dataset has length $n$. For exact details of this task, you may refer to the implementation of SubstringDataset

in the notebook. An illustrative figure of this task is shown in Figure 1. Throughout this assignment,

we omit the batch size in the equations for simplicity but your implementations must take into

account the batch size.

UBC CPEN400D 2022 Winter Term 2 with Professor Renjie Liao Assignment 2

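[Figure 1 diagram: the token sequence “[CLS] e c p e n p” is passed through the Transformer, and the output at the “[CLS]” position is the label 1.]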

Figure 1: Substring matching using a Transformer. A string is first tokenized into a sequence of

characters. A “[CLS]” token is added to the beginning of the sequence. The sequence is encoded

and passed through a Transformer. Finally, the output of the “[CLS]” token is decoded as the class

label. The label is 1 if the string is matched, 0 otherwise.

Preprocessing and Tokenization

First, we need to convert our dataset, which is represented as string data, to a vectorized format.

To this end, we first need to break the string into small pieces. This process is called Tokenization.

Next, we have to convert each of the small pieces into a vectorized format, usable by a neural

network.

1.1 [10pts] Implement the tokenization and vectorization functionality in the Tokenizer class. This function takes in a string and (1) splits it into a list of characters (or tokens, in Transformer terminology), (2) optionally adds a “[CLS]” token to the beginning of the list (later, we will see why we need this), and (3) converts each token into a one-hot vector and returns the resulting matrix. For a string of length $n$, this function must return a row-wise one-hot matrix $X \in \mathbb{R}^{(n+1) \times d_{\text{voc}}}$, where $d_{\text{voc}} = 4 + 1 = 5$ is the size of our token vocabulary.
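
The three steps above can be sketched in plain Python as follows. This is only an illustration under assumptions: the vocabulary order in `VOCAB`, the function name `tokenize`, and the `add_cls` flag are hypothetical, and the notebook's Tokenizer class may index tokens differently and return a tensor rather than nested lists.

```python
# Hypothetical vocabulary order; the notebook's Tokenizer may use another one.
VOCAB = ["[CLS]", "c", "p", "e", "n"]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}


def tokenize(string, add_cls=True):
    """Split a string into characters and return row-wise one-hot vectors."""
    tokens = list(string)                 # (1) split into characters
    if add_cls:
        tokens = ["[CLS]"] + tokens       # (2) optionally prepend "[CLS]"
    one_hot = []
    for token in tokens:                  # (3) one row per token
        row = [0.0] * len(VOCAB)
        row[TOKEN_TO_ID[token]] = 1.0
        one_hot.append(row)
    return one_hot


X = tokenize("ecpenp")
print(len(X), len(X[0]))  # 7 5
```

For a string of length 6, the result has 7 rows (one extra for “[CLS]”) and 5 columns, matching the $(n+1) \times d_{\text{voc}}$ shape stated above.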

Positional Encoding

The Transformer model is designed to process sequences of tokens, but it lacks any inherent understanding of the order or position of those tokens within the sequence. This means that, on its own, the model cannot tell apart two inputs that contain the same tokens in a different order. To overcome this limitation, the Transformer architecture introduces positional encoding.

Positional encoding is a way of adding information about the position of each token in the

sequence to the input embeddings before passing them through the Transformer layers. This allows

the model to distinguish between tokens based on their position in the sequence.

There are various forms of positional encoding, such as the sinusoidal positional encoding introduced in the original Transformer paper. Here, we will implement another version, learnable positional encoding. The idea behind learnable positional encoding is to dedicate a learnable vector to each position index $i$, instead of a fixed sinusoidal vector.

1.2 [5pts] Implement the (absolute) learnable positional encoding module. This module has a learnable weight matrix $W_{\text{pos}} \in \mathbb{R}^{d_{\text{maxLen}} \times d_{\text{model}}}$. The $i$-th row ($1 \le i \le d_{\text{maxLen}}$) of this matrix is

a learnable vector corresponding to the i-th position in a sequence. This module applies positional

encoding to a sequence by element-wise adding rows of this matrix to their corresponding position

in the input.
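
A minimal sketch of this module is given below, assuming PyTorch tensors and no torch.nn modules. The class name, the small random initialisation, and the `(batch, n, d_model)` input layout are assumptions made for illustration and are not taken from the notebook.

```python
import torch


class LearnablePositionalEncoding:
    """Adds one learnable row of W_pos to each position of the input."""

    def __init__(self, max_len, d_model):
        # Assumed initialisation: small random values, tracked by autograd.
        self.W_pos = (0.02 * torch.randn(max_len, d_model)).requires_grad_()

    def __call__(self, x):
        # x: (batch, n, d_model) with n <= max_len
        n = x.shape[1]
        # The first n rows broadcast over the batch dimension.
        return x + self.W_pos[:n]


pos_enc = LearnablePositionalEncoding(max_len=10, d_model=4)
encoded = pos_enc(torch.zeros(2, 3, 4))
print(encoded.shape)  # torch.Size([2, 3, 4])
```

Because the input here is all zeros, each batch element of `encoded` is simply the first three rows of `W_pos`.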

Multi-Head Attention

Attention is the key component of the Transformer, which we will implement next. Attention

allows the model to attend to different parts of the input sequence with different weights, enabling

it to capture complex relationships between the input tokens. This module consists of numHeads heads (denoted $H$ below). Each head $h$ has three weight matrices $W_{K,h}, W_{Q,h}, W_{V,h} \in \mathbb{R}^{d_{\text{model}} \times d_h}$, where $d_h = d_{\text{model}} / H$. Multi-head attention takes three matrices $X_K, X_Q, X_V \in \mathbb{R}^{n \times d_{\text{model}}}$ as input, where $n$ is the number of tokens. Each head is computed as follows:

$$\text{head}_h = \text{Softmax}\left(\frac{(X_Q W_{Q,h})(X_K W_{K,h})^\top}{\sqrt{d_h}}\right) X_V W_{V,h},$$

where Softmax is a row-wise softmax operator. The result of each head is an $n \times d_h$ matrix. To merge the outputs of the heads and obtain a final output of size $n \times d_{\text{model}}$, the heads are concatenated and multiplied by a weight matrix $W_O \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$:

$$\text{Attention}(X_K, X_Q, X_V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_H)\, W_O.$$

In this assignment (and in the Transformer encoder in general), we use self-attention, where $X_K = X_Q = X_V = X$.

1.3 [20pts] Implement the multi-head attention module.
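
One way to compute all heads at once is sketched below. It assumes the scaled dot-product form from the original Transformer paper and assumes that the per-head matrices are stacked column-wise into single `(d_model, d_model)` matrices; the function name and argument layout are hypothetical, and the notebook's module may organise its weights differently.

```python
import math

import torch


def multi_head_attention(X_q, X_k, X_v, W_q, W_k, W_v, W_o, num_heads):
    """X_*: (batch, n, d_model). W_*: (d_model, d_model), heads stacked column-wise."""
    batch, n, d_model = X_q.shape
    d_h = d_model // num_heads

    def split_heads(t):
        # (batch, n, d_model) -> (batch, num_heads, n, d_h)
        return t.reshape(batch, -1, num_heads, d_h).transpose(1, 2)

    Q = split_heads(X_q @ W_q)
    K = split_heads(X_k @ W_k)
    V = split_heads(X_v @ W_v)

    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_h)   # (batch, H, n, n)
    weights = torch.softmax(scores, dim=-1)             # row-wise softmax
    heads = weights @ V                                 # (batch, H, n, d_h)

    # Concatenate the heads along the feature dimension, then mix with W_o.
    concat = heads.transpose(1, 2).reshape(batch, n, d_model)
    return concat @ W_o


torch.manual_seed(0)
X = torch.randn(2, 6, 8)
weights_list = [torch.randn(8, 8) for _ in range(4)]
out = multi_head_attention(X, X, X, *weights_list, num_heads=2)
print(out.shape)  # torch.Size([2, 6, 8])
```

Reshaping the last dimension into `(num_heads, d_h)` assigns columns `h * d_h` to `(h + 1) * d_h - 1` to head `h`, so the result matches a loop over heads followed by concatenation.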

1.5 [5pts] An alternative way of imposing positional encoding in the Transformer is to apply it inside the attention Softmax operator. That is, modify the equation of each head as follows:

$$\text{head}_h = \text{Softmax}\left(\frac{(X_Q W_{Q,h})(X_K W_{K,h})^\top}{\sqrt{d_h}} + M_h\right) X_V W_{V,h}, \qquad [M_h]_{ij} = m_{i-j,h}.$$

In each head, this positional encoding adds a scalar value $m_{i-j,h}$ to the attention score of each pair of tokens with indices $i, j$ ($1 \le i, j \le n$). Since only the relative position of each pair of tokens matters in this positional encoding, it is called Relative Positional Encoding (RPE). Implement relative positional encoding (RPE) in the multi-head attention module.
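
The per-head bias can be built by indexing a table of learnable scalars with the offset $i - j$. The sketch below assumes a table `m` of shape `(num_heads, 2 * max_len - 1)`, so that every offset from `-(max_len - 1)` to `max_len - 1` has its own entry; this storage layout and the function name are assumptions, not the notebook's interface.

```python
import torch


def relative_position_bias(m, n):
    """m: (num_heads, 2 * max_len - 1); returns a (num_heads, n, n) bias.

    The entry for offset i - j is stored at column (i - j) + max_len - 1.
    """
    max_len = (m.shape[1] + 1) // 2
    idx = torch.arange(n)
    rel = idx[:, None] - idx[None, :]      # rel[i, j] = i - j
    return m[:, rel + max_len - 1]         # advanced indexing -> (H, n, n)


# Toy table for 2 heads and max_len = 4 (7 possible offsets per head).
m = torch.arange(14, dtype=torch.float32).reshape(2, 7)
bias = relative_position_bias(m, n=3)
print(bias.shape)  # torch.Size([2, 3, 3])
```

The returned tensor can be added to attention scores of shape `(batch, num_heads, n, n)` before the softmax, since it broadcasts over the batch dimension. Each head's bias is constant along its diagonals, which reflects the dependence on $i - j$ only.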

Transformer Backbone

Next, we will proceed with the implementation of the Transformer layers and model. Each layer of the Transformer contains the following components: self-attention, a feed-forward layer, and normalization. The feed-forward layer is a two-layer fully connected network with ReLU activation. It contains two weight matrices $W_1 \in \mathbb{R}^{d_{\text{model}} \times 4 d_{\text{model}}}$ and $W_2 \in \mathbb{R}^{4 d_{\text{model}} \times d_{\text{model}}}$, and its output is given by:

$$\text{FC}(X) = \text{ReLU}(X W_1)\, W_2.$$

1.4 [15pts] Implement the transformer layer module. The prenorm parameter determines whether

the transformer layer is Pre-Norm or Post-Norm.
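
The difference between the two variants can be sketched as follows. The sketch assumes the commonly used definitions with residual connections (Pre-Norm normalises the input of each sub-block, Post-Norm normalises after the residual sum), which the text above does not spell out. The layer normalisation here omits learnable gain and bias, and `attention` is a hypothetical callable standing in for the multi-head attention module.

```python
import torch


def layer_norm(x, eps=1e-5):
    # Normalise over the feature dimension; learnable gain/bias omitted.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)


def feed_forward(x, W1, W2):
    # FC(X) = ReLU(X W1) W2
    return torch.relu(x @ W1) @ W2


def transformer_layer(x, attention, W1, W2, prenorm):
    if prenorm:
        # Pre-Norm: normalise the input of each sub-block, then add the residual.
        x = x + attention(layer_norm(x))
        x = x + feed_forward(layer_norm(x), W1, W2)
    else:
        # Post-Norm: add the residual first, then normalise the sum.
        x = layer_norm(x + attention(x))
        x = layer_norm(x + feed_forward(x, W1, W2))
    return x


torch.manual_seed(0)
x = torch.randn(2, 5, 8)
W1, W2 = torch.randn(8, 32), torch.randn(32, 8)
identity_attention = lambda t: t  # stand-in for self-attention
out_pre = transformer_layer(x, identity_attention, W1, W2, prenorm=True)
out_post = transformer_layer(x, identity_attention, W1, W2, prenorm=False)
print(out_pre.shape, out_post.shape)
```

In the Post-Norm variant the layer output is always normalised, whereas in the Pre-Norm variant the residual stream itself is never normalised inside the layer.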

1.5 [15pts] Implement the transformer model module. This module contains an encoder, nLayers Transformer layers, and a decoder. The encoder weight $W_{\text{enc}} \in \mathbb{R}^{d_{\text{input}} \times d_{\text{model}}}$ is a linear layer that takes in tokens of dimension $d_{\text{input}} = d_{\text{voc}}$ and encodes each of them into a high-dimensional space of dimension $d_{\text{model}}$. After the Transformer layers are applied, the decoder weight matrix $W_{\text{dec}} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{out}}}$ decodes each token from the high-dimensional space into the output space.
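
The flow of shapes through the model can be sketched as below. The function name is hypothetical, the layers are passed in as plain callables, and the choices $d_{\text{model}} = 16$ and $d_{\text{out}} = 2$ are assumptions for the example; where positional encoding is applied is left out of this sketch.

```python
import torch


def transformer_forward(X, W_enc, layers, W_dec):
    """X: (batch, n + 1, d_voc) one-hot tokens; returns (batch, n + 1, d_out)."""
    h = X @ W_enc             # encode each token into d_model dimensions
    for layer in layers:      # each layer keeps the (batch, n + 1, d_model) shape
        h = layer(h)
    return h @ W_dec          # decode each token into the output space


# Shape check with identity stand-ins for the Transformer layers.
X = torch.zeros(2, 7, 5)
out = transformer_forward(X, torch.randn(5, 16), [lambda h: h] * 3, torch.randn(16, 2))
print(out.shape)  # torch.Size([2, 7, 2])
```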

Optimization Scheduling

Next, we implement the optimizer and its scheduler. We use the Adam optimizer.

Transformers often require warm-up and cool-down scheduling for stable training and better/faster

generalization. That is, in the early steps of optimization, the learning rate is gradually increased

from zero to the base learning rate, and in the final stages of optimization, the learning rate is

decreased to zero again. Here, we implement a simple scheduler with linear warmup and cooldown.

1.6 [5pts] Implement a scheduler with warmUp warmup steps and maxSteps total steps. In other words, your scheduler must return a zero learning rate at step 0, increase the learning rate linearly to lr at step warmUp, and then decrease it linearly to zero at step maxSteps.
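
The schedule described above is a piecewise linear function of the step. A minimal sketch as a standalone function follows; the function name and signature are hypothetical, and it assumes `0 < warm_up < max_steps`. The notebook may expect a different interface, for example a class wrapping the optimizer.

```python
def linear_warmup_cooldown(step, lr, warm_up, max_steps):
    """Learning rate at a given step for linear warmup followed by linear cooldown."""
    if step < warm_up:
        # Ramp from 0 at step 0 up to lr at step warm_up.
        return lr * step / warm_up
    # Ramp from lr at step warm_up down to 0 at step max_steps (clamped at 0 after).
    return lr * max(0.0, (max_steps - step) / (max_steps - warm_up))


for step in (0, 5, 10, 55, 100):
    print(step, linear_warmup_cooldown(step, lr=1e-3, warm_up=10, max_steps=100))
```

With `warm_up=10` and `max_steps=100`, the rate is zero at step 0, reaches `lr` at step 10, is half of `lr` at step 55, and returns to zero at step 100.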

Train Substring Matching

Finally, we have all the pieces necessary to train a substring matching model using the Transformer

architecture. The Trainer class contains a minimal training and evaluation procedure for this task.

1.7 [15pts] Implement the loss computation function in the Trainer class. The Transformer produces n output vectors for an input with n tokens. However, our task requires a single output for the entire sequence (i.e., the label). Hence, we have added a dummy token called the “[CLS]” token, and will only look at the output of this token to compute the loss and accuracy. The loss/accuracy computation function should compute the predicted label and the binary cross entropy with the ground truth labels, and return both the loss and accuracy for a given batch. You may use the cross entropy loss function from PyTorch to compute the loss.
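
A sketch of this computation is shown below. It assumes that the model outputs two logits per token ($d_{\text{out}} = 2$), that the “[CLS]” token sits at index 0 of the sequence, and that labels are integer class indices; the function name is hypothetical and the Trainer class in the notebook may use a different signature.

```python
import torch
import torch.nn.functional as F


def cls_loss_and_accuracy(outputs, labels):
    """outputs: (batch, n + 1, 2) logits; labels: (batch,) integers in {0, 1}."""
    cls_logits = outputs[:, 0, :]                    # keep only the "[CLS]" position
    loss = F.cross_entropy(cls_logits, labels)       # mean loss over the batch
    predictions = cls_logits.argmax(dim=-1)          # predicted label per sequence
    accuracy = (predictions == labels).float().mean()
    return loss, accuracy


# Two sequences whose "[CLS]" logits strongly favour classes 1 and 0.
outputs = torch.zeros(2, 3, 2)
outputs[0, 0, 1] = 10.0
outputs[1, 0, 0] = 10.0
labels = torch.tensor([1, 0])
loss, accuracy = cls_loss_and_accuracy(outputs, labels)
print(loss.item(), accuracy.item())
```

With two classes, cross entropy over the two logits is equivalent to binary cross entropy on the probability of class 1.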

1.8 [10pts] Run the training procedure to make sure your code is correct and the model trains

properly. In case you were not successful in implementing some of the previous parts, you may use

a module from PyTorch (e.g., Attention, Transformer, etc) to implement this part and receive the

full points for this part. If you have implemented everything correctly, your code should achieve

close to perfect accuracy on the test set.

