GPT-3, BERT, and XLNet are among the latest advances in natural language processing (NLP), and they all rely on a special architectural component called the transformer, because the transformer mechanism is extremely powerful. A complete transformer is usually built from the following pieces:
- scaled dot-product attention
- self-attention
- cross-attention
- multi-head attention
- positional encoding

Let's start with scaled dot-product attention, since we will also need it to build multi-head attention.
Scaled Dot-Product Attention

Mathematically, scaled dot-product attention is expressed as:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

where d_k is the dimensionality of the keys.
Q, K, and V are feature tensors produced from the input (in the code below, by learned linear projections), each with shape (batch_size, seq_length, num_features).
Multiplying the queries (Q) with the keys (K) yields a (batch_size, seq_length, seq_length) tensor, which roughly tells us how important each element of the sequence is, i.e. which elements we "attend" to. The attention array is normalized with a softmax so that all the weights sum to 1. Finally, the attention weights are applied to the value (V) array through a matrix multiplication.
The code for scaled dot-product attention is very simple: just a couple of matrix multiplications plus a softmax. To keep things even simpler, we omit the optional mask operation.
from torch import Tensor
import torch.nn.functional as f


def scaled_dot_product_attention(query: Tensor, key: Tensor, value: Tensor) -> Tensor:
    temp = query.bmm(key.transpose(1, 2))
    scale = query.size(-1) ** 0.5
    softmax = f.softmax(temp / scale, dim=-1)
    return softmax.bmm(value)
Note that the MatMul operation corresponds to torch.bmm in PyTorch. That is because Q, K, and V (the query, key, and value arrays) are all batches of matrices, each with shape (batch_size, sequence_length, num_features), and batch matrix multiplication operates only over the last two dimensions.
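As a quick sanity check (a minimal sketch with made-up shapes), we can run the function on random tensors and confirm that the output keeps the feature size of the value array:

import torch

# Hypothetical shapes, just to illustrate how the dimensions flow.
query = torch.rand(2, 5, 64)   # (batch_size, seq_length, num_features)
key = torch.rand(2, 5, 64)
value = torch.rand(2, 5, 64)
out = scaled_dot_product_attention(query, key, value)
print(out.shape)  # torch.Size([2, 5, 64])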
In self-attention, Q, K, and V all come from the same input: the high-dimensional representation of the current sequence produced by the previous layer. In cross-attention, Q comes from the current (target) sequence, while K and V share another input, namely the output of the encoder's final layer.

Multi-Head Attention

Multi-head attention is made up of several identical attention heads. Each attention head contains three linear layers.
The code is as follows:
import torch
from torch import nn


class HeadAttention(nn.Module):
    def __init__(self, dim_in: int, dim_k: int, dim_v: int):
        super().__init__()
        self.q = nn.Linear(dim_in, dim_k)
        self.k = nn.Linear(dim_in, dim_k)
        self.v = nn.Linear(dim_in, dim_v)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        return scaled_dot_product_attention(self.q(query), self.k(key), self.v(value))
Now building multi-head attention is easy: just combine num_heads different attention heads with a Linear layer for the output.
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads: int, dim_in: int, dim_k: int, dim_v: int):
        super().__init__()
        self.heads = nn.ModuleList(
            [HeadAttention(dim_in, dim_k, dim_v) for _ in range(num_heads)]
        )
        self.linear = nn.Linear(num_heads * dim_v, dim_in)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        return self.linear(
            torch.cat([h(query, key, value) for h in self.heads], dim=-1)
        )
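To make the self-attention / cross-attention distinction concrete, here is a small usage sketch (the tensor sizes and hyperparameters are made up for illustration): in self-attention all three arguments are the same tensor, while in cross-attention the query comes from the current sequence and the keys and values come from another one, such as the encoder output.

import torch

# Made-up sizes for illustration only.
mha = MultiHeadAttention(num_heads=8, dim_in=512, dim_k=64, dim_v=64)

x = torch.rand(2, 10, 512)      # current sequence
memory = torch.rand(2, 7, 512)  # e.g. an encoder output

self_attn = mha(x, x, x)             # Q, K, V all from the same input
cross_attn = mha(x, memory, memory)  # Q from x; K, V from memory
print(self_attn.shape, cross_attn.shape)
# torch.Size([2, 10, 512]) torch.Size([2, 10, 512])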
Positional Encoding

Before we can build the complete transformer, we still need one more component: positional encoding. Notice that MultiHeadAttention has no operations along the sequence dimension; everything happens along the feature dimension, so it is independent of the sequence length. We have to provide positional information to the model so that it knows about the relative positions of the data points in the input sequence.
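To see this concretely (a minimal sketch reusing the MultiHeadAttention module defined above, with assumed hyperparameters): shuffling the positions of the input sequence merely shuffles the output the same way, so the module by itself carries no notion of order.

import torch

mha = MultiHeadAttention(num_heads=8, dim_in=512, dim_k=64, dim_v=64)
x = torch.rand(1, 4, 512)
perm = torch.tensor([2, 0, 3, 1])  # an arbitrary reordering of the positions

out = mha(x, x, x)
out_shuffled = mha(x[:, perm], x[:, perm], x[:, perm])
# The shuffled input produces the correspondingly shuffled output.
print(torch.allclose(out[:, perm], out_shuffled, atol=1e-5))  # True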
The transformer paper encodes positional information using trigonometric functions:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
Why use sinusoidal encodings? Because sine and cosine are periodic and bounded (their values lie in [-1, 1]), they give every position a well-behaved encoding. Even though learned positional embeddings have been shown to perform just as well, the authors still chose the sinusoidal encoding.
We can implement it in just a few lines of code:
def position_encoding(
    seq_len: int, dim_model: int, device: torch.device = torch.device("cpu"),
) -> Tensor:
    pos = torch.arange(seq_len, dtype=torch.float, device=device).reshape(1, -1, 1)
    dim = torch.arange(dim_model, dtype=torch.float, device=device).reshape(1, 1, -1)
    phase = pos / (1e4 ** (dim / dim_model))
    # Even feature indices get a sine, odd indices get a cosine.
    return torch.where(dim.long() % 2 == 0, torch.sin(phase), torch.cos(phase))
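A quick check of the result (a small sketch with made-up sizes): the output holds one encoding vector per position, broadcastable over the batch dimension, with sine values in the even feature slots and cosine values in the odd ones.

pe = position_encoding(seq_len=16, dim_model=512)
print(pe.shape)      # torch.Size([1, 16, 512])
print(pe[0, 0, :4])  # position 0: sin(0), cos(0), sin(0), cos(0) -> 0, 1, 0, 1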
Transformer

Finally, we are ready to build the complete Transformer! Let's recap the overall structure of the network.
Note that the transformer uses an encoder-decoder architecture. The encoder processes the input sequence and returns a sequence of feature vectors (the memory). The decoder processes the target sequence and incorporates the information from the encoder memory. The output of the decoder is our model's prediction.
We can implement the encoder and decoder modules independently of each other and combine them at the end. First, let's build the encoder:
def feed_forward(dim_input: int = 512, dim_feedforward: int = 2048) -> nn.Module:
    return nn.Sequential(
        nn.Linear(dim_input, dim_feedforward),
        nn.ReLU(),
        nn.Linear(dim_feedforward, dim_input),
    )


class Residual(nn.Module):
    def __init__(self, sublayer: nn.Module, dimension: int, dropout: float = 0.1):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dimension)
        self.dropout = nn.Dropout(dropout)

    def forward(self, *tensors: Tensor) -> Tensor:
        # Assume that the "query" tensor is given first, so we can compute the
        # residual. This matches the signature of 'MultiHeadAttention'.
        return self.norm(tensors[0] + self.dropout(self.sublayer(*tensors)))


class TransformerEncoderLayer(nn.Module):
    def __init__(
        self,
        dim_model: int = 512,
        num_heads: int = 6,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        dim_k = dim_v = dim_model // num_heads
        self.attention = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.feed_forward = Residual(
            feed_forward(dim_model, dim_feedforward),
            dimension=dim_model,
            dropout=dropout,
        )

    def forward(self, src: Tensor) -> Tensor:
        src = self.attention(src, src, src)
        return self.feed_forward(src)


class TransformerEncoder(nn.Module):
    def __init__(
        self,
        num_layers: int = 6,
        dim_model: int = 512,
        num_heads: int = 8,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(dim_model, num_heads, dim_feedforward, dropout)
            for _ in range(num_layers)
        ])

    def forward(self, src: Tensor) -> Tensor:
        seq_len, dimension = src.size(1), src.size(2)
        src = src + position_encoding(seq_len, dimension, device=src.device)
        for layer in self.layers:
            src = layer(src)
        return src
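Before moving on to the decoder, we can sanity-check the encoder on its own (a minimal sketch with arbitrary sizes); the output should keep the input shape:

import torch

encoder = TransformerEncoder(num_layers=2, dim_model=512, num_heads=8)
src = torch.rand(4, 16, 512)  # (batch_size, seq_length, dim_model)
memory = encoder(src)
print(memory.shape)           # torch.Size([4, 16, 512])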
The decoder module is very similar, with just a few small differences:
- The decoder takes two arguments (target and memory) instead of one.
- Each layer has two multi-head attention modules instead of one.
- The second multi-head attention takes the encoder memory for two of its inputs (the keys and values).
- In other words, the decoder contains both self-attention and cross-attention.

class TransformerDecoderLayer(nn.Module):
    def __init__(
        self,
        dim_model: int = 512,
        num_heads: int = 6,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        dim_k = dim_v = dim_model // num_heads
        self.attention_1 = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.attention_2 = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.feed_forward = Residual(
            feed_forward(dim_model, dim_feedforward),
            dimension=dim_model,
            dropout=dropout,
        )

    def forward(self, tgt: Tensor, memory: Tensor) -> Tensor:
        # Self-attention over the target sequence.
        tgt = self.attention_1(tgt, tgt, tgt)
        # Cross-attention: queries come from the target, keys and values from
        # the encoder memory.
        tgt = self.attention_2(tgt, memory, memory)
        return self.feed_forward(tgt)


class TransformerDecoder(nn.Module):
    def __init__(
        self,
        num_layers: int = 6,
        dim_model: int = 512,
        num_heads: int = 8,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerDecoderLayer(dim_model, num_heads, dim_feedforward, dropout)
            for _ in range(num_layers)
        ])
        self.linear = nn.Linear(dim_model, dim_model)

    def forward(self, tgt: Tensor, memory: Tensor) -> Tensor:
        seq_len, dimension = tgt.size(1), tgt.size(2)
        tgt = tgt + position_encoding(seq_len, dimension, device=tgt.device)
        for layer in self.layers:
            tgt = layer(tgt, memory)
        return torch.softmax(self.linear(tgt), dim=-1)
Finally, we need to wrap everything up in a single Transformer class. That just requires putting an encoder and a decoder together and passing data through them in the right order.
class Transformer(nn.Module):
    def __init__(
        self,
        num_encoder_layers: int = 6,
        num_decoder_layers: int = 6,
        dim_model: int = 512,
        num_heads: int = 6,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
        activation: nn.Module = nn.ReLU(),
    ):
        super().__init__()
        self.encoder = TransformerEncoder(
            num_layers=num_encoder_layers,
            dim_model=dim_model,
            num_heads=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
        )
        self.decoder = TransformerDecoder(
            num_layers=num_decoder_layers,
            dim_model=dim_model,
            num_heads=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
        )

    def forward(self, src: Tensor, tgt: Tensor) -> Tensor:
        return self.decoder(tgt, self.encoder(src))
Let's create a simple test as a sanity check of our implementation. We can construct random tensors for src and tgt, check that the model runs without errors, and confirm that the output tensor has the correct shape.
src = torch.rand(64, 16, 512)
tgt = torch.rand(64, 16, 512)
out = Transformer()(src, tgt)
print(out.shape)
# torch.Size([64, 16, 512])
Conclusions
Hopefully this article has helped you understand how transformers are built and how they work. In computer vision you may not have come across these models before, but DETR and ViT have already achieved breakthrough results, and we can expect to see many more models like them in the coming years.