
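GPT-3, BERT, and XLNet are among the current state-of-the-art models in natural language processing (NLP), and they all rely on a special architectural component called the transformer. The reason is simple: the transformer mechanism is very powerful. A complete transformer is typically assembled from three pieces: scaled dot-product attention (used in two ways, as self-attention and as cross-attention), multi-head attention, and positional encoding.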

Let's start with scaled dot-product attention, since we also need it to build multi-head attention.

Scaled Dot-Product Attention

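Mathematically, scaled dot-product attention is expressed as follows, where d_k is the feature dimension of the queries and keys:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$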

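Multiplying the queries (Q) by the transposed keys (K) produces a tensor of shape (batch_size, seq_length, seq_length). It roughly tells us how important each element of the sequence is to every other element, and so determines which elements we "attend" to. This attention tensor is normalized with softmax, so the weights in each row sum to 1. Finally, the attention weights are applied to the value (V) tensor by matrix multiplication.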
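To make the shapes concrete, here is a minimal sketch with random tensors. The sizes (2, 4, 8) and the variable names are arbitrary choices for illustration and do not come from the article.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_length, num_features = 2, 4, 8  # illustrative sizes only

query = torch.rand(batch_size, seq_length, num_features)
key = torch.rand(batch_size, seq_length, num_features)
value = torch.rand(batch_size, seq_length, num_features)

# Q times K-transpose: one score per (query position, key position) pair,
# scaled by the square root of the feature dimension.
scores = query.bmm(key.transpose(1, 2)) / num_features ** 0.5

# Softmax over the last dimension turns each row of scores into weights.
weights = F.softmax(scores, dim=-1)

# Each output row is a weighted sum of the value vectors.
output = weights.bmm(value)

print(scores.shape)   # torch.Size([2, 4, 4])
print(output.shape)   # torch.Size([2, 4, 8])
# Every row of attention weights sums to 1.
print(torch.allclose(weights.sum(dim=-1), torch.ones(batch_size, seq_length)))  # True
```

The score tensor is square here only because the queries and keys have the same sequence length; the output always keeps the sequence length of the queries.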

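The code for scaled dot-product attention is very simple: just a couple of matrix multiplications plus a softmax function. To keep things even simpler, we omit the optional mask operation.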

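from torch import Tensor
import torch.nn.functional as f


def scaled_dot_product_attention(query: Tensor, key: Tensor, value: Tensor) -> Tensor:
    # Similarity scores between every query and every key:
    # (batch_size, query_length, key_length).
    temp = query.bmm(key.transpose(1, 2))
    # Scale by the square root of the feature dimension.
    scale = query.size(-1) ** 0.5
    # Normalize each row of scores into attention weights.
    softmax = f.softmax(temp / scale, dim=-1)
    # Weighted sum of the values.
    return softmax.bmm(value)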

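Note that the MatMul operation corresponds to torch.bmm in PyTorch. That is because Q, K, and V (the query, key, and value tensors) are each batches of matrices with shape (batch_size, sequence_length, num_features), and the matrix multiplication is performed only over the last two dimensions.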
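The mask that the code above leaves out could be added roughly as follows. This is only a sketch under stated assumptions: the function name `masked_scaled_dot_product_attention` is hypothetical, and so is the mask convention (a boolean tensor in which True means "this query may attend to this key").

```python
from typing import Optional

import torch
import torch.nn.functional as F
from torch import Tensor


def masked_scaled_dot_product_attention(
    query: Tensor, key: Tensor, value: Tensor, mask: Optional[Tensor] = None
) -> Tensor:
    # Hypothetical helper: same computation as before, plus an optional mask.
    scores = query.bmm(key.transpose(1, 2)) / query.size(-1) ** 0.5
    if mask is not None:
        # Assumed convention: True = may attend. Blocked positions get -inf,
        # so softmax assigns them a weight of exactly zero.
        scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1).bmm(value)


torch.manual_seed(0)
x = torch.rand(2, 4, 8)

# Causal (lower-triangular) mask: position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(4, 4, dtype=torch.bool))
out = masked_scaled_dot_product_attention(x, x, x, mask=causal_mask)

print(out.shape)  # torch.Size([2, 4, 8])
# The first position can only see itself, so its output equals its own value.
print(torch.allclose(out[:, 0], x[:, 0]))  # True
```

A causal mask like this is what lets a decoder be trained on whole target sequences without letting a position look at later positions.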

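In self-attention, Q, K, and V all come from the same input: the high-dimensional representation of the current sequence produced by the previous layer. In cross-attention, Q represents the current sequence, while K and V share a single input: the output of the encoder's last layer.

Multi-Head Attention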

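As the figure above shows, multi-head attention is composed of several identical attention heads. Each attention head contains three linear layers, followed by scaled dot-product attention: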


import torch
from torch import Tensor, nn


class HeadAttention(nn.Module):
    def __init__(self, dim_in: int, dim_k: int, dim_v: int):
        super().__init__()
        # Separate learned projections for queries, keys, and values.
        self.q = nn.Linear(dim_in, dim_k)
        self.k = nn.Linear(dim_in, dim_k)
        self.v = nn.Linear(dim_in, dim_v)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        return scaled_dot_product_attention(self.q(query), self.k(key), self.v(value))

Now it is very easy to build multi-head attention. Just combine num_heads independent attention heads with a final Linear layer that produces the output.

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads: int, dim_in: int, dim_k: int, dim_v: int):
        super().__init__()
        self.heads = nn.ModuleList(
            [HeadAttention(dim_in, dim_k, dim_v) for _ in range(num_heads)]
        )
        # Projects the concatenated head outputs back to the input dimension.
        self.linear = nn.Linear(num_heads * dim_v, dim_in)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        # Run every head, concatenate along the feature dimension, then project.
        return self.linear(
            torch.cat([h(query, key, value) for h in self.heads], dim=-1)
        )

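The difference between self-attention and cross-attention is easiest to see in the tensor shapes. The sketch below uses PyTorch's built-in `nn.MultiheadAttention` rather than the class defined in this article, so that it runs on its own. The sizes and the names `tgt` and `memory` are illustrative assumptions, and a real model would use separate layers for the two cases.

```python
import torch
from torch import nn

torch.manual_seed(0)
embed_dim, num_heads = 16, 4  # illustrative sizes only

# PyTorch's built-in layer, used here only to inspect shapes.
attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

tgt = torch.rand(2, 5, embed_dim)     # current sequence, length 5
memory = torch.rand(2, 7, embed_dim)  # stand-in for encoder output, length 7

# Self-attention: query, key, and value are the same tensor.
self_out, self_weights = attention(tgt, tgt, tgt)

# Cross-attention: the query is the current sequence, while the key and
# value both come from the encoder output.
cross_out, cross_weights = attention(tgt, memory, memory)

print(self_out.shape, self_weights.shape)    # torch.Size([2, 5, 16]) torch.Size([2, 5, 5])
print(cross_out.shape, cross_weights.shape)  # torch.Size([2, 5, 16]) torch.Size([2, 5, 7])
```

In both cases the output keeps the length of the query sequence; only the attention weights change shape, because each query position now distributes its weight over the 7 memory positions.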
Positional Encoding

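Before building the complete transformer, we need one more component: positional encoding. Note that none of the learned layers in MultiHeadAttention operate over the sequence dimension. All of them act on the feature dimension, so the module is independent of sequence length and, on its own, blind to the order of the elements. We must therefore supply positional information to the model so that it knows the relative positions of the data points in the input sequence.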
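A quick way to see why position information is needed: without it, shuffling the input sequence simply shuffles the output in the same way. The sketch below checks this for plain self-attention; the helper name `self_attention` and the tensor sizes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def self_attention(x: torch.Tensor) -> torch.Tensor:
    # Plain scaled dot-product self-attention with no positional information.
    scores = x.bmm(x.transpose(1, 2)) / x.size(-1) ** 0.5
    return F.softmax(scores, dim=-1).bmm(x)


torch.manual_seed(0)
x = torch.rand(1, 5, 8)
perm = torch.tensor([4, 2, 0, 3, 1])  # an arbitrary reordering of the 5 positions

out = self_attention(x)
out_shuffled = self_attention(x[:, perm])

# Reordering the input only reorders the output rows; nothing else changes.
print(torch.allclose(out[:, perm], out_shuffled, atol=1e-5))  # True
```

Because the result for each element does not depend on where it sits in the sequence, word order would be invisible to the model unless the position is encoded in the input itself.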

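The transformer paper encodes positional information with trigonometric functions:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$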

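Why sinusoidal encodings? Sine and cosine are periodic and bounded, covering the range [-1, 1]. The paper also notes that the sinusoidal version may let the model extrapolate to sequences longer than those seen during training. So, although learned embeddings turned out to perform equally well, the authors still chose the sinusoidal encoding.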


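def position_encoding(
    seq_len: int,
    dim_model: int,
    device: torch.device = torch.device("cpu"),
) -> Tensor:
    # pos has shape (1, seq_len, 1) and dim has shape (1, 1, dim_model);
    # together they broadcast to (1, seq_len, dim_model).
    pos = torch.arange(seq_len, dtype=torch.float, device=device).reshape(1, -1, 1)
    dim = torch.arange(dim_model, dtype=torch.float, device=device).reshape(1, 1, -1)
    # Each sin/cos pair shares the exponent 2i / dim_model, where 2i is the
    # even feature index of the pair.
    phase = pos / (1e4 ** ((dim - dim % 2) / dim_model))
    # Even feature indices use sine, odd feature indices use cosine.
    return torch.where(dim.long() % 2 == 0, torch.sin(phase), torch.cos(phase))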
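As a sanity check on the sinusoidal formula from the paper, the sketch below builds a small encoding table and inspects a few entries. The function name `sinusoidal_encoding` and the sizes are assumptions chosen for illustration; the function is a standalone variant that returns a 2-D table with no batch dimension.

```python
import math

import torch


def sinusoidal_encoding(seq_len: int, dim_model: int) -> torch.Tensor:
    # Standalone sketch: sin(pos / 10000^(2i/d)) in even slots, cos in odd slots.
    pos = torch.arange(seq_len, dtype=torch.float).reshape(-1, 1)
    dim = torch.arange(dim_model, dtype=torch.float).reshape(1, -1)
    phase = pos / (1e4 ** ((dim - dim % 2) / dim_model))
    return torch.where(dim.long() % 2 == 0, torch.sin(phase), torch.cos(phase))


pe = sinusoidal_encoding(seq_len=50, dim_model=8)

print(pe.shape)  # torch.Size([50, 8])
# Position 0: sin(0) = 0 in the even slots, cos(0) = 1 in the odd slots.
print(pe[0])     # tensor([0., 1., 0., 1., 0., 1., 0., 1.])
# All values stay within [-1, 1].
print(pe.min().item() >= -1.0, pe.max().item() <= 1.0)  # True True
```

The first pair of features oscillates fastest (its phase is simply the position), while later pairs change more and more slowly, which gives each position a distinct pattern across the feature dimension.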

Finally, we are ready to build the transformer! Let's take another look at the complete network diagram:

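Note that the transformer uses an encoder-decoder architecture. The encoder (left) processes the input sequence and returns a feature vector (or memory vector). The decoder processes the target sequence and incorporates information from the encoder memory. The output of the decoder is our model's prediction!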
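The data flow described above can be traced with PyTorch's built-in `nn.Transformer`, used here instead of the classes built in this article so that the snippet runs on its own. The layer counts and tensor sizes are small illustrative assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)

# A deliberately tiny built-in transformer, only for tracing shapes.
model = nn.Transformer(
    d_model=32,
    nhead=4,
    num_encoder_layers=1,
    num_decoder_layers=1,
    dim_feedforward=64,
    dropout=0.0,
    batch_first=True,
)

src = torch.rand(2, 10, 32)  # input sequence, length 10
tgt = torch.rand(2, 7, 32)   # target sequence, length 7

# The encoder turns the input sequence into the memory.
memory = model.encoder(src)
# The decoder processes the target sequence while attending to the memory.
out = model.decoder(tgt, memory)

print(memory.shape)  # torch.Size([2, 10, 32])
print(out.shape)     # torch.Size([2, 7, 32])
```

The memory keeps the length of the input sequence, while the decoder output follows the length of the target sequence.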

We can code the encoder and decoder modules independently of each other and then combine them at the end. First, let's build the encoder:

def feed_forward(dim_input: int = 512, dim_feedforward: int = 2048) -> nn.Module:
    # Position-wise feed-forward network applied to every sequence element.
    return nn.Sequential(
        nn.Linear(dim_input, dim_feedforward),
        nn.ReLU(),
        nn.Linear(dim_feedforward, dim_input),
    )


class Residual(nn.Module):
    def __init__(self, sublayer: nn.Module, dimension: int, dropout: float = 0.1):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dimension)
        self.dropout = nn.Dropout(dropout)

    def forward(self, *tensors: Tensor) -> Tensor:
        # Assume that the "query" tensor is given first, so we can compute the
        # residual. This matches the signature of 'MultiHeadAttention'.
        return self.norm(tensors[0] + self.dropout(self.sublayer(*tensors)))


class TransformerEncoderLayer(nn.Module):
    def __init__(
        self,
        dim_model: int = 512,
        num_heads: int = 8,  # 8 divides 512 evenly and matches TransformerEncoder
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        dim_k = dim_v = dim_model // num_heads
        self.attention = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.feed_forward = Residual(
            feed_forward(dim_model, dim_feedforward),
            dimension=dim_model,
            dropout=dropout,
        )

    def forward(self, src: Tensor) -> Tensor:
        # Self-attention: query, key, and value are all the source sequence.
        src = self.attention(src, src, src)
        return self.feed_forward(src)


class TransformerEncoder(nn.Module):
    def __init__(
        self,
        num_layers: int = 6,
        dim_model: int = 512,
        num_heads: int = 8,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(dim_model, num_heads, dim_feedforward, dropout)
            for _ in range(num_layers)
        ])

    def forward(self, src: Tensor) -> Tensor:
        seq_len, dimension = src.size(1), src.size(2)
        # Out-of-place add, so the caller's tensor is not modified.
        src = src + position_encoding(seq_len, dimension, device=src.device)
        for layer in self.layers:
            src = layer(src)
        return src

The decoder is built the same way. Each decoder layer adds a second attention block, which performs cross-attention over the encoder memory:

class TransformerDecoderLayer(nn.Module):
    def __init__(
        self,
        dim_model: int = 512,
        num_heads: int = 8,  # 8 divides 512 evenly and matches TransformerDecoder
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        dim_k = dim_v = dim_model // num_heads
        self.attention_1 = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.attention_2 = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.feed_forward = Residual(
            feed_forward(dim_model, dim_feedforward),
            dimension=dim_model,
            dropout=dropout,
        )

    def forward(self, tgt: Tensor, memory: Tensor) -> Tensor:
        # Self-attention over the target sequence.
        tgt = self.attention_1(tgt, tgt, tgt)
        # Cross-attention: the query is the target sequence, while the key and
        # value both come from the encoder memory.
        tgt = self.attention_2(tgt, memory, memory)
        return self.feed_forward(tgt)


class TransformerDecoder(nn.Module):
    def __init__(
        self,
        num_layers: int = 6,
        dim_model: int = 512,
        num_heads: int = 8,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerDecoderLayer(dim_model, num_heads, dim_feedforward, dropout)
            for _ in range(num_layers)
        ])
        self.linear = nn.Linear(dim_model, dim_model)

    def forward(self, tgt: Tensor, memory: Tensor) -> Tensor:
        seq_len, dimension = tgt.size(1), tgt.size(2)
        # Out-of-place add, so the caller's tensor is not modified.
        tgt = tgt + position_encoding(seq_len, dimension, device=tgt.device)
        for layer in self.layers:
            tgt = layer(tgt, memory)
        return torch.softmax(self.linear(tgt), dim=-1)


Finally, we combine the encoder and decoder into a single Transformer class:

class Transformer(nn.Module):
    def __init__(
        self,
        num_encoder_layers: int = 6,
        num_decoder_layers: int = 6,
        dim_model: int = 512,
        num_heads: int = 8,  # matches the encoder and decoder defaults
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
        # Accepted but not used by this simplified implementation.
        activation: nn.Module = nn.ReLU(),
    ):
        super().__init__()
        self.encoder = TransformerEncoder(
            num_layers=num_encoder_layers,
            dim_model=dim_model,
            num_heads=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
        )
        self.decoder = TransformerDecoder(
            num_layers=num_decoder_layers,
            dim_model=dim_model,
            num_heads=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
        )

    def forward(self, src: Tensor, tgt: Tensor) -> Tensor:
        # Encode the source, then decode the target against the encoder memory.
        return self.decoder(tgt, self.encoder(src))


src = torch.rand(64, 16, 512)
tgt = torch.rand(64, 16, 512)
out = Transformer()(src, tgt)
print(out.shape)
# torch.Size([64, 16, 512])


