GPT-3, BERT, and XLNet are among the latest advances in natural language processing (NLP), and they all rely on a special architectural component called the transformer, because the transformer mechanism is extremely powerful. A complete transformer is usually built from the following pieces:
- scaled dot-product attention
- self-attention
- cross-attention
- multi-head attention
- positional encoding

Let's start with scaled dot-product attention, since we will also need it to build multi-head attention.
Scaled Dot-Product Attention

Mathematically, scaled dot-product attention is expressed as:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

where d_k is the dimensionality of the keys.
Q, K, and V are feature tensors produced from the input (in the code below, by learned linear projections), each with shape (batch_size, seq_length, num_features).
Multiplying the queries (Q) with the keys (K) yields a (batch_size, seq_length, seq_length) tensor, which roughly tells us how important each element of the sequence is, i.e. which elements we "attend" to. The attention array is normalized with a softmax so that all the weights sum to 1. Finally, the attention weights are applied to the value (V) array through a matrix multiplication.
The code for scaled dot-product attention is very simple: just a couple of matrix multiplications plus a softmax. To keep things even simpler, we omit the optional mask operation.
from torch import Tensor
import torch.nn.functional as f


def scaled_dot_product_attention(query: Tensor, key: Tensor, value: Tensor) -> Tensor:
    temp = query.bmm(key.transpose(1, 2))
    scale = query.size(-1) ** 0.5
    softmax = f.softmax(temp / scale, dim=-1)
    return softmax.bmm(value)
Note that the MatMul operation corresponds to torch.bmm in PyTorch. That is because Q, K, and V (the query, key, and value arrays) are all batches of matrices, each with shape (batch_size, sequence_length, num_features), and batch matrix multiplication operates only over the last two dimensions.
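As a quick sanity check (a minimal sketch with made-up shapes), we can run the function on random tensors and confirm that the output keeps the feature size of the value array:

import torch

# Hypothetical shapes, just to illustrate how the dimensions flow.
query = torch.rand(2, 5, 64)   # (batch_size, seq_length, num_features)
key = torch.rand(2, 5, 64)
value = torch.rand(2, 5, 64)
out = scaled_dot_product_attention(query, key, value)
print(out.shape)  # torch.Size([2, 5, 64])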
In self-attention, Q, K, and V all come from the same input: the high-dimensional representation of the current sequence produced by the previous layer. In cross-attention, Q comes from the current (target) sequence, while K and V share another input, namely the output of the encoder's final layer.

Multi-Head Attention

Multi-head attention is made up of several identical attention heads. Each attention head contains three linear layers.
The code is as follows:
import torch
from torch import nn


class HeadAttention(nn.Module):
    def __init__(self, dim_in: int, dim_k: int, dim_v: int):
        super().__init__()
        self.q = nn.Linear(dim_in, dim_k)
        self.k = nn.Linear(dim_in, dim_k)
        self.v = nn.Linear(dim_in, dim_v)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        return scaled_dot_product_attention(self.q(query), self.k(key), self.v(value))
Now building multi-head attention is easy: just combine num_heads different attention heads with a Linear layer for the output.
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads: int, dim_in: int, dim_k: int, dim_v: int):
        super().__init__()
        self.heads = nn.ModuleList(
            [HeadAttention(dim_in, dim_k, dim_v) for _ in range(num_heads)]
        )
        self.linear = nn.Linear(num_heads * dim_v, dim_in)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        return self.linear(
            torch.cat([h(query, key, value) for h in self.heads], dim=-1)
        )
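To make the self-attention / cross-attention distinction concrete, here is a small usage sketch (the tensor sizes and hyperparameters are made up for illustration): in self-attention all three arguments are the same tensor, while in cross-attention the query comes from the current sequence and the keys and values come from another one, such as the encoder output.

import torch

# Made-up sizes for illustration only.
mha = MultiHeadAttention(num_heads=8, dim_in=512, dim_k=64, dim_v=64)

x = torch.rand(2, 10, 512)      # current sequence
memory = torch.rand(2, 7, 512)  # e.g. an encoder output

self_attn = mha(x, x, x)             # Q, K, V all from the same input
cross_attn = mha(x, memory, memory)  # Q from x; K, V from memory
print(self_attn.shape, cross_attn.shape)
# torch.Size([2, 10, 512]) torch.Size([2, 10, 512])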
Positional Encoding

Before we can build the complete transformer, we still need one more component: positional encoding. Notice that MultiHeadAttention has no operations along the sequence dimension; everything happens along the feature dimension, so it is independent of the sequence length. We have to provide positional information to the model so that it knows about the relative positions of the data points in the input sequence.
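To see this concretely (a minimal sketch reusing the MultiHeadAttention module defined above, with assumed hyperparameters): shuffling the positions of the input sequence merely shuffles the output the same way, so the module by itself carries no notion of order.

import torch

mha = MultiHeadAttention(num_heads=8, dim_in=512, dim_k=64, dim_v=64)
x = torch.rand(1, 4, 512)
perm = torch.tensor([2, 0, 3, 1])  # an arbitrary reordering of the positions

out = mha(x, x, x)
out_shuffled = mha(x[:, perm], x[:, perm], x[:, perm])
# The shuffled input produces the correspondingly shuffled output.
print(torch.allclose(out[:, perm], out_shuffled, atol=1e-5))  # True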
The transformer paper encodes positional information using trigonometric functions:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
Why use sinusoidal encodings? Because sine and cosine are periodic and bounded (their values lie in [-1, 1]), they give every position a well-behaved encoding. Even though learned positional embeddings have been shown to perform just as well, the authors still chose the sinusoidal encoding.
We can implement it in just a few lines of code:
def position_encoding(
    seq_len: int, dim_model: int, device: torch.device = torch.device("cpu"),
) -> Tensor:
    pos = torch.arange(seq_len, dtype=torch.float, device=device).reshape(1, -1, 1)
    dim = torch.arange(dim_model, dtype=torch.float, device=device).reshape(1, 1, -1)
    phase = pos / (1e4 ** (dim / dim_model))
    # Even feature indices get a sine, odd indices get a cosine.
    return torch.where(dim.long() % 2 == 0, torch.sin(phase), torch.cos(phase))
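A quick check of the result (a small sketch with made-up sizes): the output holds one encoding vector per position, broadcastable over the batch dimension, with sine values in the even feature slots and cosine values in the odd ones.

pe = position_encoding(seq_len=16, dim_model=512)
print(pe.shape)      # torch.Size([1, 16, 512])
print(pe[0, 0, :4])  # position 0: sin(0), cos(0), sin(0), cos(0) -> 0, 1, 0, 1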
Transformer

Finally, we are ready to build the complete Transformer! Let's recap the overall structure of the network.
Note that the transformer uses an encoder-decoder architecture. The encoder processes the input sequence and returns a sequence of feature vectors (the memory). The decoder processes the target sequence and incorporates the information from the encoder memory. The output of the decoder is our model's prediction.
We can implement the encoder and decoder modules independently of each other and combine them at the end. First, let's build the encoder:
def feed_forward(dim_input: int = 512, dim_feedforward: int = 2048) -> nn.Module:
    return nn.Sequential(
        nn.Linear(dim_input, dim_feedforward),
        nn.ReLU(),
        nn.Linear(dim_feedforward, dim_input),
    )


class Residual(nn.Module):
    def __init__(self, sublayer: nn.Module, dimension: int, dropout: float = 0.1):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dimension)
        self.dropout = nn.Dropout(dropout)

    def forward(self, *tensors: Tensor) -> Tensor:
        # Assume that the "query" tensor is given first, so we can compute the
        # residual. This matches the signature of 'MultiHeadAttention'.
        return self.norm(tensors[0] + self.dropout(self.sublayer(*tensors)))


class TransformerEncoderLayer(nn.Module):
    def __init__(
        self,
        dim_model: int = 512,
        num_heads: int = 6,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        dim_k = dim_v = dim_model // num_heads
        self.attention = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.feed_forward = Residual(
            feed_forward(dim_model, dim_feedforward),
            dimension=dim_model,
            dropout=dropout,
        )

    def forward(self, src: Tensor) -> Tensor:
        src = self.attention(src, src, src)
        return self.feed_forward(src)


class TransformerEncoder(nn.Module):
    def __init__(
        self,
        num_layers: int = 6,
        dim_model: int = 512,
        num_heads: int = 8,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(dim_model, num_heads, dim_feedforward, dropout)
            for _ in range(num_layers)
        ])

    def forward(self, src: Tensor) -> Tensor:
        seq_len, dimension = src.size(1), src.size(2)
        src = src + position_encoding(seq_len, dimension, device=src.device)
        for layer in self.layers:
            src = layer(src)
        return src
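Before moving on to the decoder, we can sanity-check the encoder on its own (a minimal sketch with arbitrary sizes); the output should keep the input shape:

import torch

encoder = TransformerEncoder(num_layers=2, dim_model=512, num_heads=8)
src = torch.rand(4, 16, 512)  # (batch_size, seq_length, dim_model)
memory = encoder(src)
print(memory.shape)           # torch.Size([4, 16, 512])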
The decoder module is very similar, with just a few small differences:
- The decoder takes two arguments (target and memory) instead of one.
- Each layer has two multi-head attention modules instead of one.
- The second multi-head attention takes the encoder memory for two of its inputs (the keys and values).
- In other words, the decoder contains both self-attention and cross-attention.

class TransformerDecoderLayer(nn.Module):
    def __init__(
        self,
        dim_model: int = 512,
        num_heads: int = 6,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        dim_k = dim_v = dim_model // num_heads
        self.attention_1 = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.attention_2 = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.feed_forward = Residual(
            feed_forward(dim_model, dim_feedforward),
            dimension=dim_model,
            dropout=dropout,
        )

    def forward(self, tgt: Tensor, memory: Tensor) -> Tensor:
        # Self-attention over the target sequence.
        tgt = self.attention_1(tgt, tgt, tgt)
        # Cross-attention: queries come from the target, keys and values from
        # the encoder memory.
        tgt = self.attention_2(tgt, memory, memory)
        return self.feed_forward(tgt)


class TransformerDecoder(nn.Module):
    def __init__(
        self,
        num_layers: int = 6,
        dim_model: int = 512,
        num_heads: int = 8,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerDecoderLayer(dim_model, num_heads, dim_feedforward, dropout)
            for _ in range(num_layers)
        ])
        self.linear = nn.Linear(dim_model, dim_model)

    def forward(self, tgt: Tensor, memory: Tensor) -> Tensor:
        seq_len, dimension = tgt.size(1), tgt.size(2)
        tgt = tgt + position_encoding(seq_len, dimension, device=tgt.device)
        for layer in self.layers:
            tgt = layer(tgt, memory)
        return torch.softmax(self.linear(tgt), dim=-1)
Finally, we need to wrap everything up in a single Transformer class. That just requires putting an encoder and a decoder together and passing data through them in the right order.
class Transformer(nn.Module):
    def __init__(
        self,
        num_encoder_layers: int = 6,
        num_decoder_layers: int = 6,
        dim_model: int = 512,
        num_heads: int = 6,
        dim_feedforward: int = 2048,
        dropout: float = 0.1,
        activation: nn.Module = nn.ReLU(),
    ):
        super().__init__()
        self.encoder = TransformerEncoder(
            num_layers=num_encoder_layers,
            dim_model=dim_model,
            num_heads=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
        )
        self.decoder = TransformerDecoder(
            num_layers=num_decoder_layers,
            dim_model=dim_model,
            num_heads=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
        )

    def forward(self, src: Tensor, tgt: Tensor) -> Tensor:
        return self.decoder(tgt, self.encoder(src))
Let's create a simple test as a sanity check of our implementation. We can construct random tensors for src and tgt, check that the model runs without errors, and confirm that the output tensor has the correct shape.
src = torch.rand(64, 16, 512)
tgt = torch.rand(64, 16, 512)
out = Transformer()(src, tgt)
print(out.shape)
# torch.Size([64, 16, 512])
Conclusions
Hopefully this article has helped you understand how transformers are built and how they work. In computer vision you may not have come across these models before, but DETR and ViT have already achieved breakthrough results, and we can expect to see many more models like them in the coming years.