Machine learning data used in natural language processing often contains both text and numeric inputs. For example, when you build a model from Twitter or news to predict future sales of a product, it is more effective to take past sales figures, visitor counts, market trends and so on into account alongside the text. You would not predict stock price movements from news sentiment alone; you would use it to complement a model built on economic indicators and historical prices. This post shows how to combine text and numeric inputs in scikit-learn (for Tfidf) and PyTorch (for LSTM / BERT).
Scikit-learn (e.g. for Tfidf)

When you have a training dataframe that contains both numeric columns and text, and you apply a simple model from scikit-learn or an equivalent library, one of the easiest approaches is to use a FeatureUnion pipeline from sklearn.pipeline.
The example below assumes X_train is a dataframe made up of several numeric columns and a text column in the last position. You can then create a FunctionTransformer to separate the numeric columns from the text column. The function passed to this FunctionTransformer can be anything, so adapt it to your input data; here it simply returns the last column as the text feature and the remaining columns as the numeric features. Tfidf vectorization is then applied to the text and the result is fed into the classifier. The sample uses RandomForest as the estimator and GridSearchCV to search for the best model among the given parameters, but it could be anything else.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Create Function Transformers to use with FeatureUnion
def get_numeric_data(x):
    # every column except the last one holds numeric features
    return [record[:-1].astype(float) for record in x]

def get_text_data(x):
    # the last column holds the text
    return [record[-1] for record in x]

transformer_numeric = FunctionTransformer(get_numeric_data)
transformer_text = FunctionTransformer(get_text_data)

# Create a pipeline to concatenate Tfidf vectors and numeric data
# Use RandomForestClassifier as an example
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('numeric_features', Pipeline([
            ('selector', transformer_numeric)
        ])),
        ('text_features', Pipeline([
            ('selector', transformer_text),
            ('vec', TfidfVectorizer(analyzer='word'))
        ]))
    ])),
    ('clf', RandomForestClassifier())
])

# Grid search parameters for RandomForest
param_grid = {'clf__n_estimators': np.linspace(1, 100, 10, dtype=int),
              'clf__min_samples_split': [3, 10],
              'clf__min_samples_leaf': [3],
              'clf__max_features': [7],
              'clf__max_depth': [None],
              'clf__criterion': ['gini'],
              'clf__bootstrap': [False]}

# Training config
kfold = StratifiedKFold(n_splits=7)
scoring = {'Accuracy': 'accuracy', 'F1': 'f1_macro'}
refit = 'F1'

# Perform the grid search
rf_model = GridSearchCV(pipeline, param_grid=param_grid, cv=kfold, scoring=scoring,
                        refit=refit, n_jobs=-1, return_train_score=True, verbose=1)
rf_model.fit(X_train, Y_train)
rf_best = rf_model.best_estimator_
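Once the grid search has finished, the fitted pipeline behaves like any other scikit-learn estimator. The snippet below is a minimal sketch of how you might check and reuse it; X_test and Y_test are assumed to be a held-out split with the same column layout as X_train and are not defined above.

# Hypothetical held-out evaluation: X_test / Y_test are assumed, not defined above
print(rf_model.best_params_)            # parameters chosen by GridSearchCV
predictions = rf_best.predict(X_test)   # rf_best was refitted on the full training data
print(rf_best.score(X_test, Y_test))    # mean accuracy of the refitted pipeline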
Scikit-learn offers a nice API for managing ML pipelines: it simply gets the job done, and more complex steps can be carried out in exactly the same way.
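As an illustration of that point (not part of the original example), the same FeatureUnion can absorb extra preprocessing steps without changing its overall shape. Below, a StandardScaler is assumed for the numeric branch and a TruncatedSVD step compresses the Tfidf vectors; the step names 'scaler' and 'svd' and the number of components are arbitrary choices for the sketch.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD

# Same structure as before, with one extra step in each branch (illustrative only)
pipeline_extended = Pipeline([
    ('features', FeatureUnion([
        ('numeric_features', Pipeline([
            ('selector', transformer_numeric),
            ('scaler', StandardScaler())                  # scale the numeric columns
        ])),
        ('text_features', Pipeline([
            ('selector', transformer_text),
            ('vec', TfidfVectorizer(analyzer='word')),
            ('svd', TruncatedSVD(n_components=100))       # compress the Tfidf vectors
        ]))
    ])),
    ('clf', RandomForestClassifier())
])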
Pytorch (e.g. LSTM, BERT)

If you apply a deep neural network, it is more common to define the layers with Tensorflow/Keras or Pytorch. Both have similar APIs and can combine text and numeric inputs in the same way; the example below uses Pytorch.
To process text in a neural network, it first has to be embedded in the form the model expects. It is also common to add a dropout layer to avoid overfitting. The model adds a dense (i.e. fully connected) layer before the concatenation with the numeric features, to balance the number of features. Finally, a dense layer is applied to produce the required number of outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel  # any library exposing a pre-trained BertModel works the same way

class LSTMTextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_size, lstm_size, dense_size,
                 numeric_feature_size, output_size, lstm_layers=1, dropout=0.1):
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.lstm_size = lstm_size
        self.output_size = output_size
        self.lstm_layers = lstm_layers

        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, lstm_size, lstm_layers, dropout=dropout, batch_first=False)
        self.dropout = nn.Dropout(0.2)
        # Dense layer on top of the LSTM output, applied before the concatenation
        # with the numeric features to balance the number of features
        self.fc1 = nn.Linear(lstm_size, dense_size)
        self.fc2 = nn.Linear(dense_size + numeric_feature_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def init_hidden(self, batch_size):
        weight = next(self.parameters()).data
        hidden = (weight.new(self.lstm_layers, batch_size, self.lstm_size).zero_(),
                  weight.new(self.lstm_layers, batch_size, self.lstm_size).zero_())
        return hidden

    def forward(self, nn_input_text, nn_input_meta, hidden_state):
        # nn_input_text: token ids of shape (seq_len, batch) since batch_first=False
        # nn_input_meta: numeric features of shape (batch, numeric_feature_size)
        nn_input_text = nn_input_text.long()
        embeds = self.embedding(nn_input_text)
        lstm_out, hidden_state = self.lstm(embeds, hidden_state)
        lstm_out = lstm_out[-1, :, :]          # keep the last time step
        lstm_out = self.dropout(lstm_out)
        dense_out = self.fc1(lstm_out)
        # Concatenate the text representation with the numeric features
        concat_layer = torch.cat((dense_out, nn_input_meta.float()), 1)
        out = self.fc2(concat_layer)
        logps = self.softmax(out)
        return logps, hidden_state


class BertTextClassifier(nn.Module):
    def __init__(self, hidden_size, dense_size, numeric_feature_size, output_size, dropout=0.1):
        super().__init__()
        self.output_size = output_size

        # Use a pre-trained BERT model and fine-tune all of its parameters
        self.bert = BertModel.from_pretrained('bert-base-uncased',
                                              output_hidden_states=True,
                                              output_attentions=True)
        for param in self.bert.parameters():
            param.requires_grad = True
        # Learnable weights over the 13 hidden states (embedding output + 12 layers)
        self.weights = nn.Parameter(torch.rand(13, 1))
        self.dropout = nn.Dropout(dropout)
        self.fc1 = nn.Linear(hidden_size, dense_size)
        self.fc2 = nn.Linear(dense_size + numeric_feature_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input_ids, nn_input_meta):
        all_hidden_states, all_attentions = self.bert(input_ids)[-2:]
        batch_size = input_ids.shape[0]
        # Stack the [CLS] vector of every hidden state: shape (13, batch, 1, 768)
        ht_cls = torch.cat(all_hidden_states)[:, :1, :].view(13, batch_size, 1, 768)
        # Attention-style weighting over the 13 hidden states
        atten = torch.sum(ht_cls * self.weights.view(13, 1, 1, 1), dim=[1, 3])
        atten = F.softmax(atten.view(-1), dim=0)
        feature = torch.sum(ht_cls * atten.view(13, 1, 1, 1), dim=[0, 2])
        dense_out = self.fc1(self.dropout(feature))
        # Concatenate the text representation with the numeric features
        concat_layer = torch.cat((dense_out, nn_input_meta.float()), 1)
        out = self.fc2(concat_layer)
        logps = self.softmax(out)
        return logps
The code above uses torch.cat in the forward pass to combine the numeric features with the text features, and feeds the result into the subsequent classification layers.
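To make the interface concrete, here is a minimal smoke test of the LSTM variant with random tensors; every size below (vocabulary, embedding, batch, sequence length, number of numeric features) is an arbitrary assumption for illustration, not a value from the original models.

# Instantiate the model with illustrative sizes
model = LSTMTextClassifier(vocab_size=5000, embed_size=128, lstm_size=64,
                           dense_size=32, numeric_feature_size=4, output_size=2)

seq_len, batch_size = 20, 8
text_batch = torch.randint(0, 5000, (seq_len, batch_size))  # token ids, shape (seq_len, batch)
meta_batch = torch.rand(batch_size, 4)                       # numeric features, shape (batch, 4)
hidden = model.init_hidden(batch_size)

# Forward pass: the text representation and the numeric features are concatenated inside
log_probs, hidden = model(text_batch, meta_batch, hidden)
print(log_probs.shape)  # torch.Size([8, 2])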