首頁>技術>

賽題介紹

這是一個廣泛的探索性資料分析,招聘餐廳遊客預測競賽。這項挑戰的目的是預測未來餐館遊客的數量。這使得它成為一個時間序列預測問題。這些資料是從日本餐館收集的。正如我們將看到的,資料集很小,易於訪問,不需要太多的記憶體或計算能力。因此,這項比賽特別適合初學者。

這些資料以8個關係檔案的形式出現,它們來自兩個獨立的收集使用者資訊的日本網站:“Hot Pepper Gourmet (hpg):類似於Yelp(搜尋和預訂)”和“AirREGI / Restaurant Board (air):類似於Square(預訂控制和收銀機)”。訓練資料基於2016年1月- 2017年4月的大部分時間範圍,而測試集包括Apr的最後一週加上2017年5月。測試資料“有意跨越日本被稱為‘黃金週’的假日周”。資料描述進一步指出:“在測試集中,有幾天餐廳是關閉的,沒有訪客。這些在得分時被忽略。訓練集漏掉了餐館關門的日子。”

air_visit_data.csv:航空餐廳的歷史訪問資料。這是主要的訓練資料集

air_reserve.csv / hpg_reserve.csv: 透過air / hpg系統預訂

air_store_info.csv / hpg_store_info.csv: 關於air / hpg餐廳的詳細資訊,包括型別和地點。

store_id_relation.csv: 連線air和hpg id

date_info.csv: 基本上是日本節日的標誌。

sample_submission.csv: 作為測試集。id由air id和訪問日期組合而成。

比賽地址:https://www.kaggle.com/c/recruit-restaurant-visitor-forecasting/leaderboard

第一名程式碼解析(1)匯入使用的依賴包
import timeimport numpy as npimport pandas as pdfrom dateutil.parser import parsefrom datetime import date, timedeltafrom sklearn.preprocessing import LabelEncoder
## 自己資料儲存位置, 依據自己更改data_path = './recruit-restaurant-visitor-forecasting/'air_reserve = pd.read_csv(data_path + 'air_reserve.csv').rename(columns={'air_store_id':'store_id'})hpg_reserve = pd.read_csv(data_path + 'hpg_reserve.csv').rename(columns={'hpg_store_id':'store_id'})air_store = pd.read_csv(data_path + 'air_store_info.csv').rename(columns={'air_store_id':'store_id'})hpg_store = pd.read_csv(data_path + 'hpg_store_info.csv').rename(columns={'hpg_store_id':'store_id'})air_visit = pd.read_csv(data_path + 'air_visit_data.csv').rename(columns={'air_store_id':'store_id'})store_id_map = pd.read_csv(data_path + 'store_id_relation.csv').set_index('hpg_store_id',drop=False)date_info = pd.read_csv(data_path + 'date_info.csv').rename(columns={'calendar_date': 'visit_date'}).drop('day_of_week',axis=1)submission = pd.read_csv(data_path + 'sample_submission.csv')
(2)檢視具體資料

可以看到同一時間有多個預訂

hpg_reserve資料

air_store

hpg_store

air_visit

store_id_map

submission

(3)處理資料
## 根據上面資料的格式處理資料## 處理submission 資料獲取提交資料的id資訊submission['visit_date'] = submission['id'].str[-10:]submission['store_id'] = submission['id'].str[:-11]## air_reserve 資訊air_reserve['visit_date'] = air_reserve['visit_datetime'].str[:10]air_reserve['reserve_date'] = air_reserve['reserve_datetime'].str[:10]air_reserve['dow'] = pd.to_datetime(air_reserve['visit_date']).dt.dayofweekhpg_reserve['visit_date'] = hpg_reserve['visit_datetime'].str[:10]hpg_reserve['reserve_date'] = hpg_reserve['reserve_datetime'].str[:10]hpg_reserve['dow'] = pd.to_datetime(hpg_reserve['visit_date']).dt.dayofweekair_visit['id'] = air_visit['store_id'] + '_' + air_visit['visit_date']hpg_reserve['store_id'] = hpg_reserve['store_id'].map(store_id_map['air_store_id']).fillna(hpg_reserve['store_id'])hpg_store['store_id'] = hpg_store['store_id'].map(store_id_map['air_store_id']).fillna(hpg_store['store_id'])hpg_store.rename(columns={'hpg_genre_name':'air_genre_name','hpg_area_name':'air_area_name'},inplace=True)data = pd.concat([air_visit, submission]).copy()data['dow'] = pd.to_datetime(data['visit_date']).dt.dayofweek## 週末和節假日設定為1date_info['holiday_flg2'] = pd.to_datetime(date_info['visit_date']).dt.dayofweekdate_info['holiday_flg2'] = ((date_info['holiday_flg2']>4) | (date_info['holiday_flg']==1)).astype(int)
## 主要對餐廳編碼air_store['air_area_name0'] = air_store['air_area_name'].apply(lambda x: x.split(' ')[0])lbl = LabelEncoder()air_store['air_genre_name'] = lbl.fit_transform(air_store['air_genre_name'])air_store['air_area_name0'] = lbl.fit_transform(air_store['air_area_name0'])## 將標籤對數處理,進一步使資料分佈接近於正態分佈,該處理方式視情況而定data['visitors'] = np.log1p(data['visitors'])data = data.merge(air_store,on='store_id',how='left')data = data.merge(date_info[['visit_date','holiday_flg','holiday_flg2']], on=['visit_date'],how='left')

處理完資料如下所示:

def my_concat(L):    result = None    for l in L:        if result is None:            result = l        else:            try:                result[l.columns.tolist()] = l            except:                print(l.head())    return result## 並且只保留data2中的資料def left_merge(data1,data2,on):    if type(on) != list:        on = [on]    if (set(on) & set(data2.columns)) != set(on):        data2_temp = data2.reset_index()    else:        data2_temp = data2.copy()    columns = [f for f in data2.columns if f not in on]    result = data1.merge(data2_temp,on=on,how='left')    result = result[columns]    return result## 獲取相差天數def diff_of_days(day1, day2):    days = (parse(day1[:10]) - parse(day2[:10])).days    return days## 增加天數def date_add_days(start_date, days):    end_date = parse(start_date[:10]) + timedelta(days=days)    end_date = end_date.strftime('%Y-%m-%d')    return end_date## 獲得固定時間段內標籤def get_label(end_date,n_day):    label_end_date = date_add_days(end_date, n_day)    label = data[(data['visit_date'] < label_end_date) & (data['visit_date'] >= end_date)].copy()    label['end_date'] = end_date    label['diff_of_day'] = label['visit_date'].apply(lambda x: diff_of_days(x,end_date))    label['month'] = label['visit_date'].str[5:7].astype(int)    label['year'] = label['visit_date'].str[:4].astype(int)    for i in [3,2,1,-1]:        date_info_temp = date_info.copy()        date_info_temp['visit_date'] = date_info_temp['visit_date'].apply(lambda x: date_add_days(x,i))        date_info_temp.rename(columns={'holiday_flg':'ahead_holiday_{}'.format(i),'holiday_flg2':'ahead_holiday2_{}'.format(i)},inplace=True)        label = label.merge(date_info_temp, on=['visit_date'],how='left')    label = label.reset_index(drop=True)    return label
(4)特徵工程程式碼
## 標籤相關統計特徵def get_store_visitor_feat(label, key, n_day):    start_date = date_add_days(key[0],-n_day)    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()    result = data_temp.groupby(['store_id'], as_index=False)['visitors'].agg({'store_min{}'.format(n_day): 'min',                                                                             'store_mean{}'.format(n_day): 'mean',                                                                             'store_median{}'.format(n_day): 'median',                                                                             'store_max{}'.format(n_day): 'max',                                                                             'store_count{}'.format(n_day): 'count',                                                                             'store_std{}'.format(n_day): 'std',                                                                             'store_skew{}'.format(n_day): 'skew'})    result = left_merge(label, result, on=['store_id']).fillna(0)    return result## 指數特徵 例如過去100天 0.985**100*visitor100 + 。。。。 + 0.985 ** 1 * visitor1def get_store_exp_visitor_feat(label, key, n_day):    start_date = date_add_days(key[0], -n_day)    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()    data_temp['visit_date'] = data_temp['visit_date'].apply(lambda x: diff_of_days(key[0],x))    data_temp['weight'] = data_temp['visit_date'].apply(lambda x: 0.985**x)    data_temp['visitors'] = data_temp['visitors'] * data_temp['weight']    result1 = data_temp.groupby(['store_id'], as_index=False)['visitors'].agg({'store_exp_mean{}'.format(n_day): 'sum'})    result2 = data_temp.groupby(['store_id'], as_index=False)['weight'].agg({'store_exp_weight_sum{}'.format(n_day): 'sum'})    result = result1.merge(result2, on=['store_id'], how='left')    result['store_exp_mean{}'.format(n_day)] = result['store_exp_mean{}'.format(n_day)]/result['store_exp_weight_sum{}'.format(n_day)]    result = left_merge(label, result, on=['store_id']).fillna(0)    return result## 每週同期的特徵:如果今天為週一就提取歷史同期週一特徵def get_store_week_feat(label, key, n_day):    start_date = date_add_days(key[0], -n_day)    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()    result = data_temp.groupby(['store_id', 'dow'], as_index=False)['visitors'].agg({'store_dow_min{}'.format(n_day): 'min',                                                                                     'store_dow_mean{}'.format(n_day): 'mean',                                                                                     'store_dow_median{}'.format(n_day): 'median',                                                                                     'store_dow_max{}'.format(n_day): 'max',                                                                                     'store_dow_count{}'.format(n_day): 'count',                                                                                     'store_dow_std{}'.format(n_day): 'std',                                                                                     'store_dow_skew{}'.format(n_day): 'skew'})    result = left_merge(label, result, on=['store_id', 'dow']).fillna(0)    return result## 差分特徵:同前一天的差值相關特徵def get_store_week_diff_feat(label, key, n_day):    start_date = date_add_days(key[0], -n_day)    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()    result = data_temp.set_index(['store_id','visit_date'])['visitors'].unstack()    result = result.diff(axis=1).iloc[:,1:]    c = result.columns    result['store_diff_mean'] = np.abs(result[c]).mean(axis=1)    result['store_diff_std'] = result[c].std(axis=1)    result['store_diff_max'] = result[c].max(axis=1)    result['store_diff_min'] = result[c].min(axis=1)    result = left_merge(label, result[['store_diff_mean', 'store_diff_std', 'store_diff_max', 'store_diff_min']],on=['store_id']).fillna(0)    return result## 和上面get_store_week_feat類似。暫時沒發現有什麼不同只是時間段不同嗎?def get_store_all_week_feat(label, key, n_day):    start_date = date_add_days(key[0], -n_day)    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()    result_temp = data_temp.groupby(['store_id', 'dow'],as_index=False)['visitors'].agg({'store_dow_mean{}'.format(n_day): 'mean',                                                                     'store_dow_median{}'.format(n_day): 'median',                                                                     'store_dow_sum{}'.format(n_day): 'max',                                                                     'store_dow_count{}'.format(n_day): 'count'})    result = pd.DataFrame()    for i in range(7):        result_sub = result_temp[result_temp['dow']==i].copy()        result_sub = result_sub.set_index('store_id')        result_sub = result_sub.add_prefix(str(i))        result_sub = left_merge(label, result_sub, on=['store_id']).fillna(0)        result = pd.concat([result,result_sub],axis=1)    return result## 和上面的區別在於是提取同為周幾的特徵,並增加更多係數[0.9,0.95,0.97,0.98,0.985,0.99,0.999,0.9999]def get_store_week_exp_feat(label, key, n_day):    start_date = date_add_days(key[0], -n_day)    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()    data_temp['visit_date'] = data_temp['visit_date'].apply(lambda x: diff_of_days(key[0],x))    data_temp['visitors2'] = data_temp['visitors']    result = None    for i in [0.9,0.95,0.97,0.98,0.985,0.99,0.999,0.9999]:        data_temp['weight'] = data_temp['visit_date'].apply(lambda x: i**x)        data_temp['visitors1'] = data_temp['visitors'] * data_temp['weight']        data_temp['visitors2'] = data_temp['visitors2'] * data_temp['weight']        result1 = data_temp.groupby(['store_id', 'dow'], as_index=False)['visitors1'].agg({'store_dow_exp_mean{}_{}'.format(n_day,i): 'sum'})        result3 = data_temp.groupby(['store_id', 'dow'], as_index=False)['visitors2'].agg({'store_dow_exp_mean2{}_{}'.format(n_day, i): 'sum'})        result2 = data_temp.groupby(['store_id', 'dow'], as_index=False)['weight'].agg({'store_dow_exp_weight_sum{}_{}'.format(n_day,i): 'sum'})        result_temp = result1.merge(result2, on=['store_id', 'dow'], how='left')        result_temp = result_temp.merge(result3, on=['store_id', 'dow'], how='left')        result_temp['store_dow_exp_mean{}_{}'.format(n_day,i)] = result_temp['store_dow_exp_mean{}_{}'.format(n_day,i)]/result_temp['store_dow_exp_weight_sum{}_{}'.format(n_day,i)]        result_temp['store_dow_exp_mean2{}_{}'.format(n_day, i)] = result_temp[ 'store_dow_exp_mean2{}_{}'.format(n_day, i)]/result_temp['store_dow_exp_weight_sum{}_{}'.format(n_day, i)]        if result is None:            result = result_temp        else:            result = result.merge(result_temp,on=['store_id','dow'],how='left')    result = left_merge(label, result, on=['store_id', 'dow']).fillna(0)    return result##獲取節假日的特徵def get_store_holiday_feat(label, key, n_day):    start_date = date_add_days(key[0], -n_day)    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()    result1 = data_temp.groupby(['store_id', 'holiday_flg'], as_index=False)['visitors'].agg(        {'store_holiday_min{}'.format(n_day): 'min',         'store_holiday_mean{}'.format(n_day): 'mean',         'store_holiday_median{}'.format(n_day): 'median',         'store_holiday_max{}'.format(n_day): 'max',         'store_holiday_count{}'.format(n_day): 'count',         'store_holiday_std{}'.format(n_day): 'std',         'store_holiday_skew{}'.format(n_day): 'skew'})    result1 = left_merge(label, result1, on=['store_id', 'holiday_flg']).fillna(0)    result2 = data_temp.groupby(['store_id', 'holiday_flg2'], as_index=False)['visitors'].agg(        {'store_holiday2_min{}'.format(n_day): 'min',         'store_holiday2_mean{}'.format(n_day): 'mean',         'store_holiday2_median{}'.format(n_day): 'median',         'store_holiday2_max{}'.format(n_day): 'max',         'store_holiday2_count{}'.format(n_day): 'count',         'store_holiday2_std{}'.format(n_day): 'std',         'store_holiday2_skew{}'.format(n_day): 'skew'})    result2 = left_merge(label, result2, on=['store_id', 'holiday_flg2']).fillna(0)    result = pd.concat([result1, result2], axis=1)    return result### 下面特徵和上面類似只是處理不同資料##  獲取genre特徵def get_genre_visitor_feat(label, key, n_day):    start_date = date_add_days(key[0],-n_day)    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()    result = data_temp.groupby(['air_genre_name'], as_index=False)['visitors'].agg({'genre_min{}'.format(n_day): 'min',                                                                             'genre_mean{}'.format(n_day): 'mean',                                                                             'genre_median{}'.format(n_day): 'median',                                                                             'genre_max{}'.format(n_day): 'max',                                                                             'genre_count{}'.format(n_day): 'count',                                                                             'genre_std{}'.format(n_day): 'std',                                                                             'genre_skew{}'.format(n_day): 'skew'})    result = left_merge(label, result, on=['air_genre_name']).fillna(0)    return resultdef get_genre_exp_visitor_feat(label, key, n_day):    start_date = date_add_days(key[0], -n_day)    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()    data_temp['visit_date'] = data_temp['visit_date'].apply(lambda x: diff_of_days(key[0],x))    data_temp['weight'] = data_temp['visit_date'].apply(lambda x: 0.985**x)    data_temp['visitors'] = data_temp['visitors'] * data_temp['weight']    result1 = data_temp.groupby(['air_genre_name'], as_index=False)['visitors'].agg({'genre_exp_mean{}'.format(n_day): 'sum'})    result2 = data_temp.groupby(['air_genre_name'], as_index=False)['weight'].agg({'genre_exp_weight_sum{}'.format(n_day): 'sum'})    result = result1.merge(result2, on=['air_genre_name'], how='left')    result['genre_exp_mean{}'.format(n_day)] = result['genre_exp_mean{}'.format(n_day)]/result['genre_exp_weight_sum{}'.format(n_day)]    result = left_merge(label, result, on=['air_genre_name']).fillna(0)    return resultdef get_genre_week_feat(label, key, n_day):    start_date = date_add_days(key[0], -n_day)    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()    result = data_temp.groupby(['air_genre_name', 'dow'], as_index=False)['visitors'].agg({'genre_dow_min{}'.format(n_day): 'min',                                                                                         'genre_dow_mean{}'.format(n_day): 'mean',                                                                                         'genre_dow_median{}'.format(n_day): 'median',                                                                                         'genre_dow_max{}'.format(n_day): 'max',                                                                                         'genre_dow_count{}'.format(n_day): 'count',                                                                                         'genre_dow_std{}'.format(n_day): 'std',                                                                                         'genre_dow_skew{}'.format(n_day): 'skew'})    result = left_merge(label, result, on=['air_genre_name', 'dow']).fillna(0)    return resultdef get_genre_week_exp_feat(label, key, n_day):    start_date = date_add_days(key[0], -n_day)    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()    data_temp['visit_date'] = data_temp['visit_date'].apply(lambda x: diff_of_days(key[0],x))    data_temp['weight'] = data_temp['visit_date'].apply(lambda x: 0.985**x)    data_temp['visitors'] = data_temp['visitors'] * data_temp['weight']    result1 = data_temp.groupby(['air_genre_name', 'dow'], as_index=False)['visitors'].agg({'genre_dow_exp_mean{}'.format(n_day): 'sum'})    result2 = data_temp.groupby(['air_genre_name', 'dow'], as_index=False)['weight'].agg({'genre_dow_exp_weight_sum{}'.format(n_day): 'sum'})    result = result1.merge(result2, on=['air_genre_name', 'dow'], how='left')    result['genre_dow_exp_mean{}'.format(n_day)] = result['genre_dow_exp_mean{}'.format(n_day)]/result['genre_dow_exp_weight_sum{}'.format(n_day)]    result = left_merge(label, result, on=['air_genre_name', 'dow']).fillna(0)    return resultdef get_first_last_time(label, key, n_day):    start_date = date_add_days(key[0], -n_day)    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()    data_temp = data_temp.sort_values('visit_date')    result = data_temp.groupby('store_id')['visit_date'].agg({'first_time':lambda x: diff_of_days(key[0],np.min(x)),                                                              'last_time':lambda x: diff_of_days(key[0],np.max(x)),})    result = left_merge(label, result, on=['store_id']).fillna(0)    return result# air_reserve 相關特徵def get_reserve_feat(label,key):    label_end_date = date_add_days(key[0], key[1])    air_reserve_temp = air_reserve[(air_reserve.visit_date >= key[0]) &             # key[0] 是'2017-04-23'                                   (air_reserve.visit_date < label_end_date) &      # label_end_date 是'2017-05-31'                                   (air_reserve.reserve_date < key[0])].copy()    air_reserve_temp = air_reserve_temp.merge(air_store,on='store_id',how='left')    air_reserve_temp['diff_time'] = (pd.to_datetime(air_reserve['visit_datetime'])-pd.to_datetime(air_reserve['reserve_datetime'])).dt.days    air_reserve_temp = air_reserve_temp.merge(air_store,on='store_id')    air_result = air_reserve_temp.groupby(['store_id', 'visit_date'])['reserve_visitors'].agg(        {'air_reserve_visitors': 'sum',         'air_reserve_count': 'count'})    air_store_diff_time_mean = air_reserve_temp.groupby(['store_id', 'visit_date'])['diff_time'].agg(        {'air_store_diff_time_mean': 'mean'})    air_diff_time_mean = air_reserve_temp.groupby(['visit_date'])['diff_time'].agg(        {'air_diff_time_mean': 'mean'})    air_result = air_result.unstack().fillna(0).stack()    air_date_result = air_reserve_temp.groupby(['visit_date'])['reserve_visitors'].agg({        'air_date_visitors': 'sum',        'air_date_count': 'count'})    hpg_reserve_temp = hpg_reserve[(hpg_reserve.visit_date >= key[0]) & (hpg_reserve.visit_date < label_end_date) & (hpg_reserve.reserve_date < key[0])].copy()    hpg_reserve_temp['diff_time'] = (pd.to_datetime(hpg_reserve['visit_datetime']) - pd.to_datetime(hpg_reserve['reserve_datetime'])).dt.days    hpg_result = hpg_reserve_temp.groupby(['store_id', 'visit_date'])['reserve_visitors'].agg({'hpg_reserve_visitors': 'sum',                                                                                               'hpg_reserve_count': 'count'})    hpg_result = hpg_result.unstack().fillna(0).stack()    hpg_date_result = hpg_reserve_temp.groupby(['visit_date'])['reserve_visitors'].agg({        'hpg_date_visitors': 'sum',        'hpg_date_count': 'count'})    hpg_store_diff_time_mean = hpg_reserve_temp.groupby(['store_id', 'visit_date'])['diff_time'].agg(        {'hpg_store_diff_time_mean': 'mean'})    hpg_diff_time_mean = hpg_reserve_temp.groupby(['visit_date'])['diff_time'].agg(        {'hpg_diff_time_mean': 'mean'})    air_result = left_merge(label, air_result, on=['store_id','visit_date']).fillna(0)    air_store_diff_time_mean = left_merge(label, air_store_diff_time_mean, on=['store_id', 'visit_date']).fillna(0)    hpg_result = left_merge(label, hpg_result, on=['store_id', 'visit_date']).fillna(0)    hpg_store_diff_time_mean = left_merge(label, hpg_store_diff_time_mean, on=['store_id', 'visit_date']).fillna(0)    air_date_result = left_merge(label, air_date_result, on=['visit_date']).fillna(0)    air_diff_time_mean = left_merge(label, air_diff_time_mean, on=['visit_date']).fillna(0)    hpg_date_result = left_merge(label, hpg_date_result, on=['visit_date']).fillna(0)    hpg_diff_time_mean = left_merge(label, hpg_diff_time_mean, on=['visit_date']).fillna(0)    result = pd.concat([air_result,hpg_result,air_date_result,hpg_date_result,air_store_diff_time_mean,                        hpg_store_diff_time_mean,air_diff_time_mean,hpg_diff_time_mean],axis=1)    return result# second featuredef second_feat(result):    result['store_mean_14_28_rate'] = result['store_mean14']/(result['store_mean28']+0.01)    result['store_mean_28_56_rate'] = result['store_mean28'] / (result['store_mean56'] + 0.01)    result['store_mean_56_1000_rate'] = result['store_mean56'] / (result['store_mean1000'] + 0.01)    result['genre_mean_28_56_rate'] = result['genre_mean28'] / (result['genre_mean56'] + 0.01)    result['sgenre_mean_56_1000_rate'] = result['genre_mean56'] / (result['genre_mean1000'] + 0.01)    return result
(5) 提取特徵並訓練
# import pdb def make_feats(end_date,n_day):    t0 = time.time()    key = end_date,n_day    print('data key為:{}'.format(key))    print('add label')    label = get_label(end_date,n_day)    print('make feature...')    result = [label]    result.append(get_store_visitor_feat(label, key, 1000))        # store features#     pdb.set_trace()    result.append(get_store_visitor_feat(label, key, 56))          # store features    result.append(get_store_visitor_feat(label, key, 28))          # store features    result.append(get_store_visitor_feat(label, key, 14))          # store features    result.append(get_store_exp_visitor_feat(label, key, 1000))    # store exp features    result.append(get_store_week_feat(label, key, 1000))           # store dow features    result.append(get_store_week_feat(label, key, 56))             # store dow features    result.append(get_store_week_feat(label, key, 28))             # store dow features    result.append(get_store_week_feat(label, key, 14))             # store dow features    result.append(get_store_week_diff_feat(label, key, 58))       # store dow diff features    result.append(get_store_week_diff_feat(label, key, 1000))      # store dow diff features    result.append(get_store_all_week_feat(label, key, 1000))       # store all week feat    result.append(get_store_week_exp_feat(label, key, 1000))       # store dow exp feat    result.append(get_store_holiday_feat(label, key, 1000))        # store holiday feat    result.append(get_genre_visitor_feat(label, key, 1000))         # genre feature    result.append(get_genre_visitor_feat(label, key, 56))           # genre feature    result.append(get_genre_visitor_feat(label, key, 28))           # genre feature    result.append(get_genre_exp_visitor_feat(label, key, 1000))     # genre feature    result.append(get_genre_week_feat(label, key, 1000))            # genre dow feature    result.append(get_genre_week_feat(label, key, 56))              # genre dow feature    result.append(get_genre_week_feat(label, key, 28))              # genre dow feature    result.append(get_genre_week_exp_feat(label, key, 1000))        # genre dow exp feature    result.append(get_reserve_feat(label,key))                      # air_reserve    result.append(get_first_last_time(label,key,1000))             # first time and last time    result.append(label)    print('merge...')    result = my_concat(result)    result = second_feat(result)    print('data shape:{}'.format(result.shape))    print('spending {}s'.format(time.time() - t0))    return resulttrain_feat = pd.DataFrame()start_date = '2017-03-12'for i in range(58):    train_feat_sub = make_feats(date_add_days(start_date, i*(-7)),39)    train_feat = pd.concat([train_feat,train_feat_sub])for i in range(1,6):    train_feat_sub = make_feats(date_add_days(start_date,i*(7)),42-(i*7))    train_feat = pd.concat([train_feat,train_feat_sub])test_feat = make_feats(date_add_days(start_date, 42),39)predictors = [f for f in test_feat.columns if f not in (['id','store_id','visit_date','end_date','air_area_name','visitors','month'])]import datetimeimport lightgbm as lgbparams = {    'learning_rate': 0.02,    'boosting_type': 'gbdt',    'objective': 'regression',    'metric': 'rmse',    'sub_feature': 0.7,    'num_leaves': 60,    'min_data': 100,    'min_hessian': 1,    'verbose': -1,}t0 = time.time()lgb_train = lgb.Dataset(train_feat[predictors], train_feat['visitors'])lgb_test = lgb.Dataset(test_feat[predictors], test_feat['visitors'])gbm = lgb.train(params,lgb_train,2300,verbose_eval=100)pred = gbm.predict(test_feat[predictors])print('訓練用時{}秒'.format(time.time() - t0))subm = pd.DataFrame({'id':test_feat.store_id + '_' + test_feat.visit_date,'visitors':np.expm1(pred)})subm = submission[['id']].merge(subm,on='id',how='left').fillna(0)subm.to_csv('./sub{}.csv'.format(datetime.datetime.now().strftime('%Y%m%d_%H%M%S')),                  index=False,  float_format='%.4f')
(6)kaggle提交結果:大概在20名左右,可能需要調參,感覺這是作者私榜的最終程式碼。

總榜結果

總結

(1) 作者將能提取的特徵都提取了,包括歷史同期趨勢特徵、差分特徵和一些相關統計特徵。但是感覺缺少和歷史同期的差分特徵即同為週三的差值(後續有時間加進去看看結果)

(2)作者處理資料獲取標籤的方法值得學習,並沒有將最後不夠一個週期的資料剔除掉。下圖為作者的構建方法。

(3)沒有使用比賽的評價指標作為最佳化函式目標函式訓練,可能會有結果提升。

3
最新評論
  • BSA-TRITC(10mg/ml) TRITC-BSA 牛血清白蛋白改性標記羅丹明
  • Python基礎 列表詳解(純乾貨)