Competition Overview
This is a broad exploratory analysis of the Recruit Restaurant Visitor Forecasting competition. The goal of the challenge is to predict the number of future visitors to restaurants, which makes it a time series forecasting problem. The data were collected from restaurants in Japan. As we will see, the dataset is small and easy to work with, requiring little memory or compute, so this competition is particularly well suited to beginners.
The data come as eight relational files from two separate Japanese sites that collect user information: "Hot Pepper Gourmet (hpg): similar to Yelp (search and reservations)" and "AirREGI / Restaurant Board (air): similar to Square (reservation control and cash register)". The training data covers most of the period from January 2016 through April 2017, while the test set covers the last week of April plus May 2017. The test period "intentionally spans a holiday week in Japan called the 'Golden Week'". The data description further notes: "There are days in the test set where the restaurants were closed and had no visitors. These are ignored in scoring. The training set omits days for which the restaurants were closed."
air_visit_data.csv: historical visit data for the air restaurants. This is the main training dataset.
air_reserve.csv / hpg_reserve.csv: reservations made through the air / hpg systems.
air_store_info.csv / hpg_store_info.csv: details about the air / hpg restaurants, including genre and location.
store_id_relation.csv: links the air and hpg ids.
date_info.csv: essentially flags for Japanese holidays.
sample_submission.csv: serves as the test set. The id is formed by concatenating the air id and the visit date.
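To make the id linkage concrete, here is a minimal sketch (the ids are made up) of how a row keyed by an hpg id can be mapped onto its air id via store_id_relation.csv, falling back to the original id when no link exists:

```python
import pandas as pd

# Synthetic rows mimicking the store_id_relation.csv schema (ids are invented)
relation = pd.DataFrame({
    'air_store_id': ['air_0001'],
    'hpg_store_id': ['hpg_9001'],
})
hpg_rows = pd.DataFrame({
    'hpg_store_id': ['hpg_9001', 'hpg_9002'],
    'reserve_visitors': [4, 2],
})

# Map hpg ids to air ids where a link exists; keep the hpg id otherwise
id_map = relation.set_index('hpg_store_id')['air_store_id']
hpg_rows['store_id'] = hpg_rows['hpg_store_id'].map(id_map).fillna(hpg_rows['hpg_store_id'])
print(hpg_rows['store_id'].tolist())  # ['air_0001', 'hpg_9002']
```

This is the same `map(...).fillna(...)` idiom the solution uses when unifying the two systems.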
Competition page: https://www.kaggle.com/c/recruit-restaurant-visitor-forecasting/leaderboard
First-Place Code Walkthrough
(1) Import the dependencies

```python
import time
import numpy as np
import pandas as pd
from dateutil.parser import parse
from datetime import date, timedelta
from sklearn.preprocessing import LabelEncoder
```
```python
# Data directory -- change this to your own location
data_path = './recruit-restaurant-visitor-forecasting/'
air_reserve = pd.read_csv(data_path + 'air_reserve.csv').rename(columns={'air_store_id': 'store_id'})
hpg_reserve = pd.read_csv(data_path + 'hpg_reserve.csv').rename(columns={'hpg_store_id': 'store_id'})
air_store = pd.read_csv(data_path + 'air_store_info.csv').rename(columns={'air_store_id': 'store_id'})
hpg_store = pd.read_csv(data_path + 'hpg_store_info.csv').rename(columns={'hpg_store_id': 'store_id'})
air_visit = pd.read_csv(data_path + 'air_visit_data.csv').rename(columns={'air_store_id': 'store_id'})
store_id_map = pd.read_csv(data_path + 'store_id_relation.csv').set_index('hpg_store_id', drop=False)
date_info = pd.read_csv(data_path + 'date_info.csv').rename(columns={'calendar_date': 'visit_date'}).drop('day_of_week', axis=1)
submission = pd.read_csv(data_path + 'sample_submission.csv')
```
(2) Inspect the data. Note that a single time slot can contain multiple reservations.
hpg_reserve data
air_store
hpg_store
air_visit
store_id_map
submission
(3) Process the data

```python
# Process the data according to the formats above
# Parse the submission ids to recover store_id and visit_date
submission['visit_date'] = submission['id'].str[-10:]
submission['store_id'] = submission['id'].str[:-11]
# air_reserve / hpg_reserve information
air_reserve['visit_date'] = air_reserve['visit_datetime'].str[:10]
air_reserve['reserve_date'] = air_reserve['reserve_datetime'].str[:10]
air_reserve['dow'] = pd.to_datetime(air_reserve['visit_date']).dt.dayofweek
hpg_reserve['visit_date'] = hpg_reserve['visit_datetime'].str[:10]
hpg_reserve['reserve_date'] = hpg_reserve['reserve_datetime'].str[:10]
hpg_reserve['dow'] = pd.to_datetime(hpg_reserve['visit_date']).dt.dayofweek
air_visit['id'] = air_visit['store_id'] + '_' + air_visit['visit_date']
# Map hpg store ids onto air ids where a link exists
hpg_reserve['store_id'] = hpg_reserve['store_id'].map(store_id_map['air_store_id']).fillna(hpg_reserve['store_id'])
hpg_store['store_id'] = hpg_store['store_id'].map(store_id_map['air_store_id']).fillna(hpg_store['store_id'])
hpg_store.rename(columns={'hpg_genre_name': 'air_genre_name', 'hpg_area_name': 'air_area_name'}, inplace=True)
data = pd.concat([air_visit, submission]).copy()
data['dow'] = pd.to_datetime(data['visit_date']).dt.dayofweek
# Flag weekends and holidays as 1
date_info['holiday_flg2'] = pd.to_datetime(date_info['visit_date']).dt.dayofweek
date_info['holiday_flg2'] = ((date_info['holiday_flg2'] > 4) | (date_info['holiday_flg'] == 1)).astype(int)
```
```python
# Encode the restaurant categorical columns
air_store['air_area_name0'] = air_store['air_area_name'].apply(lambda x: x.split(' ')[0])
lbl = LabelEncoder()
air_store['air_genre_name'] = lbl.fit_transform(air_store['air_genre_name'])
air_store['air_area_name0'] = lbl.fit_transform(air_store['air_area_name0'])
# Log-transform the target to bring its distribution closer to normal;
# whether this helps depends on the data
data['visitors'] = np.log1p(data['visitors'])
data = data.merge(air_store, on='store_id', how='left')
data = data.merge(date_info[['visit_date', 'holiday_flg', 'holiday_flg2']], on=['visit_date'], how='left')
```
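Because the target is transformed with `np.log1p`, minimizing plain RMSE on the transformed values is equivalent to minimizing the competition's RMSLE on raw visitor counts. A quick numeric check with made-up numbers:

```python
import numpy as np

y_true = np.array([10.0, 120.0, 0.0])
y_pred = np.array([12.0, 100.0, 1.0])

# RMSLE as defined by the competition metric
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# RMSE computed on log1p-transformed targets -- the quantity the model optimizes
t_true, t_pred = np.log1p(y_true), np.log1p(y_pred)
rmse_on_log = np.sqrt(np.mean((t_pred - t_true) ** 2))

# np.expm1 inverts the transform when converting predictions back to counts
roundtrip = np.expm1(np.log1p(y_true))
```

The two quantities are identical by construction, and `np.expm1` recovers the original counts exactly, which is why the submission step applies it to the predictions.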
The processed data looks like this:
```python
# Concatenate a list of feature frames column-wise
def my_concat(L):
    result = None
    for l in L:
        if result is None:
            result = l
        else:
            try:
                result[l.columns.tolist()] = l
            except:
                print(l.head())
    return result

# Left-join data1 with data2, keeping only the columns from data2
def left_merge(data1, data2, on):
    if type(on) != list:
        on = [on]
    if (set(on) & set(data2.columns)) != set(on):
        data2_temp = data2.reset_index()
    else:
        data2_temp = data2.copy()
    columns = [f for f in data2.columns if f not in on]
    result = data1.merge(data2_temp, on=on, how='left')
    result = result[columns]
    return result

# Number of days between two dates
def diff_of_days(day1, day2):
    days = (parse(day1[:10]) - parse(day2[:10])).days
    return days

# Add a number of days to a date
def date_add_days(start_date, days):
    end_date = parse(start_date[:10]) + timedelta(days=days)
    end_date = end_date.strftime('%Y-%m-%d')
    return end_date

# Extract the labels for a fixed window of days
def get_label(end_date, n_day):
    label_end_date = date_add_days(end_date, n_day)
    label = data[(data['visit_date'] < label_end_date) & (data['visit_date'] >= end_date)].copy()
    label['end_date'] = end_date
    label['diff_of_day'] = label['visit_date'].apply(lambda x: diff_of_days(x, end_date))
    label['month'] = label['visit_date'].str[5:7].astype(int)
    label['year'] = label['visit_date'].str[:4].astype(int)
    for i in [3, 2, 1, -1]:
        date_info_temp = date_info.copy()
        date_info_temp['visit_date'] = date_info_temp['visit_date'].apply(lambda x: date_add_days(x, i))
        date_info_temp.rename(columns={'holiday_flg': 'ahead_holiday_{}'.format(i),
                                       'holiday_flg2': 'ahead_holiday2_{}'.format(i)}, inplace=True)
        label = label.merge(date_info_temp, on=['visit_date'], how='left')
    label = label.reset_index(drop=True)
    return label
```
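A quick sanity check of the two date helpers (redefined here so the snippet runs standalone):

```python
from datetime import timedelta
from dateutil.parser import parse

def date_add_days(start_date, days):
    end_date = parse(start_date[:10]) + timedelta(days=days)
    return end_date.strftime('%Y-%m-%d')

def diff_of_days(day1, day2):
    return (parse(day1[:10]) - parse(day2[:10])).days

# A 39-day label window starting at the test cutoff ends on 2017-06-01,
# and cutoffs one week apart differ by 7 days
print(date_add_days('2017-04-23', 39))           # 2017-06-01
print(diff_of_days('2017-04-30', '2017-04-23'))  # 7
```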
(4) Feature engineering code

```python
# Statistics of the target over a trailing window of n_day days
def get_store_visitor_feat(label, key, n_day):
    start_date = date_add_days(key[0], -n_day)
    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()
    result = data_temp.groupby(['store_id'], as_index=False)['visitors'].agg(
        {'store_min{}'.format(n_day): 'min', 'store_mean{}'.format(n_day): 'mean',
         'store_median{}'.format(n_day): 'median', 'store_max{}'.format(n_day): 'max',
         'store_count{}'.format(n_day): 'count', 'store_std{}'.format(n_day): 'std',
         'store_skew{}'.format(n_day): 'skew'})
    result = left_merge(label, result, on=['store_id']).fillna(0)
    return result

# Exponentially decayed features, e.g. over the past 100 days:
# 0.985**100 * visitor100 + ... + 0.985**1 * visitor1
def get_store_exp_visitor_feat(label, key, n_day):
    start_date = date_add_days(key[0], -n_day)
    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()
    data_temp['visit_date'] = data_temp['visit_date'].apply(lambda x: diff_of_days(key[0], x))
    data_temp['weight'] = data_temp['visit_date'].apply(lambda x: 0.985 ** x)
    data_temp['visitors'] = data_temp['visitors'] * data_temp['weight']
    result1 = data_temp.groupby(['store_id'], as_index=False)['visitors'].agg(
        {'store_exp_mean{}'.format(n_day): 'sum'})
    result2 = data_temp.groupby(['store_id'], as_index=False)['weight'].agg(
        {'store_exp_weight_sum{}'.format(n_day): 'sum'})
    result = result1.merge(result2, on=['store_id'], how='left')
    result['store_exp_mean{}'.format(n_day)] = result['store_exp_mean{}'.format(n_day)] / result['store_exp_weight_sum{}'.format(n_day)]
    result = left_merge(label, result, on=['store_id']).fillna(0)
    return result

# Same-weekday features: if today is a Monday, use statistics over historical Mondays
def get_store_week_feat(label, key, n_day):
    start_date = date_add_days(key[0], -n_day)
    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()
    result = data_temp.groupby(['store_id', 'dow'], as_index=False)['visitors'].agg(
        {'store_dow_min{}'.format(n_day): 'min', 'store_dow_mean{}'.format(n_day): 'mean',
         'store_dow_median{}'.format(n_day): 'median', 'store_dow_max{}'.format(n_day): 'max',
         'store_dow_count{}'.format(n_day): 'count', 'store_dow_std{}'.format(n_day): 'std',
         'store_dow_skew{}'.format(n_day): 'skew'})
    result = left_merge(label, result, on=['store_id', 'dow']).fillna(0)
    return result

# Difference features: statistics of day-over-day changes
def get_store_week_diff_feat(label, key, n_day):
    start_date = date_add_days(key[0], -n_day)
    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()
    result = data_temp.set_index(['store_id', 'visit_date'])['visitors'].unstack()
    result = result.diff(axis=1).iloc[:, 1:]
    c = result.columns
    result['store_diff_mean'] = np.abs(result[c]).mean(axis=1)
    result['store_diff_std'] = result[c].std(axis=1)
    result['store_diff_max'] = result[c].max(axis=1)
    result['store_diff_min'] = result[c].min(axis=1)
    result = left_merge(label, result[['store_diff_mean', 'store_diff_std',
                                       'store_diff_max', 'store_diff_min']], on=['store_id']).fillna(0)
    return result

# Similar to get_store_week_feat above. I haven't spotted a real difference yet --
# is it just the time span? (Note it does pivot each weekday into its own columns.)
def get_store_all_week_feat(label, key, n_day):
    start_date = date_add_days(key[0], -n_day)
    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()
    result_temp = data_temp.groupby(['store_id', 'dow'], as_index=False)['visitors'].agg(
        {'store_dow_mean{}'.format(n_day): 'mean', 'store_dow_median{}'.format(n_day): 'median',
         'store_dow_sum{}'.format(n_day): 'max', 'store_dow_count{}'.format(n_day): 'count'})
    result = pd.DataFrame()
    for i in range(7):
        result_sub = result_temp[result_temp['dow'] == i].copy()
        result_sub = result_sub.set_index('store_id')
        result_sub = result_sub.add_prefix(str(i))
        result_sub = left_merge(label, result_sub, on=['store_id']).fillna(0)
        result = pd.concat([result, result_sub], axis=1)
    return result

# Differs from the above by extracting same-weekday features with a range of
# decay factors [0.9, 0.95, 0.97, 0.98, 0.985, 0.99, 0.999, 0.9999]
def get_store_week_exp_feat(label, key, n_day):
    start_date = date_add_days(key[0], -n_day)
    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()
    data_temp['visit_date'] = data_temp['visit_date'].apply(lambda x: diff_of_days(key[0], x))
    data_temp['visitors2'] = data_temp['visitors']
    result = None
    for i in [0.9, 0.95, 0.97, 0.98, 0.985, 0.99, 0.999, 0.9999]:
        data_temp['weight'] = data_temp['visit_date'].apply(lambda x: i ** x)
        data_temp['visitors1'] = data_temp['visitors'] * data_temp['weight']
        data_temp['visitors2'] = data_temp['visitors2'] * data_temp['weight']
        result1 = data_temp.groupby(['store_id', 'dow'], as_index=False)['visitors1'].agg(
            {'store_dow_exp_mean{}_{}'.format(n_day, i): 'sum'})
        result3 = data_temp.groupby(['store_id', 'dow'], as_index=False)['visitors2'].agg(
            {'store_dow_exp_mean2{}_{}'.format(n_day, i): 'sum'})
        result2 = data_temp.groupby(['store_id', 'dow'], as_index=False)['weight'].agg(
            {'store_dow_exp_weight_sum{}_{}'.format(n_day, i): 'sum'})
        result_temp = result1.merge(result2, on=['store_id', 'dow'], how='left')
        result_temp = result_temp.merge(result3, on=['store_id', 'dow'], how='left')
        result_temp['store_dow_exp_mean{}_{}'.format(n_day, i)] = result_temp['store_dow_exp_mean{}_{}'.format(n_day, i)] / result_temp['store_dow_exp_weight_sum{}_{}'.format(n_day, i)]
        result_temp['store_dow_exp_mean2{}_{}'.format(n_day, i)] = result_temp['store_dow_exp_mean2{}_{}'.format(n_day, i)] / result_temp['store_dow_exp_weight_sum{}_{}'.format(n_day, i)]
        if result is None:
            result = result_temp
        else:
            result = result.merge(result_temp, on=['store_id', 'dow'], how='left')
    result = left_merge(label, result, on=['store_id', 'dow']).fillna(0)
    return result

# Holiday features
def get_store_holiday_feat(label, key, n_day):
    start_date = date_add_days(key[0], -n_day)
    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()
    result1 = data_temp.groupby(['store_id', 'holiday_flg'], as_index=False)['visitors'].agg(
        {'store_holiday_min{}'.format(n_day): 'min', 'store_holiday_mean{}'.format(n_day): 'mean',
         'store_holiday_median{}'.format(n_day): 'median', 'store_holiday_max{}'.format(n_day): 'max',
         'store_holiday_count{}'.format(n_day): 'count', 'store_holiday_std{}'.format(n_day): 'std',
         'store_holiday_skew{}'.format(n_day): 'skew'})
    result1 = left_merge(label, result1, on=['store_id', 'holiday_flg']).fillna(0)
    result2 = data_temp.groupby(['store_id', 'holiday_flg2'], as_index=False)['visitors'].agg(
        {'store_holiday2_min{}'.format(n_day): 'min', 'store_holiday2_mean{}'.format(n_day): 'mean',
         'store_holiday2_median{}'.format(n_day): 'median', 'store_holiday2_max{}'.format(n_day): 'max',
         'store_holiday2_count{}'.format(n_day): 'count', 'store_holiday2_std{}'.format(n_day): 'std',
         'store_holiday2_skew{}'.format(n_day): 'skew'})
    result2 = left_merge(label, result2, on=['store_id', 'holiday_flg2']).fillna(0)
    result = pd.concat([result1, result2], axis=1)
    return result

# The features below are analogous to the ones above, applied to different groupings
# Genre features
def get_genre_visitor_feat(label, key, n_day):
    start_date = date_add_days(key[0], -n_day)
    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()
    result = data_temp.groupby(['air_genre_name'], as_index=False)['visitors'].agg(
        {'genre_min{}'.format(n_day): 'min', 'genre_mean{}'.format(n_day): 'mean',
         'genre_median{}'.format(n_day): 'median', 'genre_max{}'.format(n_day): 'max',
         'genre_count{}'.format(n_day): 'count', 'genre_std{}'.format(n_day): 'std',
         'genre_skew{}'.format(n_day): 'skew'})
    result = left_merge(label, result, on=['air_genre_name']).fillna(0)
    return result

def get_genre_exp_visitor_feat(label, key, n_day):
    start_date = date_add_days(key[0], -n_day)
    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()
    data_temp['visit_date'] = data_temp['visit_date'].apply(lambda x: diff_of_days(key[0], x))
    data_temp['weight'] = data_temp['visit_date'].apply(lambda x: 0.985 ** x)
    data_temp['visitors'] = data_temp['visitors'] * data_temp['weight']
    result1 = data_temp.groupby(['air_genre_name'], as_index=False)['visitors'].agg(
        {'genre_exp_mean{}'.format(n_day): 'sum'})
    result2 = data_temp.groupby(['air_genre_name'], as_index=False)['weight'].agg(
        {'genre_exp_weight_sum{}'.format(n_day): 'sum'})
    result = result1.merge(result2, on=['air_genre_name'], how='left')
    result['genre_exp_mean{}'.format(n_day)] = result['genre_exp_mean{}'.format(n_day)] / result['genre_exp_weight_sum{}'.format(n_day)]
    result = left_merge(label, result, on=['air_genre_name']).fillna(0)
    return result

def get_genre_week_feat(label, key, n_day):
    start_date = date_add_days(key[0], -n_day)
    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()
    result = data_temp.groupby(['air_genre_name', 'dow'], as_index=False)['visitors'].agg(
        {'genre_dow_min{}'.format(n_day): 'min', 'genre_dow_mean{}'.format(n_day): 'mean',
         'genre_dow_median{}'.format(n_day): 'median', 'genre_dow_max{}'.format(n_day): 'max',
         'genre_dow_count{}'.format(n_day): 'count', 'genre_dow_std{}'.format(n_day): 'std',
         'genre_dow_skew{}'.format(n_day): 'skew'})
    result = left_merge(label, result, on=['air_genre_name', 'dow']).fillna(0)
    return result

def get_genre_week_exp_feat(label, key, n_day):
    start_date = date_add_days(key[0], -n_day)
    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()
    data_temp['visit_date'] = data_temp['visit_date'].apply(lambda x: diff_of_days(key[0], x))
    data_temp['weight'] = data_temp['visit_date'].apply(lambda x: 0.985 ** x)
    data_temp['visitors'] = data_temp['visitors'] * data_temp['weight']
    result1 = data_temp.groupby(['air_genre_name', 'dow'], as_index=False)['visitors'].agg(
        {'genre_dow_exp_mean{}'.format(n_day): 'sum'})
    result2 = data_temp.groupby(['air_genre_name', 'dow'], as_index=False)['weight'].agg(
        {'genre_dow_exp_weight_sum{}'.format(n_day): 'sum'})
    result = result1.merge(result2, on=['air_genre_name', 'dow'], how='left')
    result['genre_dow_exp_mean{}'.format(n_day)] = result['genre_dow_exp_mean{}'.format(n_day)] / result['genre_dow_exp_weight_sum{}'.format(n_day)]
    result = left_merge(label, result, on=['air_genre_name', 'dow']).fillna(0)
    return result

# Days since each store's first and most recent observed visit
def get_first_last_time(label, key, n_day):
    start_date = date_add_days(key[0], -n_day)
    data_temp = data[(data.visit_date < key[0]) & (data.visit_date > start_date)].copy()
    data_temp = data_temp.sort_values('visit_date')
    result = data_temp.groupby('store_id')['visit_date'].agg(
        {'first_time': lambda x: diff_of_days(key[0], np.min(x)),
         'last_time': lambda x: diff_of_days(key[0], np.max(x))})
    result = left_merge(label, result, on=['store_id']).fillna(0)
    return result

# Reservation features from air_reserve / hpg_reserve
def get_reserve_feat(label, key):
    label_end_date = date_add_days(key[0], key[1])
    air_reserve_temp = air_reserve[(air_reserve.visit_date >= key[0]) &        # key[0] is '2017-04-23'
                                   (air_reserve.visit_date < label_end_date) & # label_end_date is '2017-05-31'
                                   (air_reserve.reserve_date < key[0])].copy()
    # Lead time between making the reservation and the visit, in days
    # (computed on the filtered frame; the original code relied on index alignment
    # with the full frame and merged air_store twice)
    air_reserve_temp['diff_time'] = (pd.to_datetime(air_reserve_temp['visit_datetime'])
                                     - pd.to_datetime(air_reserve_temp['reserve_datetime'])).dt.days
    air_reserve_temp = air_reserve_temp.merge(air_store, on='store_id', how='left')
    air_result = air_reserve_temp.groupby(['store_id', 'visit_date'])['reserve_visitors'].agg(
        {'air_reserve_visitors': 'sum', 'air_reserve_count': 'count'})
    air_store_diff_time_mean = air_reserve_temp.groupby(['store_id', 'visit_date'])['diff_time'].agg(
        {'air_store_diff_time_mean': 'mean'})
    air_diff_time_mean = air_reserve_temp.groupby(['visit_date'])['diff_time'].agg(
        {'air_diff_time_mean': 'mean'})
    air_result = air_result.unstack().fillna(0).stack()
    air_date_result = air_reserve_temp.groupby(['visit_date'])['reserve_visitors'].agg(
        {'air_date_visitors': 'sum', 'air_date_count': 'count'})
    hpg_reserve_temp = hpg_reserve[(hpg_reserve.visit_date >= key[0]) &
                                   (hpg_reserve.visit_date < label_end_date) &
                                   (hpg_reserve.reserve_date < key[0])].copy()
    hpg_reserve_temp['diff_time'] = (pd.to_datetime(hpg_reserve_temp['visit_datetime'])
                                     - pd.to_datetime(hpg_reserve_temp['reserve_datetime'])).dt.days
    hpg_result = hpg_reserve_temp.groupby(['store_id', 'visit_date'])['reserve_visitors'].agg(
        {'hpg_reserve_visitors': 'sum', 'hpg_reserve_count': 'count'})
    hpg_result = hpg_result.unstack().fillna(0).stack()
    hpg_date_result = hpg_reserve_temp.groupby(['visit_date'])['reserve_visitors'].agg(
        {'hpg_date_visitors': 'sum', 'hpg_date_count': 'count'})
    hpg_store_diff_time_mean = hpg_reserve_temp.groupby(['store_id', 'visit_date'])['diff_time'].agg(
        {'hpg_store_diff_time_mean': 'mean'})
    hpg_diff_time_mean = hpg_reserve_temp.groupby(['visit_date'])['diff_time'].agg(
        {'hpg_diff_time_mean': 'mean'})
    air_result = left_merge(label, air_result, on=['store_id', 'visit_date']).fillna(0)
    air_store_diff_time_mean = left_merge(label, air_store_diff_time_mean, on=['store_id', 'visit_date']).fillna(0)
    hpg_result = left_merge(label, hpg_result, on=['store_id', 'visit_date']).fillna(0)
    hpg_store_diff_time_mean = left_merge(label, hpg_store_diff_time_mean, on=['store_id', 'visit_date']).fillna(0)
    air_date_result = left_merge(label, air_date_result, on=['visit_date']).fillna(0)
    air_diff_time_mean = left_merge(label, air_diff_time_mean, on=['visit_date']).fillna(0)
    hpg_date_result = left_merge(label, hpg_date_result, on=['visit_date']).fillna(0)
    hpg_diff_time_mean = left_merge(label, hpg_diff_time_mean, on=['visit_date']).fillna(0)
    result = pd.concat([air_result, hpg_result, air_date_result, hpg_date_result,
                        air_store_diff_time_mean, hpg_store_diff_time_mean,
                        air_diff_time_mean, hpg_diff_time_mean], axis=1)
    return result

# Second-order features: ratios between window means
def second_feat(result):
    result['store_mean_14_28_rate'] = result['store_mean14'] / (result['store_mean28'] + 0.01)
    result['store_mean_28_56_rate'] = result['store_mean28'] / (result['store_mean56'] + 0.01)
    result['store_mean_56_1000_rate'] = result['store_mean56'] / (result['store_mean1000'] + 0.01)
    result['genre_mean_28_56_rate'] = result['genre_mean28'] / (result['genre_mean56'] + 0.01)
    result['genre_mean_56_1000_rate'] = result['genre_mean56'] / (result['genre_mean1000'] + 0.01)
    return result
```
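The decay-weighted mean in get_store_exp_visitor_feat (and its genre counterpart) reduces to a weighted average in which a visit d days before the cutoff gets weight 0.985**d. A tiny worked example with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({
    'store_id': ['s1', 's1', 's1'],
    'days_before': [1, 2, 3],        # days before the feature cutoff
    'visitors': [10.0, 20.0, 30.0],  # (log1p-transformed in the real pipeline)
})
df['weight'] = 0.985 ** df['days_before']
df['weighted'] = df['visitors'] * df['weight']

# Sum of weighted values divided by sum of weights, per store
exp_mean = df.groupby('store_id')['weighted'].sum() / df.groupby('store_id')['weight'].sum()
# Slightly below the plain mean of 20, since the larger, older values decay more
```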
(5) Extract the features and train

```python
# import pdb
def make_feats(end_date, n_day):
    t0 = time.time()
    key = end_date, n_day
    print('data key: {}'.format(key))
    print('add label')
    label = get_label(end_date, n_day)
    print('make feature...')
    result = [label]
    result.append(get_store_visitor_feat(label, key, 1000))      # store features
    # pdb.set_trace()
    result.append(get_store_visitor_feat(label, key, 56))        # store features
    result.append(get_store_visitor_feat(label, key, 28))        # store features
    result.append(get_store_visitor_feat(label, key, 14))        # store features
    result.append(get_store_exp_visitor_feat(label, key, 1000))  # store exp features
    result.append(get_store_week_feat(label, key, 1000))         # store dow features
    result.append(get_store_week_feat(label, key, 56))           # store dow features
    result.append(get_store_week_feat(label, key, 28))           # store dow features
    result.append(get_store_week_feat(label, key, 14))           # store dow features
    result.append(get_store_week_diff_feat(label, key, 58))      # store dow diff features
    result.append(get_store_week_diff_feat(label, key, 1000))    # store dow diff features
    result.append(get_store_all_week_feat(label, key, 1000))     # store all week feat
    result.append(get_store_week_exp_feat(label, key, 1000))     # store dow exp feat
    result.append(get_store_holiday_feat(label, key, 1000))      # store holiday feat
    result.append(get_genre_visitor_feat(label, key, 1000))      # genre feature
    result.append(get_genre_visitor_feat(label, key, 56))        # genre feature
    result.append(get_genre_visitor_feat(label, key, 28))        # genre feature
    result.append(get_genre_exp_visitor_feat(label, key, 1000))  # genre feature
    result.append(get_genre_week_feat(label, key, 1000))         # genre dow feature
    result.append(get_genre_week_feat(label, key, 56))           # genre dow feature
    result.append(get_genre_week_feat(label, key, 28))           # genre dow feature
    result.append(get_genre_week_exp_feat(label, key, 1000))     # genre dow exp feature
    result.append(get_reserve_feat(label, key))                  # air_reserve
    result.append(get_first_last_time(label, key, 1000))         # first time and last time
    result.append(label)
    print('merge...')
    result = my_concat(result)
    result = second_feat(result)
    print('data shape: {}'.format(result.shape))
    print('spending {}s'.format(time.time() - t0))
    return result

train_feat = pd.DataFrame()
start_date = '2017-03-12'
for i in range(58):
    train_feat_sub = make_feats(date_add_days(start_date, i * (-7)), 39)
    train_feat = pd.concat([train_feat, train_feat_sub])
for i in range(1, 6):
    train_feat_sub = make_feats(date_add_days(start_date, i * 7), 42 - (i * 7))
    train_feat = pd.concat([train_feat, train_feat_sub])
test_feat = make_feats(date_add_days(start_date, 42), 39)

predictors = [f for f in test_feat.columns if f not in
              ['id', 'store_id', 'visit_date', 'end_date', 'air_area_name', 'visitors', 'month']]

import datetime
import lightgbm as lgb

params = {
    'learning_rate': 0.02,
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'sub_feature': 0.7,
    'num_leaves': 60,
    'min_data': 100,
    'min_hessian': 1,
    'verbose': -1,
}

t0 = time.time()
lgb_train = lgb.Dataset(train_feat[predictors], train_feat['visitors'])
lgb_test = lgb.Dataset(test_feat[predictors], test_feat['visitors'])
gbm = lgb.train(params, lgb_train, 2300, verbose_eval=100)
pred = gbm.predict(test_feat[predictors])
print('Training took {}s'.format(time.time() - t0))
subm = pd.DataFrame({'id': test_feat.store_id + '_' + test_feat.visit_date,
                     'visitors': np.expm1(pred)})
subm = submission[['id']].merge(subm, on='id', how='left').fillna(0)
subm.to_csv('./sub{}.csv'.format(datetime.datetime.now().strftime('%Y%m%d_%H%M%S')),
            index=False, float_format='%.4f')
```
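The training loop above builds 58 overlapping 39-day windows, stepping the cutoff back one week at a time from 2017-03-12, plus five shrinking forward windows; the test cutoff sits 42 days after the start date. The cutoff dates can be checked directly:

```python
from datetime import timedelta
from dateutil.parser import parse

# Same helper as earlier, repeated so the snippet runs standalone
def date_add_days(start_date, days):
    return (parse(start_date[:10]) + timedelta(days=days)).strftime('%Y-%m-%d')

start_date = '2017-03-12'
backward_cutoffs = [date_add_days(start_date, i * -7) for i in range(58)]
test_cutoff = date_add_days(start_date, 42)

print(backward_cutoffs[:3])  # ['2017-03-12', '2017-03-05', '2017-02-26']
print(test_cutoff)           # 2017-04-23
```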
(6) Kaggle submission result: roughly around 20th place; some hyperparameter tuning may still be needed. This appears to be the author's final private-leaderboard code. Overall leaderboard result:
Summary
(1) The author extracted just about every feature available, including same-period historical trend features, difference features, and assorted statistical features. What seems to be missing are differences against the same weekday in history, i.e. the change relative to the previous Wednesday (I will try adding these when I have time).
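The suggested missing feature, the week-over-week change for the same weekday, could be sketched like this (synthetic data; the column name is illustrative, not from the solution):

```python
import pandas as pd

# Four consecutive Wednesdays for one store (values are made up)
df = pd.DataFrame({
    'store_id': ['s1'] * 4,
    'visit_date': ['2017-03-01', '2017-03-08', '2017-03-15', '2017-03-22'],
    'visitors': [10.0, 14.0, 13.0, 18.0],
})
df['dow'] = pd.to_datetime(df['visit_date']).dt.dayofweek
df = df.sort_values('visit_date')

# This Wednesday minus last Wednesday, per store and weekday
df['same_dow_diff'] = df.groupby(['store_id', 'dow'])['visitors'].diff()
print(df['same_dow_diff'].tolist())  # [nan, 4.0, -1.0, 5.0]
```

Window statistics (mean, std, ...) over `same_dow_diff` could then be fed in alongside the existing day-over-day diff features.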
(2) The author's way of slicing the data into labeled windows is worth learning: the final window that falls short of a full cycle is not discarded. The figure below shows the construction.
(3) The competition's evaluation metric was not used as the training objective; doing so might improve the result.
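If one wanted to train on raw visitor counts instead of the log1p-transformed target, the squared-log error could in principle be supplied to LightGBM as a custom objective. A hedged sketch of the gradient and hessian such an objective would need (the function name and clipping are my own, not from the solution):

```python
import numpy as np

def rmsle_grad_hess(preds, labels):
    # Gradient and hessian of 0.5 * (log1p(pred) - log1p(label))**2 w.r.t. pred,
    # the per-sample pieces a LightGBM custom objective returns
    preds = np.maximum(preds, 0)  # guard log1p against negative raw predictions
    diff = np.log1p(preds) - np.log1p(labels)
    grad = diff / (preds + 1)
    hess = (1 - diff) / (preds + 1) ** 2
    return grad, hess
```

Note that with the log1p target already in place, the plain RMSE objective used in the solution is effectively optimizing RMSLE, so the gain from a custom objective may be small.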