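Hello! This is Nomura from the Consulting Services Division.

In this post, I use featuretools and TPOT, two tools for automating machine learning,
to tackle the bento (boxed lunch) demand forecasting problem on SIGNATE.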

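What is SIGNATE?

Kaggle is the best-known machine learning competition site (a platform where companies and governments post problems as competitions
and participants compete for prize money by building the best model);
SIGNATE is its Japanese counterpart.

Various Japanese companies and government bodies have already posted competitions there,
and several sample problems are available as practice exercises for machine learning.

This time I'll try the bento demand forecasting problem from among them.
The task description reads as follows.

The task is to build a model that predicts the number of bento sold on the cafe floor of a company in Yonbancho, Chiyoda-ku.
To increase bento sales and reduce waste, production must be planned on the basis of accurate demand forecasts.
In this competition, you estimate the optimal quantity of bento from several variables such as the day of the week and the menu, contributing to the bento shop, its customers, and the environment.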

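Environment: Google Colaboratory

I ran everything on Google Colab.
It is a cloud service that offers Jupyter Notebook free of charge, with the main packages commonly used for machine learning preinstalled,
so you can start experimenting with machine learning right away. GPUs and even TPUs are also available for free.

Note, however, that the session is disconnected every 90 minutes and the instance is reset after 12 hours,
so you need workarounds such as keeping your files on Google Drive.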

Trying it out

Let's actually work through it.

Initial setup

First, download the target data (sample.csv, test.csv, train.csv) from SIGNATE and store it in Google Drive.

Next, mount Google Drive in Colab.

In [1]:

from google.colab import drive
drive.mount('/content/drive')

In [2]:

# Check the data (指定フォルダ is a placeholder for your own folder name)
!ls "/content/drive/My Drive/指定フォルダ"

sample.csv  test.csv  train.csv

In [3]:

# Import libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
sns.set(font="IPAGothic",style="white")
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.metrics import mean_squared_error as MSE
from sklearn.ensemble import RandomForestRegressor

In [4]:

# Load the data
train = pd.read_csv('./drive/My Drive/指定フォルダ/train.csv',index_col=['datetime'])
test = pd.read_csv('./drive/My Drive/指定フォルダ/test.csv',index_col=['datetime'])
sample = pd.read_csv('./drive/My Drive/指定フォルダ/sample.csv', header=None)
print(train.shape,test.shape,sample.shape)

(207, 11) (40, 10) (40, 2)

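The dataset used here looks like this.

Column	Header	Data type	Description
0	datetime	datetime	Date used as the index (yyyy-m-d)
1	y	int	Number sold (target variable)
2	week	char	Day of the week (Monday to Friday)
3	soldout	boolean	Sold-out flag (0: not sold out, 1: sold out)
4	name	varchar	Main menu item
5	kcal	int	Calories of the side dishes (kcal); contains missing values
6	remarks	varchar	Remarks
7	event	varchar	In-house events starting at 13:00 to which bento may be brought
8	payday	boolean	Payday flag (1: payday)
9	weather	varchar	Weather
10	precipitation	float	Precipitation; "--" when there is none
11	temperature	float	Temperature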

In [5]:

# Inspect the first rows
train.head()

Out[5]:

datetime	y	week	soldout	name	kcal	remarks	event	payday	weather	precipitation	temperature
2013-11-18	90	月	0	厚切りイカフライ	NaN	NaN	NaN	NaN	快晴	--	19.8
2013-11-19	101	火	1	手作りヒレカツ	NaN	NaN	NaN	NaN	快晴	--	17.0
2013-11-20	118	水	0	白身魚唐揚げ野菜あん	NaN	NaN	NaN	NaN	快晴	--	15.5
2013-11-21	120	木	1	若鶏ピリ辛焼	NaN	NaN	NaN	NaN	快晴	--	15.2
2013-11-22	130	金	1	ビッグメンチカツ	NaN	NaN	NaN	NaN	快晴	--	16.1

In [6]:

# Check summary statistics
train.describe()

Out[6]:

	y	soldout	kcal	payday	temperature
count	207.000000	207.000000	166.000000	10.0	207.000000
mean	86.623188	0.449275	404.409639	1.0	19.252174
std	32.882448	0.498626	29.884641	0.0	8.611365
min	29.000000	0.000000	315.000000	1.0	1.200000
25%	57.000000	0.000000	386.000000	1.0	11.550000
50%	78.000000	0.000000	408.500000	1.0	19.800000
75%	113.000000	1.000000	426.000000	1.0	26.100000
max	171.000000	1.000000	462.000000	1.0	34.600000

Preprocessing

Let's preprocess the data.

In [7]:

# Combine the training and test data (t_flg marks which set each row came from)
train['t_flg'] = 1
test['t_flg'] = 0
all_train = pd.concat([train,test], axis = 0, sort = False)
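
The combine-then-split pattern in the cell above can be sketched on toy data as follows. The frames and values below are made up for illustration and are not the SIGNATE data; the point is only to show why `y` ends up as float64 with missing values once the test rows (which have no target) are appended.

```python
import pandas as pd

# Toy stand-ins for train/test (values are made up for illustration)
toy_train = pd.DataFrame({"y": [90, 101], "kcal": [400.0, None]},
                         index=["2013-11-18", "2013-11-19"])
toy_test = pd.DataFrame({"kcal": [420.0]}, index=["2014-10-01"])

# Flag each row's origin so the combined frame can be split again later
toy_train["t_flg"] = 1
toy_test["t_flg"] = 0
combined = pd.concat([toy_train, toy_test], axis=0, sort=False)

# Test rows have no target, so y is NaN there and the column becomes float
train_part = combined[combined["t_flg"] == 1]
test_part = combined[combined["t_flg"] == 0]
print(combined.dtypes)
```

Preprocessing the combined frame keeps the dummy columns identical between the two parts, which is presumably why the rows are concatenated before encoding.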

In [8]:

# Check for missing values
all_train.info()

<class 'pandas.core.frame.DataFrame'>
  Index: 247 entries, 2013-11-18 to 2014-11-28
  Data columns (total 12 columns):
  y                207 non-null float64
  week             247 non-null object
  soldout          247 non-null int64
  name             247 non-null object
  kcal             202 non-null float64
  remarks          28 non-null object
  event            17 non-null object
  payday           12 non-null float64
  weather          247 non-null object
  precipitation    247 non-null object
  temperature      247 non-null float64
  t_flg            247 non-null int64
  dtypes: float64(4), int64(2), object(6)
  memory usage: 25.1+ KB

Fill in the missing values.

In [9]:

# kcal: to be filled with the mean (check the statistics first)
all_train['kcal'].describe()

Out[9]:

  count    202.000000
  mean     407.381188
  std       28.396942
  min      315.000000
  25%      395.000000
  50%      412.000000
  75%      427.000000
  max      462.000000
  Name: kcal, dtype: float64

In [10]:

all_train['kcal'] = all_train['kcal'].fillna(all_train['kcal'].mean())

In [11]:

# remarks: fill with 'なし' ("none")
all_train['remarks'] = all_train['remarks'].fillna('なし')

In [12]:

# event: fill with 'なし' ("none")
all_train['event'] = all_train['event'].fillna('なし')

In [13]:

# payday: fill with 0
all_train['payday'] = all_train['payday'].fillna(0)
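
As a side note, the four fills above could also be written as a single `fillna` call with a per-column dictionary. The following is a minimal sketch on a made-up toy frame (not the bento data); the fill values mirror the choices in the cells above.

```python
import numpy as np
import pandas as pd

# Toy frame with the same four columns that contain missing values
toy = pd.DataFrame({
    "kcal": [400.0, np.nan, 420.0],
    "remarks": [np.nan, "手作りの味", np.nan],
    "event": [np.nan, np.nan, "ママの会"],
    "payday": [np.nan, 1.0, np.nan],
})

# One fill value per column: mean for kcal, 'なし' for text, 0 for the flag
fill_values = {
    "kcal": toy["kcal"].mean(),
    "remarks": "なし",
    "event": "なし",
    "payday": 0,
}
filled = toy.fillna(fill_values)
print(filled)
```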

In [14]:

# Check for missing values again
all_train.info()

<class 'pandas.core.frame.DataFrame'>
  Index: 247 entries, 2013-11-18 to 2014-11-28
  Data columns (total 12 columns):
  y                207 non-null float64
  week             247 non-null object
  soldout          247 non-null int64
  name             247 non-null object
  kcal             247 non-null float64
  remarks          247 non-null object
  event            247 non-null object
  payday           247 non-null float64
  weather          247 non-null object
  precipitation    247 non-null object
  temperature      247 non-null float64
  t_flg            247 non-null int64
  dtypes: float64(4), int64(2), object(6)
  memory usage: 25.1+ KB

In [15]:

all_train.head()

Out[15]:

datetime	y	week	soldout	name	kcal	remarks	event	weather	precipitation	temperature	t_flg
2013-11-18	90.0	月	0	厚切りイカフライ	407.381188	なし	なし	快晴	--	19.8	1
2013-11-19	101.0	火	1	手作りヒレカツ	407.381188	なし	なし	快晴	--	17.0	1
2013-11-20	118.0	水	0	白身魚唐揚げ野菜あん	407.381188	なし	なし	快晴	--	15.5	1
2013-11-21	120.0	木	1	若鶏ピリ辛焼	407.381188	なし	なし	快晴	--	15.2	1
2013-11-22	130.0	金	1	ビッグメンチカツ	407.381188	なし	なし	快晴	--	16.1	1

Check the contents of the categorical variables.

In [16]:

all_train['week'].unique()

Out[16]:

array(['月', '火', '水', '木', '金'], dtype=object)

In [17]:

all_train['name'].unique()

Out[17]:

array(['厚切りイカフライ', '手作りヒレカツ', '白身魚唐揚げ野菜あん', '若鶏ピリ辛焼', 'ビッグメンチカツ', '鶏の唐揚',
         '豚のスタミナ炒め', 'ボローニャ風カツ', 'ハンバーグ', 'タルタルinソーセージカツ', 'マーボ豆腐',
         '厚揚げ豚生姜炒め', 'クリームチーズ入りメンチ', '鶏のカッシュナッツ炒め', '手作りロースカツ',
         'ハンバーグデミソース', 'やわらかロースのサムジョン', '五目御飯', '肉じゃが', 'タンドリーチキン',
         'カキフライタルタル', '回鍋肉', 'ポーク味噌焼き', '鶏の唐揚げ甘酢あん', 'さっくりメンチカツ',
         '手ごね風ハンバーグ', '酢豚', 'カレー入りソーセージカツ', '豚肉の生姜焼', '鶏チリソース',
         '鶏の照り焼きマスタード', 'さんま辛味焼', 'カレイ唐揚げ野菜あんかけ', 'ジューシーメンチカツ', 'サバ焼味噌掛け',
         '手作りひれかつとカレー', '鶏のレモンペッパー焼orカレー', 'チンジャオロース', '海老フライタルタル',
         'チーズ入りメンチカツ', '鶏の唐揚げ', 'メダイ照り焼', 'ハンバーグカレーソース', 'さわら焼味噌掛け',
         '鶏のピリ辛焼き', 'ホタテクリ―ムシチュー', '鶏の唐揚げおろしソース', 'ますのマスタードソース', 'ロース甘味噌焼き',
         '海老フライとホタテ串カツ', 'ハンバーグ和風きのこソース', '酢豚orカレー', 'ポークハヤシ',
         '白身魚唐揚げ野菜あんかけ', '手作りひれかつ', 'メンチカツ', 'チキンクリームシチュー', '海老クリーミ―クノーデル',
         'ビーフカレー', 'カレイ野菜あんかけ', 'チーズ入りハンバーグ', '越冬キャベツのメンチカツ', '鶏の親子煮',
         '肉団子クリームシチュー', 'キーマカレー', '青椒肉絲', '和風ソースハンバーグ', '青梗菜牛肉炒め',
         '肉団子のシチュー', 'チキンカレー', 'ビーフトマト煮', 'ポーク生姜焼き', '牛丼風煮', '鶏の味噌漬け焼き',
         '牛肉筍煮', '鶏の照り焼きマヨ', '中華丼', '豚味噌メンチカツ', 'マーボ茄子', '鶏の天ぷら',
         '手作りチキンカツ', 'きのこソースハンバーグ', '白身魚唐揚げ野菜餡かけ', 'ポークカレー', '豚肉と茄子のピリ辛炒め',
         'チーズハンバーグ', 'サーモンのムニエル2色ソース', '牛肉コロッケ', '牛肉すき焼き風', 'いか天ぷら',
         'ハンバーグケッチャップソース', 'ゴーヤチャンプルー', 'たっぷりベーコンフライ', '牛肉ニンニクの芽炒め',
         'カレイ唐揚げ野菜餡かけ', 'チャプチェ', '牛すき焼き風', 'ポークソテー韓国ソース', 'ビーフストロガノフ',
         'アジ唐揚げ南蛮ソース', '炊き込みご飯', '鶏のトマトシチュー', '豚の冷しゃぶ', 'キスと野菜の天ぷら', '牛丼',
         '鶏の塩から揚げ', 'カレイ唐揚げ夏野菜あん', '白身魚ムニエル', '手作りトンカツ', '和風ハンバーグ',
         'かじきの甘辛煮', 'チキンのコーンクリーム焼き', 'プルコギ', '鶏のから揚げねぎ塩炒めソース', '豚冷シャブ野菜添え',
         '白身魚フライ', '豚すき焼き', 'エビフライ', '八宝菜', 'ジャンボチキンカツ', 'ひやしたぬきうどん・炊き込みご飯',
         '豚肉のマスタード焼き', 'バーベキューチキン', '鶏のから揚げスイートチリソース', '豚肉の生姜焼き',
         'ハンバーグ（デミきのこバター）', '鶏肉のカレー唐揚', '豚キムチ炒め', 'チキン香草焼きマスタードソース',
         'サーモンフライ・タルタル', '厚切ハムカツ', '洋食屋さんのメンチカツ', '牛スキヤキ', '豚ロースのピザ風チーズ焼き',
         'チキン南蛮', 'ロコモコ丼', '白身魚の南部焼き', 'カレイの唐揚げ', '豚肉の胡麻シャブ', 'チキンの辛味噌焼き',
         'ビーフシチュー', '名古屋味噌カツ', '親子煮', 'チキンステーキ・きのこソース', '鶏肉の山賊焼き',
         'ぶりレモンペッパー焼き', 'チーズメンチカツ', 'チキンフリカッセ', 'カレイ唐揚げ 甘酢あん', '厚切イカフライ',
         '筑前煮', '白身魚のマスタード焼き', '牛カルビ焼き肉', 'ランチビュッフェ', '豚肉と玉子の炒め',
         '鶏肉とカシューナッツ炒め', '麻婆春雨', '厚揚げ肉みそ炒め', '完熟トマトのホットカレー', '若鶏梅肉包揚げ',
         'ミックスグリル', '豚肉と白菜の中華炒め', 'ヒレカツ', '豚柳川', '麻婆豆腐', '唐揚げ丼', 'マス塩焼き',
         '鶏肉と野菜の黒胡椒炒め', '彩り野菜と鶏肉の黒酢あん', 'ポークのバーベキューソテー', '鶏肉の唐揚げ',
         '白身魚味噌焼き', 'エビフライ・エビカツ', '野菜ごろごろシチュー', 'ベルギー風チキンのクリーム煮', 'スタミナ炒め',
         'なすと挽肉のはさみ揚げ', '鶏肉の治部煮風', '牛丼風', '鶏肉のスイートチリソース'], dtype=object)

In [18]:

all_train['remarks'].unique()

Out[18]:

array(['なし', '鶏のレモンペッパー焼（50食）、カレー（42食）', '酢豚（28食）、カレー（85食）', 'お楽しみメニュー',
         '料理長のこだわりメニュー', '手作りの味', 'スペシャルメニュー（800円）', '近隣に飲食店複合ビルオープン'],
        dtype=object)

In [19]:

all_train['event'].unique()

Out[19]:

array(['なし', 'ママの会', 'キャリアアップ支援セミナー'], dtype=object)

In [20]:

all_train['weather'].unique()

Out[20]:

array(['快晴', '曇', '晴れ', '薄曇', '雨', '雪', '雷電'], dtype=object)

In [21]:

all_train['precipitation'].unique()

Out[21]:

array(['--', '0.5', '0', '1.5', '1', '6', '6.5', '2.5'], dtype=object)

Convert the categorical variables to dummy variables. (name has too many distinct values, so it is left out for now.)

In [22]:

ctg_col = ['week','remarks','event','weather','precipitation']

all_train_ctg = pd.get_dummies(all_train, columns = ctg_col, drop_first=False)

In [23]:

all_train_ctg.info()

<class 'pandas.core.frame.DataFrame'>
  Index: 247 entries, 2013-11-18 to 2014-11-28
  Data columns (total 38 columns):
  y                                   207 non-null float64
  soldout                             247 non-null int64
  name                                247 non-null object
  kcal                                247 non-null float64
  payday                              247 non-null float64
  temperature                         247 non-null float64
  t_flg                               247 non-null int64
  week_月                              247 non-null uint8
  week_木                              247 non-null uint8
  week_水                              247 non-null uint8
  week_火                              247 non-null uint8
  week_金                              247 non-null uint8
  remarks_お楽しみメニュー                    247 non-null uint8
  remarks_なし                          247 non-null uint8
  remarks_スペシャルメニュー（800円）             247 non-null uint8
  remarks_手作りの味                       247 non-null uint8
  remarks_料理長のこだわりメニュー                247 non-null uint8
  remarks_近隣に飲食店複合ビルオープン              247 non-null uint8
  remarks_酢豚（28食）、カレー（85食）            247 non-null uint8
  remarks_鶏のレモンペッパー焼（50食）、カレー（42食）    247 non-null uint8
  event_なし                            247 non-null uint8
  event_キャリアアップ支援セミナー                 247 non-null uint8
  event_ママの会                          247 non-null uint8
  weather_快晴                          247 non-null uint8
  weather_晴れ                          247 non-null uint8
  weather_曇                           247 non-null uint8
  weather_薄曇                          247 non-null uint8
  weather_雨                           247 non-null uint8
  weather_雪                           247 non-null uint8
  weather_雷電                          247 non-null uint8
  precipitation_--                    247 non-null uint8
  precipitation_0                     247 non-null uint8
  precipitation_0.5                   247 non-null uint8
  precipitation_1                     247 non-null uint8
  precipitation_1.5                   247 non-null uint8
  precipitation_2.5                   247 non-null uint8
  precipitation_6                     247 non-null uint8
  precipitation_6.5                   247 non-null uint8
  dtypes: float64(4), int64(2), object(1), uint8(31)
  memory usage: 22.9+ KB
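
The column table lists precipitation as a float, yet it is dummy-encoded here because the "--" placeholder makes pandas read it as strings. One alternative, which this post does not use, would be to convert it to a numeric column. The sketch below assumes that "--" can be treated as 0 (no measurable precipitation); that interpretation is my assumption, not something stated by the data provider.

```python
import pandas as pd

# A few of the string values seen in all_train['precipitation'].unique()
precip = pd.Series(["--", "0.5", "0", "1.5", "6.5"], name="precipitation")

# Assumption: "--" means no measurable precipitation, so map it to 0
precip_numeric = pd.to_numeric(precip.replace("--", "0"))
print(precip_numeric.tolist())
```

A numeric column would let a model use the ordering of the values, at the cost of losing the distinction between "--" and "0".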

In [24]:

# Prepare the training data
drop_col = ['y','name','t_flg']
X_train = all_train_ctg[all_train_ctg['t_flg']==1].drop(drop_col,axis=1)
y_train = all_train_ctg[all_train_ctg['t_flg']==1]['y'].copy()

Feature selection with random forest, then model building

First, I select features with a random forest and build a model with linear regression.

In [25]:

# Configure the random forest
randomforest = RandomForestRegressor(n_estimators=100,max_depth=4,random_state=777)

In [26]:

randomforest.fit(X_train,y_train)

Out[26]:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=4,
                        max_features='auto', max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=100,
                        n_jobs=None, oob_score=False, random_state=777, verbose=0,
                        warm_start=False)

In [27]:

# Sort by importance in descending order and display
features = X_train.columns
importances = randomforest.feature_importances_

print(sorted(zip(map(lambda x: round(x, 2), importances), features),
             reverse=True))

   [(0.69, 'temperature'), (0.14, 'kcal'), (0.11, 'remarks_お楽しみメニュー'), (0.02, 'week_金'),
    (0.01, 'week_木'), (0.01, 'week_月'),(0.01, 'weather_曇'), (0.0, 'week_火'), (0.0, 'week_水'),]
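
The same ranking can be produced a little more readably with a pandas Series. This is a minimal sketch on synthetic data generated with `make_regression` (the feature names `f0` to `f4` are placeholders, not the bento columns), so the numbers it prints are unrelated to the output above.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for X_train / y_train
X, y = make_regression(n_samples=200, n_features=5, n_informative=2, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])

model = RandomForestRegressor(n_estimators=50, max_depth=4, random_state=777)
model.fit(X, y)

# Label each importance with its column name, then sort descending
ranking = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranking.round(2))
```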

Let's check the most important features, temperature, kcal, and remarks_お楽しみメニュー, in plots.

In [28]:

# Check temperature with a scatter plot
sns.jointplot(x='temperature',y='y',data=train)

   /usr/local/lib/python3.6/dist-packages/matplotlib/font_manager.py:1241: UserWarning: findfont:
    Font family ['IPAGothic'] not found. Falling back to DejaVu Sans.
    (prop.get_family(), self.defaultFamily[fontext]))

Out[28]:

<seaborn.axisgrid.JointGrid at 0x7f98043a8320>

In [29]:

# Check kcal (calories) with a scatter plot
sns.jointplot(x='kcal',y='y',data=train)

Out[29]:

<seaborn.axisgrid.JointGrid at 0x7f9801a39390>

In [30]:

# Check remarks with a box plot
fig, ax = plt.subplots(1,1,figsize=(12,7))
sns.boxplot(x='remarks',y='y',data=train)
ax.set_xticklabels(ax.get_xticklabels(),rotation=30)

Out[30]:

  [Text(0, 0, '鶏のレモンペッパー焼（50食）、カレー（42食）'),
   Text(0, 0, '酢豚（28食）、カレー（85食）'),
   Text(0, 0, 'お楽しみメニュー'),
   Text(0, 0, '料理長のこだわりメニュー'),
   Text(0, 0, '手作りの味'),
   Text(0, 0, 'スペシャルメニュー（800円）')]

The Japanese labels were rendered as "tofu" (empty boxes).
It seems the font needs to be installed.
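
One way to confirm that the font is the problem is to ask matplotlib which font families it has registered. The helper `has_font` below is a hypothetical name introduced for this sketch; it only reads matplotlib's own font list and does not install anything.

```python
from matplotlib import font_manager

def has_font(family):
    """Return True if matplotlib has a font registered under this family name."""
    registered = {font.name for font in font_manager.fontManager.ttflist}
    return family in registered

# The result depends on the machine: False means the font is not registered
print(has_font("IPAGothic"))
```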

In [31]:

# Install the font
!apt-get -y install fonts-ipafont-gothic

  Reading package lists... Done
  Building dependency tree
  Reading state information... Done
  The following package was automatically installed and is no longer required:
    libnvidia-common-410
  Use 'apt autoremove' to remove it.
  The following additional packages will be installed:
    fonts-ipafont-mincho
  The following NEW packages will be installed:
    fonts-ipafont-gothic fonts-ipafont-mincho
  0 upgraded, 2 newly installed, 0 to remove and 7 not upgraded.
  Need to get 8,251 kB of archives.
  After this operation, 28.7 MB of additional disk space will be used.
  Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 fonts-ipafont-gothic all 00303-18ubuntu1 [3,526 kB]
  Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 fonts-ipafont-mincho all 00303-18ubuntu1 [4,725 kB]
  Fetched 8,251 kB in 1s (7,344 kB/s)
  Selecting previously unselected package fonts-ipafont-gothic.
  (Reading database ... 131331 files and directories currently installed.)
  Preparing to unpack .../fonts-ipafont-gothic_00303-18ubuntu1_all.deb ...
  Unpacking fonts-ipafont-gothic (00303-18ubuntu1) ...
  Selecting previously unselected package fonts-ipafont-mincho.
  Preparing to unpack .../fonts-ipafont-mincho_00303-18ubuntu1_all.deb ...
  Unpacking fonts-ipafont-mincho (00303-18ubuntu1) ...
  Setting up fonts-ipafont-gothic (00303-18ubuntu1) ...
  update-alternatives: using /usr/share/fonts/opentype/ipafont-gothic/ipag.ttf
  to provide /usr/share/fonts/truetype/fonts-japanese-gothic.ttf (fonts-japanese-gothic.ttf) in auto mode
  Setting up fonts-ipafont-mincho (00303-18ubuntu1) ...
  update-alternatives: using /usr/share/fonts/opentype/ipafont-mincho/ipam.ttf
  to provide /usr/share/fonts/truetype/fonts-japanese-mincho.ttf (fonts-japanese-mincho.ttf) in auto mode
  Processing triggers for fontconfig (2.12.6-0ubuntu2) ...

In [32]:

# Locate the matplotlib cache
import matplotlib
matplotlib.get_cachedir()

Out[32]:

'/root/.cache/matplotlib'

In [33]:

# Delete the cache
!rm -r /root/.cache/matplotlib/

Restart the runtime and render the plot again.

In [34]:

fig, ax = plt.subplots(1,1,figsize=(12,7))
sns.boxplot(x='remarks',y='y',data=train)
ax.set_xticklabels(ax.get_xticklabels(),rotation=30)

Out[34]:

  [Text(0, 0, '鶏のレモンペッパー焼（50食）、カレー（42食）'),
   Text(0, 0, '酢豚（28食）、カレー（85食）'),
   Text(0, 0, 'お楽しみメニュー'),
   Text(0, 0, '料理長のこだわりメニュー'),
   Text(0, 0, '手作りの味'),
   Text(0, 0, 'スペシャルメニュー（800円）')]

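Now it renders correctly.
The お楽しみメニュー ("fun menu") remark in particular appears to be strongly correlated with the number sold.

So, let's train a linear regression using 'temperature', 'kcal', and 'remarks_お楽しみメニュー' as features.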

In [35]:

X_train = X_train[['temperature','kcal','remarks_お楽しみメニュー']]

In [36]:

# Build the linear regression model
linear = LinearRegression()
linear.fit(X_train,y_train)

Out[36]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [37]:

# Predict (on the training data)
pred = linear.predict(X_train)

In [38]:

# Compute the RMSE (root mean squared error) between the actual values (y_train) and the predictions (pred)
print("RMSE",np.sqrt(MSE(y_train,pred)))

RMSE 22.308887339834545
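
For reference, RMSE is the square root of the mean of the squared errors, which is what `np.sqrt(MSE(...))` computes. A minimal sketch of the same calculation in plain NumPy, with a hypothetical helper name `rmse` and made-up numbers:

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Errors are -3, 4, 0, so the mean squared error is 25/3
print(rmse([90, 101, 118], [93, 97, 118]))
```

Because RMSE is in the same unit as the target, the 22.3 above can be read as being off by roughly 22 bento per day on average.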

In [39]:

# Plot the linear regression predictions against the actual values
p = pd.DataFrame({"actual":y_train,"pred":pred})
p.plot(figsize=(13,3))

Out[39]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fe0c560e198>
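
Note that `pred` is computed on the same rows the model was fitted on, so the RMSE above is a training error. One common way to estimate the error on unseen rows is k-fold cross-validation. The sketch below uses synthetic data from `make_regression` as a stand-in for the bento features, so its output is not comparable to the figures in this post.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / y_train (not the bento data)
X, y = make_regression(n_samples=200, n_features=3, noise=20.0, random_state=0)

# scikit-learn reports negative MSE so that "higher is better" holds for every scorer
neg_mse = cross_val_score(LinearRegression(), X, y, cv=5,
                          scoring="neg_mean_squared_error")
cv_rmse = np.sqrt(-neg_mse.mean())
print("CV RMSE", cv_rmse)
```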

Building a model with featuretools and TPOT

Next, I build a model using featuretools for feature engineering and TPOT for algorithm selection.

Let's start with featuretools.

In [40]:

# Import the library
import featuretools as ft

# Reset the index (datetime becomes a regular column)
all_train_ctg = all_train_ctg.reset_index()

# Create the EntitySet
es = ft.EntitySet(id='entityset')

# Add the entity (there is no 'index' column, so featuretools creates one)
es = es.entity_from_dataframe(entity_id='train',dataframe=all_train_ctg,index='index')

2019-07-29 04:40:10,581 featuretools.entityset - WARNING    index index not found in dataframe, creating new integer column

In [41]:

# Specify which features to generate (transform primitives)
trans_primitives = ["year", "month","day","weekday"]

# Generate the features
feature_matrix, features_defs = ft.dfs(entityset=es,
                                       target_entity="train",
                                       trans_primitives=trans_primitives,
                                       max_depth=1
                                      )
# Inspect the generated features
feature_matrix.info()

<class 'pandas.core.frame.DataFrame'>
  Int64Index: 247 entries, 0 to 246
  Data columns (total 42 columns):
  y                                   207 non-null float64
  soldout                             247 non-null int64
  name                                247 non-null object
  kcal                                247 non-null float64
  payday                              247 non-null float64
  temperature                         247 non-null float64
  t_flg                               247 non-null int64
  week_月                              247 non-null uint8
  week_木                              247 non-null uint8
  week_水                              247 non-null uint8
  week_火                              247 non-null uint8
  week_金                              247 non-null uint8
  remarks_お楽しみメニュー                    247 non-null uint8
  remarks_なし                          247 non-null uint8
  remarks_スペシャルメニュー（800円）             247 non-null uint8
  remarks_手作りの味                       247 non-null uint8
  remarks_料理長のこだわりメニュー                247 non-null uint8
  remarks_近隣に飲食店複合ビルオープン              247 non-null uint8
  remarks_酢豚（28食）、カレー（85食）            247 non-null uint8
  remarks_鶏のレモンペッパー焼（50食）、カレー（42食）    247 non-null uint8
  event_なし                            247 non-null uint8
  event_キャリアアップ支援セミナー                 247 non-null uint8
  event_ママの会                          247 non-null uint8
  weather_快晴                          247 non-null uint8
  weather_晴れ                          247 non-null uint8
  weather_曇                           247 non-null uint8
  weather_薄曇                          247 non-null uint8
  weather_雨                           247 non-null uint8
  weather_雪                           247 non-null uint8
  weather_雷電                          247 non-null uint8
  precipitation_--                    247 non-null uint8
  precipitation_0                     247 non-null uint8
  precipitation_0.5                   247 non-null uint8
  precipitation_1                     247 non-null uint8
  precipitation_1.5                   247 non-null uint8
  precipitation_2.5                   247 non-null uint8
  precipitation_6                     247 non-null uint8
  precipitation_6.5                   247 non-null uint8
  YEAR(datetime)                      247 non-null int64
  MONTH(datetime)                     247 non-null int64
  DAY(datetime)                       247 non-null int64
  WEEKDAY(datetime)                   247 non-null int64
  dtypes: float64(4), int64(6), object(1), uint8(31)
  memory usage: 30.6+ KB

The new features YEAR, MONTH, DAY, and WEEKDAY were generated automatically from the original datetime column.
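
To make it concrete what these four primitives compute, here is a rough pandas equivalent using the `.dt` accessor. This is only an illustration of the resulting values, not how featuretools works internally; the three dates are taken from the dataset's index range.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2013-11-18", "2013-11-22", "2014-11-28"],
                                 name="datetime"))

# Column names mimic the ones featuretools generated above
date_features = pd.DataFrame({
    "YEAR(datetime)": dates.dt.year,
    "MONTH(datetime)": dates.dt.month,
    "DAY(datetime)": dates.dt.day,
    "WEEKDAY(datetime)": dates.dt.weekday,  # Monday=0 ... Sunday=6
})
print(date_features)
```

Since the data only covers Monday to Friday, WEEKDAY carries the same information as the existing week_* dummy columns.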

Next, I use TPOT to automatically select the best algorithm and tune its hyperparameters.

In [42]:

# Install the package
!pip install tpot

In [43]:

# Import the library
from tpot import TPOTRegressor

In [44]:

# Configure TPOT
tpot = TPOTRegressor(generations=10, population_size=50, verbosity=2, random_state=700)

# Prepare the training data
drop_col = ['y','name','t_flg']
X_train = feature_matrix[feature_matrix['t_flg']==1].drop(drop_col,axis=1)
y_train = feature_matrix[feature_matrix['t_flg']==1]['y'].copy()

In [45]:

tpot.fit(X_train,y_train)

  Generation 1 - Current best internal CV score: -327.8795512538697
  Generation 2 - Current best internal CV score: -327.2082473582626
  Generation 3 - Current best internal CV score: -323.1962201374389
  Generation 4 - Current best internal CV score: -318.6683588161063
  Generation 5 - Current best internal CV score: -318.6683588161063
  Generation 6 - Current best internal CV score: -283.4378829187741
  Generation 7 - Current best internal CV score: -283.4378829187741
  Generation 8 - Current best internal CV score: -283.4378829187741
  Generation 9 - Current best internal CV score: -283.4378829187741
  Generation 10 - Current best internal CV score: -283.4378829187741

  Best pipeline: XGBRegressor(input_matrix, learning_rate=0.5, max_depth=4, min_child_weight=20,
  n_estimators=100, nthread=1, objective=reg:squarederror, subsample=0.5)

Out[45]:

TPOTRegressor(config_dict=None, crossover_rate=0.1, cv=5,
                disable_update_check=False, early_stop=None, generations=10,
                max_eval_time_mins=5, max_time_mins=None, memory=None,
                mutation_rate=0.9, n_jobs=1, offspring_size=None,
                periodic_checkpoint_folder=None, population_size=50,
                random_state=700, scoring=None, subsample=1.0, template=None,
                use_dask=False, verbosity=2, warm_start=False)

It took about two to three minutes, but XGBoost was selected as the best algorithm,
and the model was built with its hyperparameters tuned automatically as well.
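
The "internal CV score" in the log can be turned into an RMSE for a rough sense of scale. This assumes TPOT is using its default regression scoring, negative mean squared error (the `scoring=None` in Out[45] suggests the default was not overridden), so treat the conversion as a sketch under that assumption.

```python
import math

# Best internal CV score copied from the TPOT log (assumed to be negative MSE)
best_cv_score = -283.4378829187741

# Flip the sign to get MSE, then take the square root to get RMSE
cv_rmse = math.sqrt(-best_cv_score)
print(round(cv_rmse, 2))  # 16.84
```

Under that assumption, the cross-validated error of the selected pipeline is about 16.8 bento per day.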

In [46]:

# Predict (on the training data)
pred = tpot.predict(X_train)

In [47]:

# Compute the RMSE between the actual values (y_train) and the predictions (pred)
print("RMSE",np.sqrt(MSE(y_train,pred)))

RMSE 9.074974604234542

In [48]:

# Plot the TPOT pipeline's predictions against the actual values
p = pd.DataFrame({"actual":y_train,"pred":pred})
p.plot(figsize=(13,3))

Out[48]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fe0b41a0908>

With featuretools and TPOT, the RMSE on the training data dropped to 9.07 (from 22.31 with linear regression),
and the plot also shows a clear improvement.

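Summary

That was a quick walkthrough of using featuretools and TPOT, two tools for automating machine learning,
to forecast bento demand for the SIGNATE problem.

The tutorial goes further: it cuts weakly related data, performs feature engineering based on EDA,
and uses ensemble learning and cross-validation for evaluation.
To push accuracy further in real-world work, I felt that I need to build up more experience with machine learning
and more knowledge of the business domain.

I'll keep working at it every day!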