This time I took a look at xfeat, a tool for feature engineering.
Reference article:
First, install it.
!pip install git+https://github.com/pfnet-research/xfeat.git
Next, I loaded the dataset I always use with AutoGluon.
import pandas as pd
from xfeat import (
    SelectCategorical,
    LabelEncoder,
    Pipeline,
    ConcatCombination,
    SelectNumerical,
    ArithmeticCombinations,
    TargetEncoder,
    aggregation,
    GBDTFeatureSelector,
    GBDTFeatureExplorer,
)
df = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
Let's look at the first 10 rows.
df.head(10)
age workclass fnlwgt education education-num marital-status \
0 25 Private 178478 Bachelors 13 Never-married
1 23 State-gov 61743 5th-6th 3 Never-married
2 46 Private 376789 HS-grad 9 Never-married
3 55 ? 200235 HS-grad 9 Married-civ-spouse
4 36 Private 224541 7th-8th 4 Married-civ-spouse
5 51 Private 178054 Some-college 10 Married-civ-spouse
6 33 Private 263561 Some-college 10 Married-civ-spouse
7 46 Private 173613 HS-grad 9 Divorced
8 18 Private 214617 Some-college 10 Never-married
9 43 Private 84661 Assoc-voc 11 Married-civ-spouse
occupation relationship race sex capital-gain \
0 Tech-support Own-child White Female 0
1 Transport-moving Not-in-family White Male 0
2 Other-service Not-in-family White Male 0
3 ? Husband White Male 0
4 Handlers-cleaners Husband White Male 0
5 Sales Husband White Male 0
6 Craft-repair Husband White Male 0
7 Adm-clerical Not-in-family White Female 0
8 Handlers-cleaners Own-child White Male 0
9 Sales Husband White Male 0
capital-loss hours-per-week native-country class
0 0 40 United-States <=50K
1 0 35 United-States <=50K
2 0 15 United-States <=50K
3 0 50 United-States >50K
4 0 40 El-Salvador <=50K
5 0 40 ? >50K
6 0 60 United-States >50K
7 0 40 United-States <=50K
8 0 30 United-States <=50K
9 0 45 United-States <=50K
SelectCategorical shows only the columns that hold string (categorical) data.
SelectCategorical().fit_transform(df).head(5)
workclass education marital-status occupation \
0 Private Bachelors Never-married Tech-support
1 State-gov 5th-6th Never-married Transport-moving
2 Private HS-grad Never-married Other-service
3 ? HS-grad Married-civ-spouse ?
4 Private 7th-8th Married-civ-spouse Handlers-cleaners
relationship race sex native-country class
0 Own-child White Female United-States <=50K
1 Not-in-family White Male United-States <=50K
2 Not-in-family White Male United-States <=50K
3 Husband White Male United-States >50K
4 Husband White Male El-Salvador <=50K
SelectNumerical shows only the columns that hold numeric data.
SelectNumerical().fit_transform(df).head(5)
age fnlwgt education-num capital-gain capital-loss hours-per-week
0 25 178478 13 0 0 40
1 23 61743 3 0 0 35
2 46 376789 9 0 0 15
3 55 200235 9 0 0 50
4 36 224541 4 0 0 40
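For comparison, roughly the same split can be done with plain pandas. This is only a sketch of the idea, not a description of how xfeat implements these selectors. The small `sample_df` below is made up from a few values in the table above, and treating "everything that is not a number" as categorical is my own simplification (it would also pick up datetime or boolean columns).

import pandas as pd

# A tiny stand-in for the real dataset (hypothetical sample, not the full CSV).
sample_df = pd.DataFrame(
    {
        "age": [25, 23, 46],
        "workclass": ["Private", "State-gov", "Private"],
        "education-num": [13, 3, 9],
        "class": ["<=50K", "<=50K", "<=50K"],
    }
)

# Numeric columns: roughly what SelectNumerical returned above.
numeric_df = sample_df.select_dtypes(include="number")

# Everything else: roughly what SelectCategorical returned above.
categorical_df = sample_df.select_dtypes(exclude="number")

print(list(numeric_df.columns))
print(list(categorical_df.columns))

The convenience of the xfeat selectors is that they plug straight into a Pipeline, as in the next step.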
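Next, convert the strings to numbers. You can specify columns to exclude, and choose whether to overwrite the original columns.
Setting output_suffix="" overwrites the original columns.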
encoder = Pipeline([SelectCategorical(exclude_cols=['education']),LabelEncoder(output_suffix="")])
encoded_df = encoder.fit_transform(df)
encoded_df.head(3)
workclass marital-status occupation relationship race sex \
0 0 0 0 0 0 0
1 1 0 1 1 0 1
2 0 0 2 1 0 1
native-country class
0 0 0
1 0 0
2 0 0
If you pass a suffix string to output_suffix instead, the encoded values are added as new columns and the original columns are kept.
encoder = Pipeline([SelectCategorical(exclude_cols=['education']),LabelEncoder(output_suffix="_en")])
encoded_df = encoder.fit_transform(df)
encoded_df.head(3)
workclass marital-status occupation relationship race \
0 Private Never-married Tech-support Own-child White
1 State-gov Never-married Transport-moving Not-in-family White
2 Private Never-married Other-service Not-in-family White
sex native-country class workclass_en marital-status_en \
0 Female United-States <=50K 0 0
1 Male United-States <=50K 1 0
2 Male United-States <=50K 0 0
occupation_en relationship_en race_en sex_en native-country_en \
0 0 0 0 0 0
1 1 1 0 1 0
2 2 1 0 1 0
class_en
0 0
1 0
2 0
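To make the two output_suffix behaviours concrete, here is a minimal sketch in plain Python. It assumes that labels are numbered in order of first appearance, which is consistent with the output above (Private becomes 0, State-gov becomes 1) but is not something I verified in the xfeat source. The function names `label_encode` and `encode_table` are hypothetical helpers for illustration, not part of xfeat.

def label_encode(values):
    """Map each distinct value to an integer, numbered by first appearance."""
    mapping = {}
    # setdefault stores len(mapping) only when the value is new.
    return [mapping.setdefault(v, len(mapping)) for v in values]


def encode_table(table, output_suffix=""):
    """Encode every column; an empty suffix overwrites, otherwise columns are added."""
    result = dict(table)  # shallow copy, so the input dict itself is not modified
    for name, values in table.items():
        result[name + output_suffix] = label_encode(values)
    return result


# Hypothetical sample taken from the first three rows shown earlier.
table = {
    "workclass": ["Private", "State-gov", "Private"],
    "sex": ["Female", "Male", "Male"],
}

overwritten = encode_table(table, output_suffix="")
appended = encode_table(table, output_suffix="_en")

print(overwritten)
print(list(appended))

With an empty suffix the new column name equals the old one, so the strings are replaced. With "_en" the name differs, so the originals survive and the encoded columns are appended after them.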
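xfeat can also build combinations of categorical columns. With r=2, every pair of columns is concatenated into a new column. Doing this by hand is tedious because the number of combinations grows quickly.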
encoder = Pipeline([SelectCategorical(exclude_cols=['education']), ConcatCombination(output_suffix="_re", r=2),])
encoded_df = encoder.fit_transform(df)
encoded_df.head(3)
workclass marital-status occupation relationship race \
0 Private Never-married Tech-support Own-child White
1 State-gov Never-married Transport-moving Not-in-family White
2 Private Never-married Other-service Not-in-family White
sex native-country class workclassmarital-status_re \
0 Female United-States <=50K Private Never-married
1 Male United-States <=50K State-gov Never-married
2 Male United-States <=50K Private Never-married
workclassoccupation_re ... relationshiprace_re \
0 Private Tech-support ... Own-child White
1 State-gov Transport-moving ... Not-in-family White
2 Private Other-service ... Not-in-family White
relationshipsex_re relationshipnative-country_re relationshipclass_re \
0 Own-child Female Own-child United-States Own-child <=50K
1 Not-in-family Male Not-in-family United-States Not-in-family <=50K
2 Not-in-family Male Not-in-family United-States Not-in-family <=50K
racesex_re racenative-country_re raceclass_re sexnative-country_re \
0 White Female White United-States White <=50K Female United-States
1 White Male White United-States White <=50K Male United-States
2 White Male White United-States White <=50K Male United-States
sexclass_re native-countryclass_re
0 Female <=50K United-States <=50K
1 Male <=50K United-States <=50K
2 Male <=50K United-States <=50K
[3 rows x 36 columns]
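The column count above is easy to check: 8 categorical columns remain after excluding education, there are C(8, 2) = 28 pairs, and 8 + 28 = 36. Below is a minimal sketch of the same idea using itertools.combinations. The function `concat_combinations` and its `sep` parameter are hypothetical, and I use a space as the separator only so the values look like the output above. I have not confirmed how xfeat joins the values internally.

from itertools import combinations


def concat_combinations(table, r=2, output_suffix="_re", sep=" "):
    """Build one new column per r-sized combination of the input columns."""
    new_columns = {}
    for names in combinations(table, r):
        # New column name: the source names joined together, plus the suffix.
        new_name = "".join(names) + output_suffix
        # Row-wise concatenation of the values of the chosen columns.
        rows = zip(*(table[name] for name in names))
        new_columns[new_name] = [sep.join(row) for row in rows]
    return new_columns


# Hypothetical sample: the first two rows of the eight categorical columns.
table = {
    "workclass": ["Private", "State-gov"],
    "marital-status": ["Never-married", "Never-married"],
    "occupation": ["Tech-support", "Transport-moving"],
    "relationship": ["Own-child", "Not-in-family"],
    "race": ["White", "White"],
    "sex": ["Female", "Male"],
    "native-country": ["United-States", "United-States"],
    "class": ["<=50K", "<=50K"],
}

combined = concat_combinations(table, r=2)

print(len(combined))                # number of pair columns
print(len(table) + len(combined))   # total columns, originals included
print(combined["workclassmarital-status_re"])

Raising r to 3 would give C(8, 3) = 56 new columns, which shows how quickly this grows and why I would not want to write these out by hand.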
Many people have already written detailed articles about xfeat, so I'll stop here for now. Personally, I'm most interested in articles that combine it with AutoGluon or LightGBM.