Trying out xfeat

Yuichiro Minato

2021/07/29 13:58

#xfeat

This time I took a look at xfeat, a tool for feature engineering.

Reference articles:
https://acro-engineer.hatenablog.com/entry/2020/12/15/120000
https://zenn.dev/atfujita/articles/ca5d39425f5520dc9719

First, install it.

!pip install git+https://github.com/pfnet-research/xfeat.git

Next, I loaded the dataset I usually use with AutoGluon.

import pandas as pd
from xfeat import SelectCategorical, LabelEncoder, Pipeline, ConcatCombination, SelectNumerical, ArithmeticCombinations, TargetEncoder, aggregation, GBDTFeatureSelector, GBDTFeatureExplorer

df = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')

Let's look at the first 10 rows.

df.head(10)
   age   workclass  fnlwgt      education  education-num       marital-status  \
0   25     Private  178478      Bachelors             13        Never-married   
1   23   State-gov   61743        5th-6th              3        Never-married   
2   46     Private  376789        HS-grad              9        Never-married   
3   55           ?  200235        HS-grad              9   Married-civ-spouse   
4   36     Private  224541        7th-8th              4   Married-civ-spouse   
5   51     Private  178054   Some-college             10   Married-civ-spouse   
6   33     Private  263561   Some-college             10   Married-civ-spouse   
7   46     Private  173613        HS-grad              9             Divorced   
8   18     Private  214617   Some-college             10        Never-married   
9   43     Private   84661      Assoc-voc             11   Married-civ-spouse   

           occupation    relationship    race      sex  capital-gain  \
0        Tech-support       Own-child   White   Female             0   
1    Transport-moving   Not-in-family   White     Male             0   
2       Other-service   Not-in-family   White     Male             0   
3                   ?         Husband   White     Male             0   
4   Handlers-cleaners         Husband   White     Male             0   
5               Sales         Husband   White     Male             0   
6        Craft-repair         Husband   White     Male             0   
7        Adm-clerical   Not-in-family   White   Female             0   
8   Handlers-cleaners       Own-child   White     Male             0   
9               Sales         Husband   White     Male             0   

   capital-loss  hours-per-week  native-country   class  
0             0              40   United-States   <=50K  
1             0              35   United-States   <=50K  
2             0              15   United-States   <=50K  
3             0              50   United-States    >50K  
4             0              40     El-Salvador   <=50K  
5             0              40               ?    >50K  
6             0              60   United-States    >50K  
7             0              40   United-States   <=50K  
8             0              30   United-States   <=50K  
9             0              45   United-States   <=50K  

SelectCategorical picks out only the columns that hold string data.

SelectCategorical().fit_transform(df).head(5)
    workclass   education       marital-status          occupation  \
0     Private   Bachelors        Never-married        Tech-support   
1   State-gov     5th-6th        Never-married    Transport-moving   
2     Private     HS-grad        Never-married       Other-service   
3           ?     HS-grad   Married-civ-spouse                   ?   
4     Private     7th-8th   Married-civ-spouse   Handlers-cleaners   

     relationship    race      sex  native-country   class  
0       Own-child   White   Female   United-States   <=50K  
1   Not-in-family   White     Male   United-States   <=50K  
2   Not-in-family   White     Male   United-States   <=50K  
3         Husband   White     Male   United-States    >50K  
4         Husband   White     Male     El-Salvador   <=50K  

SelectNumerical picks out only the columns that hold numeric data.

SelectNumerical().fit_transform(df).head(5)
   age  fnlwgt  education-num  capital-gain  capital-loss  hours-per-week
0   25  178478             13             0             0              40
1   23   61743              3             0             0              35
2   46  376789              9             0             0              15
3   55  200235              9             0             0              50
4   36  224541              4             0             0              40
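For comparison, here is a minimal sketch of a similar split using plain pandas. It is not how xfeat implements the selectors; `df_small` is a hypothetical stand-in built from a few of the values shown above, and treating every non-numeric column as categorical is an assumption that only holds for data like this.

```python
import pandas as pd

# Hypothetical stand-in for the dataset above (values copied from the first rows).
df_small = pd.DataFrame({
    "age": [25, 23, 46],
    "workclass": ["Private", "State-gov", "Private"],
    "fnlwgt": [178478, 61743, 376789],
    "class": ["<=50K", "<=50K", "<=50K"],
})

# Numeric columns only, similar in spirit to SelectNumerical.
numeric_part = df_small.select_dtypes(include="number")

# Everything that is not numeric, similar in spirit to SelectCategorical.
# Assumption: the frame has no datetime or boolean columns, which would
# also end up here.
categorical_part = df_small.select_dtypes(exclude="number")

print(list(numeric_part.columns))
print(list(categorical_part.columns))
```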

Next, convert the strings to numbers. You can specify columns to exclude and choose whether to overwrite the original columns.
Setting output_suffix="" overwrites them.

encoder = Pipeline([SelectCategorical(exclude_cols=['education']),LabelEncoder(output_suffix="")])
encoded_df = encoder.fit_transform(df)
encoded_df.head(3)
   workclass  marital-status  occupation  relationship  race  sex  \
0          0               0           0             0     0    0   
1          1               0           1             1     0    1   
2          0               0           2             1     0    1   

   native-country  class  
0               0      0  
1               0      0  
2               0      0  
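The codes in the output above are consistent with numbering each distinct value in the order it first appears (Private is 0, State-gov is 1, and so on). Below is a minimal pure-Python sketch of that idea. The helper `label_encode` is hypothetical, and whether xfeat's LabelEncoder always numbers values this way is an assumption.

```python
def label_encode(values):
    """Map each distinct value to an integer in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            # The next unused integer is simply the number of values seen so far.
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


# First five workclass values from the table above.
workclass = ["Private", "State-gov", "Private", "?", "Private"]
encoded, mapping = label_encode(workclass)
print(encoded)
print(mapping)
```

Note that "?" simply gets its own code here, so missing-value markers like this one are treated as just another category.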

Passing a non-empty string as output_suffix instead adds the encoded values as new columns.

encoder = Pipeline([SelectCategorical(exclude_cols=['education']),LabelEncoder(output_suffix="_en")])
encoded_df = encoder.fit_transform(df)
encoded_df.head(3)
    workclass  marital-status         occupation    relationship    race  \
0     Private   Never-married       Tech-support       Own-child   White   
1   State-gov   Never-married   Transport-moving   Not-in-family   White   
2     Private   Never-married      Other-service   Not-in-family   White   

       sex  native-country   class  workclass_en  marital-status_en  \
0   Female   United-States   <=50K             0                  0   
1     Male   United-States   <=50K             1                  0   
2     Male   United-States   <=50K             0                  0   

   occupation_en  relationship_en  race_en  sex_en  native-country_en  \
0              0                0        0       0                  0   
1              1                1        0       1                  0   
2              2                1        0       1                  0   

   class_en  
0         0  
1         0  
2         0  
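How the suffix decides between overwriting and adding columns can be sketched as follows. This is only an illustration of the naming rule seen in the two outputs, not xfeat's implementation; `encode_table` and the dict-of-lists table are hypothetical.

```python
def encode_table(table, output_suffix):
    """Label-encode every column and store it as '<column><output_suffix>'."""
    # Start from a copy of the original columns.
    result = {name: list(values) for name, values in table.items()}
    for name, values in table.items():
        codes = {}
        # With an empty suffix the target name equals the source name,
        # so the original column is overwritten.
        result[name + output_suffix] = [
            codes.setdefault(value, len(codes)) for value in values
        ]
    return result


table = {
    "workclass": ["Private", "State-gov", "Private"],
    "sex": ["Female", "Male", "Male"],
}

overwritten = encode_table(table, "")
added = encode_table(table, "_en")
print(list(overwritten))
print(list(added))
```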

It also appears to support combining categorical columns. Doing this by hand is tedious because the number of combinations grows quickly.

encoder = Pipeline([SelectCategorical(exclude_cols=['education']), ConcatCombination(output_suffix="_re", r=2),])
encoded_df = encoder.fit_transform(df)
encoded_df.head(3)
    workclass  marital-status         occupation    relationship    race  \
0     Private   Never-married       Tech-support       Own-child   White   
1   State-gov   Never-married   Transport-moving   Not-in-family   White   
2     Private   Never-married      Other-service   Not-in-family   White   

       sex  native-country   class workclassmarital-status_re  \
0   Female   United-States   <=50K      Private Never-married   
1     Male   United-States   <=50K    State-gov Never-married   
2     Male   United-States   <=50K      Private Never-married   

        workclassoccupation_re  ...   relationshiprace_re  \
0         Private Tech-support  ...       Own-child White   
1   State-gov Transport-moving  ...   Not-in-family White   
2        Private Other-service  ...   Not-in-family White   

    relationshipsex_re relationshipnative-country_re  relationshipclass_re  \
0     Own-child Female       Own-child United-States       Own-child <=50K   
1   Not-in-family Male   Not-in-family United-States   Not-in-family <=50K   
2   Not-in-family Male   Not-in-family United-States   Not-in-family <=50K   

      racesex_re racenative-country_re  raceclass_re   sexnative-country_re  \
0   White Female   White United-States   White <=50K   Female United-States   
1     White Male   White United-States   White <=50K     Male United-States   
2     White Male   White United-States   White <=50K     Male United-States   

     sexclass_re native-countryclass_re  
0   Female <=50K    United-States <=50K  
1     Male <=50K    United-States <=50K  
2     Male <=50K    United-States <=50K  

[3 rows x 36 columns]
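The 36 columns are the 8 original categorical columns plus one new column for each of the 28 possible pairs. A minimal sketch of the same idea with `itertools.combinations` is shown below. The helper `concat_combinations` is hypothetical, and two things are assumptions: that values are joined with no separator, and that the raw values carry a leading space (which would explain the space inside values such as "Private Never-married" above).

```python
from itertools import combinations

columns = ["workclass", "marital-status", "occupation", "relationship",
           "race", "sex", "native-country", "class"]

# First row of the table above. Assumption: each raw value starts with a space.
row = {
    "workclass": " Private",
    "marital-status": " Never-married",
    "occupation": " Tech-support",
    "relationship": " Own-child",
    "race": " White",
    "sex": " Female",
    "native-country": " United-States",
    "class": " <=50K",
}


def concat_combinations(row, columns, r=2, suffix="_re"):
    """Add one concatenated column per r-sized combination of columns."""
    out = dict(row)
    for combo in combinations(columns, r):
        name = "".join(combo) + suffix          # e.g. "workclassmarital-status_re"
        out[name] = "".join(row[c] for c in combo)
    return out


combined = concat_combinations(row, columns)
print(len(combined))
print(combined["workclassmarital-status_re"])
```

Raising r makes the column count grow quickly, so it is worth checking the number of combinations before applying this to a wide table.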

Many people have already written detailed articles about xfeat, so I'll stop here for now. Personally, I'm interested in articles that combine it with AutoGluon or LightGBM.
