Multi-Class for GBDT leaf encoding
According to the documentation, gbdt.apply returns an array of shape (n_samples, n_estimators, n_classes):
Returns
-------
X_leaves : array-like, shape (n_samples, n_estimators, n_classes)
    For each datapoint x in X and for each tree in the ensemble,
    return the index of the leaf x ends up in each estimator.
    In the case of binary classification n_classes is 1.
Because the example is a binary classification problem, the returned array has shape (n_samples, n_estimators, 1),
so gdbt.apply(val_X)[:, :, 0] gives a 2-D matrix of shape (n_samples, n_estimators), which is then fed to the one-hot encoder.
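To make the binary case concrete, here is a minimal sketch. The dataset size, `n_estimators=10`, and `random_state=0` are arbitrary choices for illustration and are not taken from the original example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

# Small binary problem; all sizes here are assumptions made for the illustration.
X, y = make_classification(n_samples=200, n_classes=2, random_state=0)
gbdt = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)

leaves = gbdt.apply(X)        # shape (n_samples, n_estimators, 1) for binary targets
leaves_2d = leaves[:, :, 0]   # drop the trailing axis -> (n_samples, n_estimators)

# Each tree becomes one group of one-hot columns, one column per leaf seen in fit.
encoded = OneHotEncoder().fit_transform(leaves_2d)
print(leaves.shape, leaves_2d.shape, encoded.shape)
```

Every encoded row should contain exactly one active column per tree, since a sample lands in exactly one leaf of each tree.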
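For a multi-class problem, however, say with 3 classes, gbdt.apply returns an array of shape (n_samples, n_estimators, 3).
The approach I have come up with so far is to reshape the result of gbdt.apply to (n_samples, n_estimators * 3) and then feed it to the one-hot encoder.
My question: for multi-class problems, I am not sure whether this treatment is reasonable, or whether there is a better approach in practice.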
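One way to sanity-check the reshape idea is to confirm that each flattened column still corresponds to exactly one tree. The sketch below assumes NumPy's default C-order reshape. The dataset, `n_estimators=10`, and the probed indices `j, k` are arbitrary, and `handle_unknown="ignore"` is an optional safeguard that is not part of the original code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=300, n_classes=3, n_informative=3, random_state=0)
gbdt = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)

leaves = gbdt.apply(X)                    # (n_samples, n_estimators, n_classes)
n_samples, n_estimators, n_classes = leaves.shape
flat = leaves.reshape(n_samples, -1)      # (n_samples, n_estimators * n_classes)

# With C-order reshape, column j * n_classes + k of `flat` holds the leaf
# indices of the tree for boosting stage j and class k.
j, k = 4, 2
assert np.array_equal(flat[:, j * n_classes + k], leaves[:, j, k])

# Optional safeguard (an assumption, not in the original): ignore leaf indices
# that were never seen when the encoder was fitted.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(flat)
print(flat.shape, encoded.shape)
```

Because every column of the flattened matrix is a single tree, the one-hot encoder treats each (stage, class) tree as its own categorical feature, which is the same thing the binary slicing does with one tree per stage.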
Here is the code I used for testing:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

gdbt = GradientBoostingClassifier()
# Note: the `multi_class` argument is deprecated in recent scikit-learn releases.
lr = LogisticRegression(solver='lbfgs', max_iter=1000, multi_class='ovr')

# 3-class toy dataset, 80/20 train/test split
x, y = make_classification(n_samples=1000, n_classes=3, n_informative=3)
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=9453)

gdbt.fit(train_x, train_y)
onehot = OneHotEncoder(categories='auto')
print(f"gdbt.apply(train_x).shape: {gdbt.apply(train_x).shape}\n-")

# Flatten (n_samples, n_estimators, n_classes) to (n_samples, n_estimators * n_classes),
# then one-hot encode the leaf indices
enc_train_x = onehot.fit_transform(gdbt.apply(train_x).reshape(train_x.shape[0], -1))
enc_test_x = onehot.transform(gdbt.apply(test_x).reshape(test_x.shape[0], -1))
print(f"train_x: {train_x.shape}, enc_train_x: {enc_train_x.shape},\n"
      f"test_x: {test_x.shape}, enc_test_x: {enc_test_x.shape}\n-")

# Train logistic regression on the encoded leaves and compare with the plain GBDT
lr.fit(enc_train_x, train_y)
lr_score = lr.score(enc_test_x, test_y)
gdbt_score = gdbt.score(test_x, test_y)
print(f"Score: gdbt={gdbt_score}, gdbt_lr={lr_score}")
Output:
gdbt.apply(train_x).shape: (800, 100, 3)
-
train_x: (800, 20), enc_train_x: (800, 2198),
test_x: (200, 20), enc_test_x: (200, 2198)
-
Score: gdbt=0.895, gdbt_lr=0.89
Answers
-
2019/09/29 09:09 PM | 李志鴻 | Upvotes: 0 | Downvotes: 0 | Comments: 0
The code block is not very readable, so I have attached a full screenshot.
-
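2019/10/01 12:54 PM | 陳明佑 (Ming You Chen) | Upvotes: 0 | Downvotes: 1 | Comments: 0
Basically, when leaf encoding meets a three-class problem,
the three trees in each boosting round model different probability distributions,
so all three trees need to be encoded. Your approach is fine.
Binary classification encodes only one tree because the two class outcomes are completely correlated:
with 100% correlation, encoding both trees would be pointless.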
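The point about one tree versus three trees can be checked on a fitted model through its `estimators_` attribute, which holds the regression trees for every boosting stage. A minimal sketch follows, where `trees_per_stage` is a hypothetical helper and the dataset sizes and `n_estimators=5` are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

def trees_per_stage(n_classes):
    """Fit a small GBDT and return the shape of its array of fitted trees."""
    X, y = make_classification(
        n_samples=300, n_classes=n_classes, n_informative=3, random_state=0
    )
    gbdt = GradientBoostingClassifier(n_estimators=5, random_state=0).fit(X, y)
    # estimators_ has shape (number of boosting stages, trees per stage)
    return gbdt.estimators_.shape

binary_shape = trees_per_stage(2)
multi_shape = trees_per_stage(3)
print(binary_shape, multi_shape)   # (5, 1) (5, 3)
```

The binary model keeps a single tree per stage, while the three-class model keeps one tree per class per stage, which matches the trailing dimension of the array returned by apply.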