Question about using LabelEncoder

2020/03/01 05:24 PM

Machine Learning Co-learning Discussion Board

陳維仁

ml100-4

ml100-4-d06

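First, here is a screenshot from the sample code.

I'd like to ask why label.fit is usually applied only to the training data. I checked the documentation for .fit, which only says "Fit label encoder". Judging from the examples, .fit(y) seems to work as if .fit() creates an empty dictionary and y supplies the values that fill it. Here is the link: https://www.itread01.com/content/1503307229.html

Question 1: What does .fit do?

Question 2: Why is .fit applied only to the training data?

Question 3: Regarding question 2, I don't quite understand the description of case 2 on Stack Overflow. Why is it redundant when training and testing contain the same kinds of labels?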

https://stackoverflow.com/questions/52279902/sklearn-why-labelencoder-fit-only-to-training-data

Thank you very much.

Answers

2020/03/01 08:08 PM

Allen

Upvotes: 5

Downvotes: 0

Comments: 1

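LabelEncoder() assigns a numeric label to each distinct element.

Question 1: What does .fit do?

       .fit learns the mapping: it collects the distinct elements so that each one can be given a numeric label.

Question 2: Why is .fit applied only to the training data?

        Because we assume that every category appearing in the test data also appears in the training data. In most cases this should hold.

Question 3:

        Since the training data already covers every category in the test data, there is no need to combine the two sets before calling .fit().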

Suppose:

from sklearn.preprocessing import LabelEncoder

Y1 = ['dog','cat','horse']

Y2 = ['dog','cat','dog','horse','cat']

LabelEncoder1 = LabelEncoder().fit(Y1)

LabelEncoder2 = LabelEncoder().fit(Y2)

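At this point LabelEncoder1 and LabelEncoder2 behave identically: both assign the same label to each element.

LabelEncoder2 is therefore redundant; it only processes the repeated values and takes longer to run. That is why there is no need to fit on the training data and test data combined.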
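The claim above can be checked directly. The following is a minimal sketch, assuming scikit-learn is installed; the names encoder1, encoder2, classes1, classes2 and same_mapping are introduced here for illustration only. It compares the classes_ attribute that each fitted encoder exposes.

```python
from sklearn.preprocessing import LabelEncoder

Y1 = ['dog', 'cat', 'horse']
Y2 = ['dog', 'cat', 'dog', 'horse', 'cat']

# Fit one encoder on the short list and one on the list with repeats.
encoder1 = LabelEncoder().fit(Y1)
encoder2 = LabelEncoder().fit(Y2)

# classes_ holds the distinct labels each encoder learned, in code order.
classes1 = list(encoder1.classes_)
classes2 = list(encoder2.classes_)

# Repeated values do not change what was learned.
same_mapping = classes1 == classes2

print(classes1)
print(same_mapping)
```

Both encoders end up with the same three classes, so the repeated entries in Y2 contribute nothing to the fitted state.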

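How LabelEncoder works, in simplified terms:

    Y2 = ['dog','cat','dog','horse','cat']

    mapping = {}

1. It sees 'dog'. The mapping does not contain it, so 'dog' gets label 0.

    mapping = {'dog': 0}

2. It sees 'cat'. The mapping does not contain it, so 'cat' gets label 1.

    mapping = {'dog': 0, 'cat': 1}

3. It sees 'dog'. The mapping already contains 'dog'.

    mapping = {'dog': 0, 'cat': 1}

4. It sees 'horse'. The mapping does not contain it, so 'horse' gets label 2.

    mapping = {'dog': 0, 'cat': 1, 'horse': 2}

5. It sees 'cat'. The mapping already contains 'cat'.

    mapping = {'dog': 0, 'cat': 1, 'horse': 2}

This is roughly the procedure, and at the end we have a mapping that assigns each element its label. Steps 3 and 5 add nothing, so they are redundant. Note that this is a simplification: scikit-learn's LabelEncoder actually numbers the sorted unique values, so the real result here is {'cat': 0, 'dog': 1, 'horse': 2}. Repeated values are redundant in exactly the same way.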
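The walk-through above can be written out in plain Python. This is only an illustrative sketch, not scikit-learn's implementation: the helper names build_mapping_first_seen and build_mapping_sorted are hypothetical. The first follows the simplified step-by-step description; the second numbers the sorted unique values, which is the ordering scikit-learn's LabelEncoder produces.

```python
def build_mapping_first_seen(labels):
    """Assign codes in order of first appearance (the simplified walk-through)."""
    mapping = {}
    for label in labels:
        if label not in mapping:       # a repeated label changes nothing
            mapping[label] = len(mapping)
    return mapping


def build_mapping_sorted(labels):
    """Assign codes to the sorted unique labels (LabelEncoder's ordering)."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}


Y2 = ['dog', 'cat', 'dog', 'horse', 'cat']

first_seen = build_mapping_first_seen(Y2)
sorted_map = build_mapping_sorted(Y2)

print(first_seen)   # {'dog': 0, 'cat': 1, 'horse': 2}
print(sorted_map)   # {'cat': 0, 'dog': 1, 'horse': 2}
```

Either way, the repeated 'dog' and 'cat' entries leave the mapping unchanged; only the assignment of integers to labels differs between the two orderings.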

2020/03/01 08:30 PM

李子明

Upvotes: 3

Downvotes: 0

Comments: 0

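LabelEncoder converts text into numeric codes: identical words get the same code, and different words get different codes.

To do this, LabelEncoder needs to:

1. Know the full set of words. This is what fit is for: it finds all the distinct words and assigns each one its own code.

2. Once every word has a code, use transform to convert the words in the data you care about into their numeric codes.

In general, we expect the training data to already contain every word, meaning that any word appearing in the test data also appears in the training data. Fitting on the training data alone is therefore enough.

If we then also fit on the test data, that is the redundancy you mentioned: the words in the test data are already contained in the training data, so the result of fit does not change, and fitting on the test data is wasted work.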
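The two steps described above might look like this in practice. This is a minimal sketch, assuming scikit-learn is installed; train_labels and test_labels are made-up example data, not from the course material.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical example data: every test label also appears in training.
train_labels = ['dog', 'cat', 'horse', 'cat']
test_labels = ['horse', 'dog', 'dog']

encoder = LabelEncoder()
encoder.fit(train_labels)                      # step 1: learn the distinct words

# Step 2: reuse the same fitted encoder for both sets.
train_codes = encoder.transform(train_labels).tolist()
test_codes = encoder.transform(test_labels).tolist()

# inverse_transform maps the codes back to the original words.
decoded = encoder.inverse_transform(test_codes).tolist()

print(train_codes)   # [1, 0, 2, 0]
print(test_codes)    # [2, 1, 1]
print(decoded)       # ['horse', 'dog', 'dog']
```

Because the encoder is fitted once and reused, a given word receives the same code in both the training and the test data.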

2020/03/01 11:59 PM

張維元 (WeiYuan)

Upvotes: 4

Downvotes: 0

Comments: 0

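Hi, here is a short reply to your questions.

Question 1: What does .fit do?

=> fit is the "training" step in ML: learning from the training data. For a model, that means learning the relationship between X and Y; for LabelEncoder, it means learning which labels exist and which integer each one gets.

Question 2: Why is .fit applied only to the training data?

=> Because normally only the training data is available in advance, so we usually "find the relationships in the training data and apply them to the testing data". In practice, if you can get the testing data ahead of time, you could also fit the encoder on testing data + training data.

Question 3: Regarding question 2, I don't quite understand the description of case 2 on Stack Overflow. Why is it redundant when training and testing contain the same kinds of labels?

=> Put another way, the testing data should not contain "newly appearing" values. If it does, the idea of "finding the relationships in the training data and applying them to the testing data" breaks down. How would you predict a new behavior that has never been seen before?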
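To illustrate what happens when the testing data does contain a new value, here is a minimal sketch, assuming scikit-learn is installed. The label 'bird' is a made-up value that the encoder never saw during fit; unseen_rejected and message are names introduced for this example.

```python
from sklearn.preprocessing import LabelEncoder

# Fit on the labels available at training time.
encoder = LabelEncoder().fit(['dog', 'cat', 'horse'])

try:
    # 'bird' was never seen during fit, so it has no code.
    encoder.transform(['dog', 'bird'])
    unseen_rejected = False
    message = ''
except ValueError as err:
    unseen_rejected = True
    message = str(err)

print(unseen_rejected)
print(message)
```

transform refuses the unseen label with a ValueError rather than inventing a code for it, which is why the training data is expected to cover every label that can occur later.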

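If this answer helped, please click the "Helpful" button; you can also follow my GitHub account. If you still have questions, feel free to ask follow-ups or post a summary of what you have understood, and I will give you a review and feedback 😃😃😃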