Which is more suitable: bin, np.linspace, or pd.cut? - Cupoy

ml100-2, binning, bin, ml100-2-d11

AI Co-learning Community

2019/04/30 10:49 AM

Machine Learning Co-learning Discussion Board

吳瑞洲

Regarding age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins = bin_cut)

I tried three ways of setting bin_cut:

1. [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70]

2. np.linspace(20, 70, 10)

which gives array([20. , 25.55555556, 31.11111111, 36.66666667, 42.22222222,
47.77777778, 53.33333333, 58.88888889, 64.44444444, 70. ])

3. pd.cut(age_data['YEARS_BIRTH'], bins = 10)

which gives (output not shown)

Which of these three is the most suitable?

Answers

2019/04/30 11:29 AM

張維元 (WeiYuan)

This kind of grouping is usually called binning. In pd.cut(data, bins=bin_cut), bin_cut specifies how the values are binned. For the actual usage, see the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

bins accepts an int, a sequence, or an IntervalIndex. The differences are as follows:

* int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.

* sequence of scalars : Defines the bin edges allowing for non-uniform width. No extension of the range of x is done.
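To make the two cases concrete, here is a minimal sketch. The `ages` Series is made-up data standing in for age_data['YEARS_BIRTH'], and the bin counts and edges are chosen only for illustration. It uses `retbins=True` to return the edges pandas actually used.

```python
import numpy as np
import pandas as pd

# Hypothetical ages standing in for age_data['YEARS_BIRTH'].
ages = pd.Series([21.3, 28.0, 34.5, 47.2, 55.9, 68.7])

# int: pandas derives equal-width edges from the data's own min and max.
by_count, count_edges = pd.cut(ages, bins=4, retbins=True)

# sequence: the edges are used exactly as supplied.
by_edges, given_edges = pd.cut(ages, bins=[20, 35, 50, 70], retbins=True)

print(count_edges)  # edges computed from the data
print(given_edges)  # edges exactly as passed in
print(by_edges)     # interval label assigned to each age
```

With an int, changing the data changes the edges; with a sequence, the edges stay fixed regardless of the data.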

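Your first two approaches pass a sequence and the third passes an int, and all three produce equal-width intervals. Note, though, that np.linspace(20, 70, 10) returns 10 edges and therefore only 9 bins, each about 5.56 wide; np.linspace(20, 70, 11) would reproduce the edges of the first approach. With an int, the edges are computed from the minimum and maximum of the data itself (with the range extended by .1%), so they usually do not fall on round numbers such as 25 or 30. If tidy, readable age groups matter, an explicit sequence is usually the safer choice.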
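The three options from the question can be compared side by side. This is a rough sketch: the random `age_data` below is a hypothetical stand-in for the real dataset, and `ten_bins` is an extra variant added for comparison rather than something from the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the real age_data['YEARS_BIRTH'].
rng = np.random.default_rng(0)
age_data = pd.DataFrame({"YEARS_BIRTH": rng.uniform(20.5, 69.0, size=1000)})

explicit = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70]  # 11 edges, 10 bins
nine_bins = np.linspace(20, 70, 10)                       # 10 edges, 9 bins
ten_bins = np.linspace(20, 70, 11)                        # 11 edges, 10 bins

# Count how many bins each sequence of edges actually produces.
options = [("explicit", explicit), ("linspace 10", nine_bins), ("linspace 11", ten_bins)]
for name, bin_cut in options:
    binned = pd.cut(age_data["YEARS_BIRTH"], bins=bin_cut)
    print(name, "->", binned.cat.categories.size, "bins")

# bins=10 lets pandas pick the edges from the data's min and max.
auto_binned, auto_edges = pd.cut(age_data["YEARS_BIRTH"], bins=10, retbins=True)
print(auto_edges)
```

One thing to keep in mind with explicit edges: pd.cut uses right-closed intervals by default, so a value exactly equal to the lowest edge (20 here) becomes NaN unless include_lowest=True is passed, and values outside the edges are also NaN.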