TF-IDF results don't seem to match the formula - Cupoy

2021/01/17 11:57 PM

Term Frequency - Inverse Document Frequency (TF-IDF)

黃易辰

I checked the results for the sample documents against the tf-idf formula, and they don't seem to match the classic formula. For example, for "document" in the first document of the corpus, the formula should give (1/5)*log(4/3) = 0.2*0.124938725751 ~= 0.025 (log base 10), but the tf-idf value computed by the sklearn tfidf package is 0.43877674. Is there a mistake in my understanding?

----

    # corpus
    corpus = [
        'This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document?',
    ]

    # tf
    ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
    [[0 1 1 1 0 0 1 0 1]
     [0 1 0 1 0 2 1 0 1]
     [1 0 0 0 1 0 1 1 0]
     [0 1 1 1 0 0 1 0 1]]

    # tf-idf values computed by the sklearn tfidf package
    [[0.         0.43877674 0.54197657 0.43877674 0.         0.         0.35872874 0.         0.43877674]
     [0.         0.27230147 0.         0.27230147 0.         0.85322574 0.22262429 0.         0.27230147]
     [0.55280532 0.         0.         0.         0.55280532 0.         0.28847675 0.55280532 0.        ]
     [0.         0.43877674 0.54197657 0.43877674 0.         0.         0.35872874 0.         0.43877674]]
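For reference, the hand calculation in the question can be reproduced with a few lines of plain Python. This is only a sketch of the textbook formula (relative term frequency times log10(n/df)); the `classic_tfidf` helper and the regex tokenizer are illustrative assumptions, not anything taken from scikit-learn.

```python
import math
import re

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Assumed tokenizer: lowercase, then keep runs of word characters.
docs = [re.findall(r"\w+", text.lower()) for text in corpus]


def classic_tfidf(term, doc_index):
    """Textbook tf-idf: (count / doc length) * log10(n_docs / doc_freq).

    Assumes the term occurs in at least one document.
    """
    tokens = docs[doc_index]
    tf = tokens.count(term) / len(tokens)
    doc_freq = sum(1 for d in docs if term in d)
    return tf * math.log10(len(docs) / doc_freq)


value = classic_tfidf("document", 0)
print(f"{value:.3f}")  # 0.025, the hand-computed value from the question
```

So the arithmetic in the question is fine for the textbook definition; the gap to 0.43877674 has to come from how the library defines tf-idf.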

Answers

2021/01/20 11:03 AM

魏培峰

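This is because TfidfTransformer differs from the standard tf-idf in a few ways.

1. The idf formula depends on the parameters passed in:
   - idf(t) = log [ n / df(t) ] + 1 (if ``smooth_idf=False``)
   - idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1 (if ``smooth_idf=True``, the default)
2. On output, the result is normalized with L2 (the default) or L1.

Using the course data as an example:

    import numpy as np

    # count: assumed to be the sparse term-count matrix for the corpus
    # (e.g. the output of CountVectorizer)
    tmp = count.toarray()
    tmp = tmp.astype('float')
    cnt_row = tmp.sum(axis=1)       # number of tokens in each document
    df = (tmp > 0).sum(axis=0)      # number of documents containing each term

    #### for smooth_idf=True
    for i in range(tmp.shape[0]):
        for j in range(tmp.shape[1]):
            tmp[i,j] = tmp[i,j]/cnt_row[i] * (np.log((4+1)/(df[j]+1))+1)

    #### for L2
    for i in range(tmp.shape[0]):
        l2 = ((tmp[i]**2).sum())**0.5
        tmp[i] = tmp[i]/l2

The resulting tmp is the same as the default output of TfidfTransformer. (Dividing by cnt_row is not something TfidfTransformer does itself, since it uses the raw counts as tf, but a constant factor per row cancels out under L2 normalization, so the result is unchanged.)

PS: TfidfTransformer uses the natural log, not base 10.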
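As a cross-check that does not need scikit-learn installed, the two steps above (smoothed idf with the natural log, then L2 normalization of each row) can be applied to the count matrix from the question using only the standard library. This is a sketch rather than scikit-learn's actual implementation, and the variable names are my own. It uses the raw counts as tf, and df(t) counts documents containing the term, not total occurrences.

```python
import math

# Term-count matrix copied from the question (rows = documents).
# Columns: and, document, first, is, one, second, the, third, this
counts = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 2, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term appears at least once.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf with the natural log: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]      # raw count * idf
    norm = math.sqrt(sum(v * v for v in weighted))    # L2 norm of the row
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print(" ".join(f"{v:.8f}" for v in row))
```

For "document" in the first document this works out to roughly 1.2231 / 2.7876 ≈ 0.4388, which agrees with the 0.43877674 reported in the question.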
2021/01/23 02:20 AM

張維元 (WeiYuan)

Hi there,
魏培峰's explanation is correct. Do you have any follow-up questions?
