Getting a "cannot reindex from a duplicate axis" error

2020/03/31 04:36 PM

Machine Learning Co-learning Discussion Board

Ava Chen

Views: 135

Answers: 5

Bookmarks: 1

When the program reaches this code:

df_temp = pd.DataFrame()
for c in object_features:
    df_temp[c] = LabelEncoder().fit_transform(df[c])
df_temp['Cabin_Hash'] = df['Cabin'].map(lambda x:hash(x) % 10)
train_X = df_temp[:train_num]
estimator = LogisticRegression()
print(cross_val_score(estimator, train_X, train_Y, cv=5).mean())
df_temp.head()

it raises ValueError: cannot reindex from a duplicate axis

What is triggering the reindex here?

======================================

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-34-55c94d744486> in <module>
     2 for c in object_features:
     3     df_temp[c] = LabelEncoder().fit_transform(df[c])
----> 4 df_temp['Cabin_Hash'] = df['Cabin'].map(lambda x:hash(x) % 10)
     5 train_X = df_temp[:train_num]
     6 estimator = LogisticRegression()

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
  3470         else:
  3471             # set column
-> 3472             self._set_item(key, value)
  3473
  3474     def _setitem_slice(self, key, value):

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in _set_item(self, key, value)
  3547
  3548         self._ensure_valid_index(value)
-> 3549         value = self._sanitize_column(key, value)
  3550         NDFrame._set_item(self, key, value)
  3551

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
  3709
  3710         if isinstance(value, Series):
-> 3711             value = reindexer(value)
  3712
  3713         elif isinstance(value, DataFrame):

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in reindexer(value)
  3700                     # duplicate axis
  3701                     if not value.index.is_unique:
-> 3702                         raise e
  3703
  3704                     # other

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in reindexer(value)
  3695                 # GH 4107
  3696                 try:
-> 3697                     value = value.reindex(self.index)._values
  3698                 except Exception as e:
  3699

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/series.py in reindex(self, index, **kwargs)
  4216     @Appender(generic.NDFrame.reindex.__doc__)
  4217     def reindex(self, index=None, **kwargs):
-> 4218         return super().reindex(index=index, **kwargs)
  4219
  4220     def drop(

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in reindex(self, *args, **kwargs)
  4512         # perform the reindex on the axes
  4513         return self._reindex_axes(
-> 4514             axes, level, limit, tolerance, method, fill_value, copy
  4515         ).__finalize__(self)
  4516

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in _reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
  4533                 fill_value=fill_value,
  4534                 copy=copy,
-> 4535                 allow_dups=False,
  4536             )
  4537

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in _reindex_with_indexers(self, reindexers, fill_value, copy, allow_dups)
  4575                 fill_value=fill_value,
  4576                 allow_dups=allow_dups,
-> 4577                 copy=copy,
  4578             )
  4579

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups, copy)
  1249         # some axes don't allow reindexing with dups
  1250         if not allow_dups:
-> 1251             self.axes[axis]._can_reindex(indexer)
  1252
  1253         if axis >= self.ndim:

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
  3360         # trying to reindex on an axis with duplicates
  3361         if not self.is_unique and len(indexer):
-> 3362             raise ValueError("cannot reindex from a duplicate axis")
  3363
  3364     def reindex(self, target, method=None, level=None, limit=None, tolerance=None):

ValueError: cannot reindex from a duplicate axis

========================================

I also tried applying the hash transform to df_temp after label encoding. Would that give a different result from hashing the original strings directly?

Answers

2020/03/31 06:36 PM

Jeffrey

Upvotes: 1

Downvotes: 0

Comments: 1

This error message appears when a DataFrame has duplicate index labels. You can check for them with df_temp.index.duplicated().
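A minimal sketch of that check follows. It assumes df was built by concatenating a train frame and a test frame without resetting the index, which the thread does not confirm. The frames `train` and `test` and their values are made up for illustration. The traceback raises from the `if not value.index.is_unique` branch, which suggests the Series being assigned (df['Cabin']) is the one with repeated labels, so it may be worth checking df.index as well as df_temp.index.

```python
import pandas as pd

# Hypothetical stand-ins for the real data, which the thread does not show.
train = pd.DataFrame({"Cabin": ["C85", None, "E46"]})
test = pd.DataFrame({"Cabin": ["B45", "C85"]})

# Concatenating without ignore_index=True keeps each frame's own labels,
# so the combined index is 0, 1, 2, 0, 1.
df = pd.concat([train, test])

# False means at least one index label occurs more than once.
print(df.index.is_unique)

# keep=False marks every occurrence of a repeated label, not only the later ones.
dup_mask = df.index.duplicated(keep=False)
print(df[dup_mask])
```

If `is_unique` comes back False for df but True for df_temp, the two frames no longer share the same labels, and assigning a Series from one to the other forces pandas to reindex.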

2020/04/05 04:05 AM

張維元 (WeiYuan)

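Upvotes: 0

Downvotes: 0

Comments: 0

Hi, Ava Chen,

The first problem, "ValueError: cannot reindex from a duplicate axis", happens because the data does not line up. You may want to check first whether something is duplicated.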
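To make "does not line up" concrete, here is a small reproduction with one possible fix. It is a sketch under assumptions: the sample frames and the `Cabin_Code` column are invented, and the idea that df comes from an un-reset `pd.concat` is a guess, not something the thread states. Assigning a Series to a DataFrame column aligns on index labels, and that alignment is what fails when the Series has repeated labels.

```python
import pandas as pd

# Hypothetical data; the combined index is 0, 1, 2, 0, 1.
train = pd.DataFrame({"Cabin": ["C85", None, "E46"]})
test = pd.DataFrame({"Cabin": ["B45", "C85"]})
df = pd.concat([train, test])

# Assigning a plain list (or NumPy array, as LabelEncoder returns) gives
# df_temp a fresh 0..4 index, which no longer matches df's index.
df_temp = pd.DataFrame()
df_temp["Cabin_Code"] = list(range(len(df)))

raised = False
try:
    # A Series is aligned on index labels, so pandas tries to reindex it.
    df_temp["Cabin_Hash"] = df["Cabin"].map(lambda x: hash(x) % 10)
except ValueError as err:
    raised = True
    print("alignment failed:", err)

# One possible fix: give df unique labels before building derived columns.
df = df.reset_index(drop=True)
df_temp["Cabin_Hash"] = df["Cabin"].map(lambda x: hash(x) % 10)
print(df_temp.shape)
```

The exact wording of the error differs between pandas versions. Passing `ignore_index=True` to `pd.concat` when the frames are first combined would avoid the problem at the source. Assigning `.to_numpy()` instead of the Series skips alignment entirely, at the cost of relying on row order.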

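If this answer helped you, please click the "Helpful" button; you can also follow my GitHub account. If you still have questions, feel free to keep asking or to post a summary of what you have understood, and I will give you a review and feedback 😃😃😃

2020/04/05 04:06 AM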

張維元 (WeiYuan)

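Upvotes: 1

Downvotes: 0

Comments: 1

Hi, Ava Chen,

The second question: "I also tried applying the hash transform to df_temp after label encoding. Would that give a different result from hashing the original strings directly?"

=> It depends on how you hash. A plain string cannot take the % operation directly (you would need a different kind of hash).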
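As a small illustration of that point: applying % to a str in Python is string formatting, not arithmetic, so the string has to become an integer first. The built-in hash() does that, but for str it is salted per process unless PYTHONHASHSEED is fixed, so the buckets can change between runs. The sketch below uses a hypothetical helper, `stable_bucket`, that derives the integer from an MD5 digest instead. It is one possible alternative, not something the thread prescribes.

```python
import hashlib

# "C85" % 10 is treated as string formatting and fails with TypeError.
try:
    "C85" % 10
except TypeError as err:
    print("str % int fails:", err)


def stable_bucket(value, n_buckets=10):
    """Map a value to a bucket in [0, n_buckets) via MD5 of its string form.

    Hypothetical helper for illustration. MD5 is used only as a
    deterministic hash here, not for any security purpose.
    """
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


# The same input maps to the same bucket in every Python process.
print(stable_bucket("C85"), stable_bucket("E46"))
```

With a pandas column this could be applied as `df['Cabin'].map(stable_bucket)`.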

2020/04/10 12:16 AM

張維元 (WeiYuan)

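Upvotes: 0

Downvotes: 0

Comments: 1

Hi, Ava,

"Oh! Sorry, let me ask this a different way. I tried two approaches, and <Method 2> got the higher cross_val_score. But how can I be sure that <Method 2> is a reasonable, valid approach? Or is it basically true that whatever works and scores higher is a usable method?"

=> What is the difference between the two methods here? Is it whether Cabin was label-encoded first?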

2020/04/12 02:35 AM

張維元 (WeiYuan)

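Upvotes: 1

Downvotes: 0

Comments: 0

"Right! <Method 1> is df['Cabin'] -> Hash, and <Method 2> is df['Cabin'] -> LabelEncode -> Hash, but I am not sure whether <Method 2> is valid."

=> It is valid, and the latter is better.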
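The two methods can be put side by side as below. This is a sketch on made-up Cabin values, not the thread's data, and it does not claim to be the answerer's reasoning. One observable difference: in CPython the hash of a small non-negative integer is the integer itself, so Method 2 reduces to `code % 10` and gives the same buckets on every run, while Method 1 depends on the per-process string hash seed.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values with no missing entries.
cabin = pd.Series(["C85", "E46", "B45", "C85", "G6"], name="Cabin")

# Method 1: hash the raw strings, then bucket. The bucket values can
# change between Python processes unless PYTHONHASHSEED is fixed.
method_1 = cabin.map(lambda x: hash(x) % 10)

# Method 2: label-encode first, then hash the integer codes and bucket.
# LabelEncoder sorts the classes: B45 -> 0, C85 -> 1, E46 -> 2, G6 -> 3.
codes = LabelEncoder().fit_transform(cabin)
method_2 = pd.Series(codes, index=cabin.index).map(lambda x: hash(x) % 10)

print(method_1.tolist())
print(method_2.tolist())  # [1, 2, 0, 1, 3]
```

In both methods, equal Cabin strings land in the same bucket within a single run. Whether either encoding is meaningful for the model is a separate question that the cross_val_score comparison alone does not settle.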
