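aggregate usage and a LabelEncoding problem
1. In In[5] of the example notebook there is the comment "# e.g. df.groupby(['Ticket']).size(), but the column name would become size".
How does that comment relate to the code here? And what does the .agg call after it do?
2. My homework raises an error; the code is attached.
Error:
(traceback posted in the reply below)
One more question: today's lesson felt a bit disorganized. How should I go about consolidating the material?
Answers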
-
2020/03/27 12:30 PM, yicchen (Upvotes: 0, Downvotes: 0, Comments: 0)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py:530: FutureWarning: From version 0.22, errors during fit will result in a cross validation score of NaN by default. Use error_score='raise' if you want an exception raised or error_score=np.nan to adopt the behavior from version 0.22.
FutureWarning)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-b62b2faef2b7> in <module>
6 train_X = df_temp[:train_num]
7 estimator = LogisticRegression()
----> 8 print(cross_val_score(estimator, train_X, train_Y, cv=5).mean())
9 df_temp.head()
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
389 fit_params=fit_params,
390 pre_dispatch=pre_dispatch,
--> 391 error_score=error_score)
392 return cv_results['test_score']
393
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
230 return_times=True, return_estimator=return_estimator,
231 error_score=error_score)
--> 232 for train, test in cv.split(X, y, groups))
233
234 zipped_scores = list(zip(*scores))
C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
919 # remaining jobs.
920 self._iterating = False
--> 921 if self.dispatch_one_batch(iterator):
922 self._iterating = self._original_iterator is not None
923
C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
757 return False
758 else:
--> 759 self._dispatch(tasks)
760 return True
761
C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py in _dispatch(self, batch)
714 with self._lock:
715 job_idx = len(self._jobs)
--> 716 job = self._backend.apply_async(batch, callback=cb)
717 # A job can complete so quickly than its callback is
718 # called before we get here, causing self._jobs to
C:\ProgramData\Anaconda3\lib\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
180 def apply_async(self, func, callback=None):
181 """Schedule a func to be run"""
--> 182 result = ImmediateResult(func)
183 if callback:
184 callback(result)
C:\ProgramData\Anaconda3\lib\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
547 # Don't delay the application, to avoid keeping the input
548 # arguments in memory
--> 549 self.results = batch()
550
551 def get(self):
C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py in __call__(self)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
C:\ProgramData\Anaconda3\lib\site-packages\joblib\parallel.py in <listcomp>(.0)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
514 estimator.fit(X_train, **fit_params)
515 else:
--> 516 estimator.fit(X_train, y_train, **fit_params)
517
518 except Exception as e:
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py in fit(self, X, y, sample_weight)
1530
1531 X, y = check_X_y(X, y, accept_sparse='csr', dtype=_dtype, order="C",
-> 1532 accept_large_sparse=solver != 'liblinear')
1533 check_classification_targets(y)
1534 self.classes_ = np.unique(y)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
717 ensure_min_features=ensure_min_features,
718 warn_on_dtype=warn_on_dtype,
--> 719 estimator=estimator)
720 if multi_output:
721 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
494 try:
495 warnings.simplefilter('error', ComplexWarning)
--> 496 array = np.asarray(array, dtype=dtype, order=order)
497 except ComplexWarning:
498 raise ValueError("Complex data not supported\n"
C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540
ValueError: could not convert string to float: 'None'
-
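2020/03/27 4:45 PM, Justin (Upvotes: 3, Downvotes: 0, Comments: 0)
Hello~
1. The homework fails because the wrong column was used. Line 5 of In [5] should be changed to
df_temp["Cabin_Count"] = df["Cabin_Count"]
The original line pulls in the Cabin column that has not been label-encoded, so it is still an object-dtype (string) column, and the model cannot convert it to float.
2. agg is simply short for aggregate; I'd suggest taking a look at the documentation. In the example, it applies the size function to the selected column within each group and collects the result into a new column named Ticket_Count. The dictionary form is the older style, though; use .agg(Ticket_Count='size') instead.
3. This is just my own take: to consolidate the material, compare today's encoding with the ones from the previous few days, and work out where each encoding is used and what effect it has.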
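To make point 1 concrete, here is a minimal sketch of how a string column triggers this kind of ValueError and how encoding avoids it. The toy DataFrame, its values, and the Fare column are assumptions for illustration only; the real homework data and notebook cells are not reproduced here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the homework data (values are made up for illustration).
df = pd.DataFrame({
    "Cabin": ["C85", None, "C85", "E46", None],
    "Fare": [71.3, 7.9, 53.1, 51.9, 8.1],
})

# Missing cabins become the string 'None', matching the value in the error message.
df["Cabin"] = df["Cabin"].fillna("None")

# A raw string column cannot be converted to float, which is what the model attempts.
raised = False
try:
    df[["Cabin"]].to_numpy(dtype=float)
except ValueError:
    raised = True

# Count encoding: how many rows share each Cabin value.
df["Cabin_Count"] = df.groupby("Cabin")["Cabin"].transform("count")

# Build the model input from encoded columns only.
df_temp = pd.DataFrame()
df_temp["Cabin"] = LabelEncoder().fit_transform(df["Cabin"])  # strings -> integer codes
df_temp["Cabin_Count"] = df["Cabin_Count"]
df_temp["Fare"] = df["Fare"]

# Guard: list any column that is still non-numeric before fitting a model.
non_numeric = df_temp.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Running a check like the last two lines before calling cross_val_score is a cheap way to catch a column that slipped through without being encoded.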
-
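2020/03/28 4:10 PM, 張維元 (WeiYuan) (Upvotes: 1, Downvotes: 0, Comments: 0)
Hi, to add to the above, here is the operation broken down step by step:
1. df.groupby(['Ticket']) => groups df by Ticket, so the result is one group for Ticket = OOO, another for Ticket = XXX, and so on.
2. ['Name'] => selects the Name column from each group.
3. .agg({'Ticket_Count':'size'}) => counts how many rows each group has (size).
4. .reset_index() => resets the index (grouping turns the Ticket values into the index; this moves them back into an ordinary column).
If this answer helped, please click the "helpful" button; you can also follow my GitHub account. If you still have questions, feel free to keep asking or post a write-up of what you understood, and I will give you a review and feedback 😃😃😃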
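The four steps above can be sketched on a toy table as follows. The data is made up, the keyword form .agg(Ticket_Count='size') is used in place of the dictionary form (which, as far as I know, newer pandas versions reject for renaming), and the final merge back onto df is my assumption about how the count is typically used, not something taken from the notebook.

```python
import pandas as pd

# Toy data (made up for illustration).
df = pd.DataFrame({
    "Ticket": ["A1", "A1", "B2", "C3", "C3", "C3"],
    "Name": ["Ann", "Bob", "Cid", "Dee", "Eve", "Fay"],
})

grouped = df.groupby(["Ticket"])          # step 1: one group per Ticket value
names = grouped["Name"]                   # step 2: the Name column of each group
counts = names.agg(Ticket_Count="size")   # step 3: rows per group, column named Ticket_Count
count_df = counts.reset_index()           # step 4: Ticket moves from the index back to a column

# Equivalent route through size(), naming the count column explicitly.
alt_df = grouped.size().reset_index(name="Ticket_Count")

# Assumed follow-up: attach the count to every original row.
merged = df.merge(count_df, on="Ticket", how="left")
print(merged)
```

Either route yields a two-column table (Ticket, Ticket_Count); the agg form simply lets you choose the output column name in the same call.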