3 Answers

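Contributed 1784 experience points, received over 2 likes
The best solution I have found is to insert a custom transformer into the Pipeline that reshapes the output of SimpleImputer from 2D to 1D before it is passed to CountVectorizer.
Here is the complete code: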
import pandas as pd
import numpy as np
df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='constant')
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
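# CREATE TRANSFORMER
from sklearn.preprocessing import FunctionTransformer
# Reshape the imputer's (n, 1) output to (n,) so CountVectorizer gets a 1D sequence of strings.
# NumPy 2.1+ deprecates the `newshape` keyword; FunctionTransformer(np.ravel) is an equivalent alternative here.
one_dim = FunctionTransformer(np.reshape, kw_args={'newshape':-1})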
# INCLUDE TRANSFORMER IN PIPELINE
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, one_dim, vect)
pipe.fit_transform(df[['text']]).toarray()
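It has been proposed on GitHub that CountVectorizer should accept 2D input as long as the second dimension is 1 (that is, a single column of data). That change to CountVectorizer would be a great solution to this problem!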
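To make the 2D-versus-1D mismatch concrete, here is a minimal sketch that only inspects shapes. It reuses the `df` from the answer; the names `imputed` and `flat` are introduced just for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})

# SimpleImputer returns a 2D array, even when it is given a single column.
imputed = SimpleImputer(strategy='constant').fit_transform(df[['text']])
print(imputed.shape)  # (3, 1)

# CountVectorizer expects a 1D iterable of strings, so flatten before vectorizing.
flat = np.reshape(imputed, -1)
print(flat.shape)     # (3,)
print(flat.tolist())  # ['abc def', 'abc ghi', 'missing_value']
```

The reshaping step in the pipeline above does exactly this flattening between the two estimators.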

Contributed 1864 experience points, received over 6 likes
One solution is to create a subclass of SimpleImputer and override its transform() method:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
class ModifiedSimpleImputer(SimpleImputer):
    def transform(self, X):
        # Flatten the (n, 1) output to (n,) so CountVectorizer can consume it
        return super().transform(X).flatten()
df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})
imp = ModifiedSimpleImputer(strategy='constant')
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, vect)
pipe.fit_transform(df[['text']]).toarray()

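Contributed 1942 experience points, received over 3 likes
I use this one-dimensional wrapper for sklearn transformers when I have one-dimensional data. I think that in your case it can be used to wrap SimpleImputer for one-dimensional data (a pandas Series with string values).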
import numpy as np

class OneDWrapper:
    """One dimensional wrapper for sklearn Transformers"""
    def __init__(self, transformer):
        self.transformer = transformer
    def fit(self, X, y=None):
        # Present the 1D input as a single column, which is what sklearn transformers expect
        self.transformer.fit(np.array(X).reshape(-1, 1))
        return self
    def transform(self, X, y=None):
        # Transform as a single column, then flatten back to 1D
        return self.transformer.transform(
            np.array(X).reshape(-1, 1)).ravel()
    def inverse_transform(self, X, y=None):
        return self.transformer.inverse_transform(
            np.expand_dims(X, axis=1)).ravel()
Now you don't need an extra step in the pipeline.
one_d_imputer = OneDWrapper(SimpleImputer(strategy='constant'))
pipe = make_pipeline(one_d_imputer, vect)
pipe.fit_transform(df['text']).toarray()
# note we are feeding a pd.Series here!
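If the wrapper also needs to work with `clone`, `get_params`, or a grid search, one option is to derive it from scikit-learn's `BaseEstimator` and `TransformerMixin`. The following is only a sketch under that assumption; the class name `OneDEstimatorWrapper` and the fitted attribute `transformer_` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


class OneDEstimatorWrapper(BaseEstimator, TransformerMixin):
    """Hypothetical 1D wrapper that follows the sklearn estimator conventions."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        # Fit a clone so the transformer passed to __init__ stays unfitted.
        self.transformer_ = clone(self.transformer)
        self.transformer_.fit(np.asarray(X).reshape(-1, 1))
        return self

    def transform(self, X):
        # Present the data as one column, then flatten the result back to 1D.
        return self.transformer_.transform(np.asarray(X).reshape(-1, 1)).ravel()


df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})
pipe = make_pipeline(
    OneDEstimatorWrapper(SimpleImputer(strategy='constant')),
    CountVectorizer(),
)
counts = pipe.fit_transform(df['text']).toarray()  # a pd.Series, as above
pipe_copy = clone(pipe)  # works because the wrapper exposes get_params
print(counts)
```

The behavior on the example data should match the plain wrapper; the difference is only that the estimator plumbing comes from the base classes.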