
How to include ... before CountVectorizer in a scikit-learn pipeline

Python

楊__羊羊 2023-03-22 16:56:58

I have a pandas DataFrame that contains a column of text, and I want to vectorize the text using scikit-learn's CountVectorizer. However, the text includes missing values, so I want to impute a constant value before vectorizing.

My initial idea was to create a Pipeline of SimpleImputer and CountVectorizer:

    import pandas as pd
    import numpy as np
    df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

    from sklearn.impute import SimpleImputer
    imp = SimpleImputer(strategy='constant')

    from sklearn.feature_extraction.text import CountVectorizer
    vect = CountVectorizer()

    from sklearn.pipeline import make_pipeline
    pipe = make_pipeline(imp, vect)

    pipe.fit_transform(df[['text']]).toarray()

However, fit_transform raises an error because SimpleImputer outputs a 2D array and CountVectorizer requires 1D input. Here is the error message:

    AttributeError: 'numpy.ndarray' object has no attribute 'lower'

Question: How can I modify this Pipeline so that it works?

Note: I know I can impute the missing values in pandas. However, I want to do all of the preprocessing in scikit-learn so that it can be handled by a Pipeline.
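
For readers who want to see the shape mismatch behind that error, here is a minimal sketch (not part of the original question). It assumes the same example data; the explicit `dtype=object` is an addition of this sketch so the column holds plain Python strings regardless of pandas version.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same example data as in the question; dtype=object is an assumption of this sketch.
df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]}, dtype=object)

# SimpleImputer works on 2D input (n_samples, n_features) and returns 2D output.
imputed = SimpleImputer(strategy='constant').fit_transform(df[['text']])
print(imputed.shape)  # (3, 1)

# Iterating over a 2D array yields rows. Each row is a NumPy array, not a str,
# so calling .lower() on it fails, which matches the AttributeError above.
first_doc = next(iter(imputed))
print(type(first_doc))

# Flattening gives the 1D sequence of strings that CountVectorizer expects.
flat = imputed.ravel()
print(flat.shape, type(flat[0]))
```

The answers below all amount to performing that flattening step somewhere between the imputer and the vectorizer.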


3 Answers

一只斗牛犬

Contributed 1784 experience points · earned 2+ likes

The best solution I have found is to insert a custom transformer into the Pipeline that reshapes the output of SimpleImputer from 2D to 1D before it is passed to CountVectorizer.

Here is the complete code:

    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import FunctionTransformer

    df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

    imp = SimpleImputer(strategy='constant')
    vect = CountVectorizer()

    # CREATE TRANSFORMER: reshape the imputer's 2D output to 1D.
    # Note: recent NumPy releases deprecate the `newshape` keyword in favor of
    # `shape`; FunctionTransformer(np.ravel) avoids the keyword altogether.
    one_dim = FunctionTransformer(np.reshape, kw_args={'newshape':-1})

    # INCLUDE TRANSFORMER IN PIPELINE, between the imputer and the vectorizer
    pipe = make_pipeline(imp, one_dim, vect)

    pipe.fit_transform(df[['text']]).toarray()

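It has been proposed on GitHub that CountVectorizer should accept 2D input as long as the second dimension is 1 (that is, a single column of data). That change to CountVectorizer would be a great solution to this problem!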
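
To illustrate what the reshaping pipeline ends up learning, and how it treats missing values in new data, here is a small sketch. It is a variation on the code above rather than the answer's exact code: it uses `np.ravel` instead of `np.reshape`, the `new` DataFrame is made-up illustration data, and `dtype=object` is an assumption added to keep the column as plain Python strings.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]}, dtype=object)

pipe = make_pipeline(
    SimpleImputer(strategy='constant'),  # default string fill is 'missing_value'
    FunctionTransformer(np.ravel),       # 2D -> 1D, a variation on np.reshape
    CountVectorizer(),
)
X = pipe.fit_transform(df[['text']])

# The imputed placeholder becomes a token of its own in the vocabulary.
vocab = sorted(pipe.named_steps['countvectorizer'].vocabulary_)
print(vocab)  # ['abc', 'def', 'ghi', 'missing_value']
print(X.toarray())

# Hypothetical new data: the fitted pipeline imputes and vectorizes it the same way.
new = pd.DataFrame({'text': ['def ghi', np.nan]}, dtype=object)
X_new = pipe.transform(new[['text']])
print(X_new.toarray())
```

Note that the placeholder shows up as its own feature column. If that column is unwanted, passing `fill_value=''` to SimpleImputer is one option to consider.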

2023-03-22

慕尼黑的夜晚無繁華

Contributed 1864 experience points · earned 6+ likes

One solution is to subclass SimpleImputer and override its transform() method:

    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline

    class ModifiedSimpleImputer(SimpleImputer):
        def transform(self, X):
            # Flatten the 2D imputed output so CountVectorizer receives 1D input
            return super().transform(X).flatten()

    df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

    imp = ModifiedSimpleImputer(strategy='constant')
    vect = CountVectorizer()

    pipe = make_pipeline(imp, vect)

    pipe.fit_transform(df[['text']]).toarray()

2023-03-22

手掌心

Contributed 1942 experience points · earned 3+ likes

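When I have one-dimensional data, I use this one-dimensional wrapper for sklearn transformers. I think that in your case it can be used to wrap the SimpleImputer for one-dimensional data (a pandas Series with string values).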

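    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin

    # Inheriting from sklearn's base classes (an addition to the original answer)
    # provides get_params/set_params, so the wrapper can be cloned.
    class OneDWrapper(TransformerMixin, BaseEstimator):
        """One dimensional wrapper for sklearn Transformers"""

        def __init__(self, transformer):
            self.transformer = transformer

        def fit(self, X, y=None):
            # Reshape 1D input to a single column before fitting
            self.transformer.fit(np.array(X).reshape(-1, 1))
            return self

        def transform(self, X, y=None):
            # Transform as a single column, then flatten back to 1D
            return self.transformer.transform(
                np.array(X).reshape(-1, 1)).ravel()

        def inverse_transform(self, X, y=None):
            return self.transformer.inverse_transform(
                np.expand_dims(X, axis=1)).ravel()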

Now you do not need an extra step in the pipeline.

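    # df, vect, SimpleImputer and make_pipeline are as defined in the question
    one_d_imputer = OneDWrapper(SimpleImputer(strategy='constant'))
    pipe = make_pipeline(one_d_imputer, vect)

    # Note: we are feeding a pd.Series here, not a one-column DataFrame!
    pipe.fit_transform(df['text']).toarray()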

2023-03-22
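
As a quick cross-check of the three answers (not part of the original thread), the sketch below builds one pipeline per approach and compares their outputs on the question's data. The names `FlatteningImputer`, `reshape_pipe`, `subclass_pipe`, and `wrapper_pipe` are hypothetical, `np.ravel` stands in for the reshape step, and the wrapper here is a trimmed version that inherits from sklearn's base classes, which is an assumption of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# dtype=object keeps the column as plain Python strings (assumption of this sketch).
df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]}, dtype=object)


class FlatteningImputer(SimpleImputer):
    """SimpleImputer whose transform() returns a 1D array (second answer's idea)."""

    def transform(self, X):
        return super().transform(X).ravel()


class OneDWrapper(TransformerMixin, BaseEstimator):
    """Feeds 1D data to a 2D-only transformer and returns 1D (third answer's idea)."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        self.transformer.fit(np.asarray(X, dtype=object).reshape(-1, 1))
        return self

    def transform(self, X):
        X2d = np.asarray(X, dtype=object).reshape(-1, 1)
        return self.transformer.transform(X2d).ravel()


# Approach 1: extra reshaping step between imputer and vectorizer.
reshape_pipe = make_pipeline(
    SimpleImputer(strategy='constant'),
    FunctionTransformer(np.ravel),
    CountVectorizer(),
)
# Approach 2: imputer subclass that flattens its own output.
subclass_pipe = make_pipeline(FlatteningImputer(strategy='constant'), CountVectorizer())
# Approach 3: wrapper that accepts and returns 1D data.
wrapper_pipe = make_pipeline(
    OneDWrapper(SimpleImputer(strategy='constant')),
    CountVectorizer(),
)

a = reshape_pipe.fit_transform(df[['text']]).toarray()   # one-column DataFrame
b = subclass_pipe.fit_transform(df[['text']]).toarray()  # one-column DataFrame
c = wrapper_pipe.fit_transform(df['text']).toarray()     # pd.Series

same = np.array_equal(a, b) and np.array_equal(b, c)
print(same)
```

The main practical difference is the input each pipeline expects: the first two take a one-column DataFrame, while the wrapper takes a Series.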
