
Encode and assemble multiple features in PySpark

Python

湖上湖 2019-10-09 17:37:11

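I have a Python class that I use to load and process some data in Spark. Among the various things it needs to do, I am generating a list of dummy variables derived from several columns of a Spark dataframe. My problem is that I am not sure how to properly define a user-defined function to accomplish what I need.

I do currently have a method that, when mapped over the underlying dataframe RDD, solves half of the problem (bear in mind that this is a method in a larger data_processor class):

```python
def build_feature_arr(self, table):
    # this dict has keys for all the columns for which I need dummy coding
    categories = {'gender': ['1', '2'], ..}

    # there are actually two different dataframes that I need to do this for;
    # this just specifies which one I'm looking at and grabs the relevant
    # features from a config file
    if table == 'users':
        iter_over = self.config.dyadic_features_to_include
    elif table == 'activty':
        iter_over = self.config.user_features_to_include

    def _build_feature_arr(row):
        result = []
        row = row.asDict()
        for col in iter_over:
            column_value = str(row[col]).lower()
            cats = categories[col]
            # one 0/1 flag per category of this column
            result += [1 if column_value and cat == column_value else 0 for cat in cats]
        return result

    return _build_feature_arr
```

Essentially, for the specified dataframe, this takes the categorical values of the specified columns and returns a list of the values of the new dummy variables. That means the following code:

```python
data = data_processor(init_args)
result = data.user_data.rdd.map(data.build_feature_arr('users'))
```

returns something like:

```
In [39]: result.take(10)
Out[39]:
[[1, 0, 0, 0, 1, 0],
 [1, 0, 0, 1, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [1, 0, 1, 0, 0, 0],
 [1, 0, 0, 1, 0, 0],
 [1, 0, 0, 1, 0, 0],
 [0, 1, 1, 0, 0, 0],
 [1, 0, 1, 1, 0, 0],
 [1, 0, 0, 1, 0, 0],
 [1, 0, 0, 0, 0, 1]]
```

This is exactly what I want in terms of generating the list of dummy variables, but here is my question: how can I either (a) make a UDF with similar functionality that I can use in a Spark SQL query (or in some other way, I suppose), or (b) take the RDD resulting from the map above and add it as a new column to the user_data dataframe?

Either way, what I need is a new dataframe containing the columns from user_data along with a new column (let's call it feature_array) containing the output of the function above (or something functionally equivalent).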
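For option (a), one possible approach is to wrap the same dummy-coding logic in a UDF that returns an array of integers and attach it with `withColumn`. The sketch below is only an illustration under assumptions, not a confirmed solution from this thread: the helper names `encode_values` and `add_feature_array` and the `out_col` parameter are hypothetical, while `udf`, `struct`, `ArrayType` and `IntegerType` are standard PySpark APIs. The PySpark imports sit inside the function so the encoding logic can be checked without a Spark session.

```python
def encode_values(values, columns, categories):
    """Dummy-code one row (given as a dict) into a flat list of 0/1 flags."""
    result = []
    for col in columns:
        # Same normalisation as in the question: compare lower-cased strings.
        column_value = str(values[col]).lower()
        result += [
            1 if column_value and cat == column_value else 0
            for cat in categories[col]
        ]
    return result


def add_feature_array(df, columns, categories, out_col="feature_array"):
    """Return df plus an array<int> column holding the dummy variables."""
    # Imported lazily so encode_values stays usable without Spark installed.
    from pyspark.sql.functions import struct, udf
    from pyspark.sql.types import ArrayType, IntegerType

    encode = udf(
        lambda row: encode_values(row.asDict(), columns, categories),
        ArrayType(IntegerType()),
    )
    # struct(...) packs the selected columns into a single Row argument.
    return df.withColumn(out_col, encode(struct(*columns)))
```

Under these assumptions, a call such as `add_feature_array(data.user_data, ['gender'], {'gender': ['1', '2']})` would keep every column of user_data and append feature_array. A Python UDF runs row by row, so on large data the built-in feature transformers in `pyspark.ml.feature` may be worth comparing against.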


2 answers

手掌心

Contributed 1942 experience points, received more than 3 likes

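I have a question. If I run a randomforest_Classifier on this data, I will get random forest leaves based on numbers (because the values have been indexed). How do I neatly tie them back to the original description (i.e. the English text)? The random forest classifier does not carry any metadata, for instance, which makes this a difficult task. I have a vague idea that I need to use something like IndexToString(), but I am not sure how to use it.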

2019-10-09
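As a rough illustration of the IndexToString idea raised above: if the numeric values were produced by a fitted `StringIndexer`, its `labels` list records which string belongs to each index, and `IndexToString` can use that list to decode a numeric column. The sketch below assumes exactly that setup; the helper names `index_to_label` and `add_readable_predictions` and the column names `prediction` and `predicted_label` are hypothetical. It decodes a single indexed column and does not by itself explain the internal splits of a forest.

```python
def index_to_label(index, labels):
    """Plain-Python view of the lookup IndexToString performs for each row."""
    return labels[int(index)]


def add_readable_predictions(predictions, indexer_model,
                             in_col="prediction", out_col="predicted_label"):
    """Append a string column that decodes a numeric index column."""
    # Imported lazily so index_to_label works without Spark installed.
    from pyspark.ml.feature import IndexToString

    converter = IndexToString(
        inputCol=in_col,
        outputCol=out_col,
        # Order of strings learned by the fitted StringIndexer (assumed setup).
        labels=indexer_model.labels,
    )
    return converter.transform(predictions)
```

Here `indexer_model` stands for the result of `StringIndexer(...).fit(df)`, and `predictions` for the dataframe returned by the trained classifier's `transform` call.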