PySpark 2.2 explode drops null rows (how to implement explode_outer)?

I'm working with some deeply nested data in a PySpark dataframe. As I try to flatten the structure into rows and columns, I noticed that when I call `withColumn`, if the row contains `null` in the source column, then that row is dropped from my result dataframe. Instead, I'd like to find a way to retain the row and have `null` in the resulting column.

An example dataframe to work with:

```python
from pyspark.sql.functions import explode, first, col, monotonically_increasing_id
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(dataCells=[Row(posx=0, posy=1, posz=.5, value=1.5,
                       shape=[Row(_type='square', _len=1)]),
                   Row(posx=1, posy=3, posz=.5, value=4.5,
                       shape=[]),
                   Row(posx=2, posy=5, posz=.5, value=7.5,
                       shape=[Row(_type='circle', _len=.5)])
                   ])
])
```

I also have a function I use to flatten the structs:

```python
def flatten_struct_cols(df):
    # Split columns into plain columns and struct columns
    flat_cols = [column[0] for column in df.dtypes
                 if 'struct' not in column[1][:6]]
    struct_columns = [column[0] for column in df.dtypes
                      if 'struct' in column[1][:6]]

    # Promote each struct field to a top-level column named <struct>_<field>
    df = df.select(flat_cols +
                   [col(sc + '.' + c).alias(sc + '_' + c)
                    for sc in struct_columns
                    for c in df.select(sc + '.*').columns])

    return df
```

The schema looks like:

```
df.printSchema()

root
 |-- dataCells: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- posx: long (nullable = true)
 |    |    |-- posy: long (nullable = true)
 |    |    |-- posz: double (nullable = true)
 |    |    |-- shape: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- _len: long (nullable = true)
 |    |    |    |    |-- _type: string (nullable = true)
 |    |    |-- value: double (nullable = true)
```
尚方寶劍之說(shuō)
2021-07-09 14:06:32