首頁猿問 SparkSQL：如何處理用戶定義...

SparkSQL：如何處理用戶定義函數(shù)中的空值？

源碼算法與數(shù)據(jù)結(jié)構(gòu)

MYYA 2019-11-18 13:15:44

給定表1，其中一列“ x”為String類型。我想用“ y”列創(chuàng)建表2，該列是“ x”中給出的日期字符串的整數(shù)表示。重要的是將null值保留在“ y”列中。表1（數(shù)據(jù)框df1）：+----------+| x|+----------+|2015-09-12||2015-09-13|| null|| null|+----------+root |-- x: string (nullable = true)表2（資料框df2）：+----------+--------+ | x| y|+----------+--------+| null| null|| null| null||2015-09-12|20150912||2015-09-13|20150913|+----------+--------+root |-- x: string (nullable = true) |-- y: integer (nullable = true)用戶定義的函數(shù)（udf）將“ x”列的值轉(zhuǎn)換為“ y”列的值是：val extractDateAsInt = udf[Int, String] ( (d:String) => d.substring(0, 10) .filterNot( "-".toSet) .toInt )并且有效，無法處理空值。即使，我可以做類似的事情val extractDateAsIntWithNull = udf[Int, String] ( (d:String) => if (d != null) d.substring(0, 10).filterNot( "-".toSet).toInt else 1 )我發(fā)現(xiàn)沒有辦法null通過udfs “產(chǎn)生” 值（當(dāng)然，因?yàn)镮nts不能null）。我當(dāng)前用于創(chuàng)建df2的解決方案（表2）如下：// holds data of table 1 val df1 = ... // filter entries from df1, that are not nullval dfNotNulls = df1.filter(df1("x") .isNotNull) .withColumn("y", extractDateAsInt(df1("x"))) .withColumnRenamed("x", "right_x")// create df2 via a left join on df1 and dfNotNull having val df2 = df1.join( dfNotNulls, df1("x") === dfNotNulls("right_x"), "leftouter" ).drop("right_x")問題：當(dāng)前的解決方案似乎很麻煩（并且可能無法有效地提高性能）。有沒有更好的辦法？@ Spark-developers：是否有NullableInt計(jì)劃/可用的類型，以便可以使用以下udf（請參見代碼摘錄）？代碼摘錄val extractDateAsNullableInt = udf[NullableInt, String] ( (d:String) => if (d != null) d.substring(0, 10).filterNot( "-".toSet).toInt else null )

查看完整描述

3 回答

守著星空守著你

TA貢獻(xiàn)1799條經(jīng)驗(yàn) 獲得超8個(gè)贊

這是Option派上用場的地方：

val extractDateAsOptionInt = udf((d: String) => d match {

case null => None

case s => Some(s.substring(0, 10).filterNot("-".toSet).toInt)

})

或在一般情況下使其更加安全：

import scala.util.Try

val extractDateAsOptionInt = udf((d: String) => Try(

d.substring(0, 10).filterNot("-".toSet).toInt

).toOption)

一切歸功于德米特里Selivanov誰已經(jīng)指出，這種解決方案為（失蹤？）編輯這里。

另一種方法是null在UDF之外處理：

import org.apache.spark.sql.functions.{lit, when}

import org.apache.spark.sql.types.IntegerType

val extractDateAsInt = udf(

(d: String) => d.substring(0, 10).filterNot("-".toSet).toInt

)

df.withColumn("y",

when($"x".isNull, lit(null))

.otherwise(extractDateAsInt($"x"))

.cast(IntegerType)

)

反對回復(fù) 2019-11-18

嚕嚕噠

TA貢獻(xiàn)1784條經(jīng)驗(yàn) 獲得超7個(gè)贊

我創(chuàng)建了以下代碼，以使用戶定義的函數(shù)可用，以處理所述的空值。希望對別人有幫助！

/**

* Set of methods to construct [[org.apache.spark.sql.UserDefinedFunction]]s that

* handle `null` values.

*/

object NullableFunctions {

import org.apache.spark.sql.functions._

import scala.reflect.runtime.universe.{TypeTag}

import org.apache.spark.sql.UserDefinedFunction

/**

* Given a function A1 => RT, create a [[org.apache.spark.sql.UserDefinedFunction]] such that

* * if fnc input is null, None is returned. This will create a null value in the output Spark column.

* * if A1 is non null, Some( f(input) will be returned, thus creating f(input) as value in the output column.

* @param f function from A1 => RT

* @tparam RT return type

* @tparam A1 input parameter type

* @return a [[org.apache.spark.sql.UserDefinedFunction]] with the behaviour describe above

*/

def nullableUdf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]): UserDefinedFunction = {

udf[Option[RT],A1]( (i: A1) => i match {

case null => None

case s => Some(f(i))

})

}

/**

* Given a function A1, A2 => RT, create a [[org.apache.spark.sql.UserDefinedFunction]] such that

* * if on of the function input parameters is null, None is returned.

* This will create a null value in the output Spark column.

* * if both input parameters are non null, Some( f(input) will be returned, thus creating f(input1, input2)

* as value in the output column.

* @param f function from A1 => RT

* @tparam RT return type

* @tparam A1 input parameter type

* @tparam A2 input parameter type

* @return a [[org.apache.spark.sql.UserDefinedFunction]] with the behaviour describe above

*/

def nullableUdf[RT: TypeTag, A1: TypeTag, A2: TypeTag](f: Function2[A1, A2, RT]): UserDefinedFunction = {

udf[Option[RT], A1, A2]( (i1: A1, i2: A2) => (i1, i2) match {

case (null, _) => None

case (_, null) => None

case (s1, s2) => Some((f(s1,s2)))

} )

}

反對回復(fù) 2019-11-18

3 回答
0 關(guān)注
1299 瀏覽

關(guān)注

添加回答

舉報(bào)

0/150

提交

取消

使用 Ctrl+D 可將網(wǎng)站添加到書簽

微信客服

購課補(bǔ)貼
聯(lián)系客服咨詢優(yōu)惠詳情

幫助反饋 APP下載

慕課網(wǎng)APP
您的移動(dòng)學(xué)習(xí)伙伴

公眾號(hào)

掃描二維碼
關(guān)注慕課網(wǎng)微信公眾號(hào)

第七色在线视频,2021少妇久久久久久久久久,亚洲欧洲精品成人久久av18,亚洲国产精品特色大片观看完整版,孙宇晨将参加特朗普的晚宴

熱搜

最近搜索清空

SparkSQL：如何處理用戶定義函數(shù)中的空值？

SparkSQL：如何處理用戶定義函數(shù)中的空值？

3 回答

添加回答

SparkSQL：如何處理用戶定義函數(shù)中的空值？

SparkSQL：如何處理用戶定義函數(shù)中的空值？