首頁猿問 Spark：在數(shù)據(jù)幀中反向直到滿足條件

Spark：在數(shù)據(jù)幀中反向直到滿足條件

Java

函數(shù)式編程 2022-07-06 18:57:50

我有以下格式的數(shù)據(jù)框：+----------+-------+----------+---------+-----------------------+---------+---------+|rownum |viewid |skillid |parentId |post_timestamp |is_skill |column A |+----------+-------+----------+---------+-----------------------+---------+---------+|1 |251 |b |xyz12 |2019-01-31 09:24:02.868|true |abcde ||2 |251 |b |abc34 |2019-01-31 10:24:02.868|false |453aw ||3 |251 |b |abc34 |2019-01-31 11:24:02.868|false |abcde ||4 |94 |a |ghi23 |2019-01-31 02:28:05.107|false |bbbbb ||5 |94 |a |yui67 |2019-01-31 09:06:57.976|true |nnnn ||6 |94 |a |qwe12 |2019-01-31 09:24:02.868|false |2n21q ||7 |94 |a |qwe12 |2019-01-31 10:06:57.976|false |nnnnq ||8 |94 |a |rty87 |2019-01-31 15:07:57.976|true |1234 ||9 |94 |a |bnm22 |2019-01-31 16:28:05.107|true |1234 ||10 |94 |a |bnm22 |2019-01-31 17:28:05.107|true |6789 ||11 |94 |b |tyu12 |2019-01-31 09:24:02.868|true |6789 |+----------+-------+----------+---------+-----------------------+---------+---------+對于一組viewidand skillid，如果當前行parentId不等于前一行parentId ，則在該組中找到最新的技能 ID 值為 true 的行，并檢查當前行的 columnA 值不等于該行的 columnA 值。Column matchedParentId = df.col("parentId").$eq$eq$eq(functions.lag("parentId",1);```Now how can I go back to the dataframe until skillId is true? I guess going back would be doable as the dataframe is ordered by timestamp.

查看完整描述

2 回答

溫溫醬

TA貢獻1752條經(jīng)驗獲得超4個贊

我使用 Scala，但我想出的解決方案是

- 在 parent_Id 不等于前一個 parent_Id 的行之前，使用窗口函數(shù)查找 is_skill = true 的最后一行的行號 - 自連接數(shù)據(jù)框以匹配行

期望的輸出如下？

+------+------+-------+--------+--------------------+--------+--------+---------------+--------+

+------+------+-------+--------+--------------------+--------+--------+---------------+--------+

| 1| 251| b| xyz12|20190131 09:24:02...| true| abcde| null| true|

| 2| 251| b| abc34|20190131 10:24:02...| false| 453aw| false| false|

| 3| 251| b| abc34|20190131 11:24:02...| false| abcde| true| true|

| 5| 94| a| yui67|20190131 09:06:57...| true| nnnn| false| true|

| 6| 94| a| qwe12|20190131 09:24:02...| false| 2n21q| false| false|

| 7| 94| a| qwe12|20190131 10:06:57...| false| nnnnq| true| false|

| 8| 94| a| rty87|20190131 15:07:57...| true| 1234| false| true|

| 9| 94| a| bnm22|20190131 16:28:05...| true| 1234| false| true|

| 10| 94| a| bnm22|20190131 17:28:05...| true| 6789| true| true|

| 11| 94| b| tyu12|20190131 09:24:02...| true| 6789| null| true|

+------+------+-------+--------+--------------------+--------+--------+---------------+--------+

這是代碼：

import org.apache.spark.sql.expressions.Window

import org.apache.spark.sql.functions

import spark.implicits._

val df = Seq((1,251 ,"b" ,"xyz12" ,"20190131 09:24:02.868",true ,"abcde"),

(2 ,251 ,"b" ,"abc34" ,"20190131 10:24:02.868", false ,"453aw"),

(3 ,251 ,"b" ,"abc34" ,"20190131 11:24:02.868", false ,"abcde"),

(4 ,94 ,"a" ,"ghi23" ,"20190131 02:28:05.107", false ,"bbbbb"),

(5 ,94 ,"a" ,"yui67" ,"20190131 09:06:57.976", true ,"nnnn"),

(6 ,94 ,"a" ,"qwe12" ,"20190131 09:24:02.868", false ,"2n21q"),

(7 ,94 ,"a" ,"qwe12" ,"20190131 10:06:57.976", false ,"nnnnq"),

(8 ,94 ,"a" ,"rty87" ,"20190131 15:07:57.976", true ,"1234"),

(9 ,94 ,"a" ,"bnm22" ,"20190131 16:28:05.107", true ,"1234"),

(10 ,94 ,"a" ,"bnm22" ,"20190131 17:28:05.107",true ,"6789"),

(11 ,94 ,"b" ,"tyu12" ,"20190131 09:24:02.868",true ,"6789")).

toDF("rownum", "viewid", "skillid", "parentId" , "post_timestamp", "is_skill", "column A")

val w = Window.partitionBy("viewid", "skillid").orderBy("post_timestamp")

val df2 = df.withColumn("matchedParentId", lag($"parentId", 1).over(w).equalTo($"parentId")).

withColumn("rank", rank.over(w)).withColumn("is_skill_int", when($"is_skill", 1).otherwise(0)).

withColumn("test", max($"is_skill_int" * $"rank").over(w))

val df3 = df2.as("df_left").

join(df2.as("df_right"), $"df_left.viewid".equalTo($"df_right.viewid").

and($"df_left.skillid".equalTo($"df_right.skillid")).

and($"df_left.rank".equalTo($"df_right.test"))).

withColumn("isAEqual", $"df_left.column A".equalTo($"df_right.column A")).

select("df_right.rownum", "df_right.viewid", "df_right.skillid", "df_right.parentId", "df_right.post_timestamp", "df_right.is_skill", "df_right.column A", "df_right.matchedParentId", "isAEqual").

orderBy("rownum")

df3.show

反對回復(fù) 2022-07-06