Posted to user@spark.apache.org by Anton Puzanov <an...@gmail.com> on 2018/01/22 08:59:12 UTC

Using window function works extremely slowly

I am trying to use the Spark SQL built-in window function:
https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/functions.html#window(org.apache.spark.sql.Column,%20java.lang.String)

I run it with step = 1 second and window = 3 minutes (a ratio of 180), and it
runs extremely slowly compared to other approaches (join & filter, for example).
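For context, that ratio is likely the cause of the slowdown: with a 3-minute window sliding every 1 second, each record is duplicated into window/step = 180 overlapping windows before the groupBy, so the pre-aggregation data is blown up roughly 180x. A quick plain-Java sanity check of that ratio (a sketch, not Spark code):

```java
import java.time.Duration;

public class WindowRatio {
    // Number of sliding windows a single event falls into:
    // windowDuration / slideDuration (assuming the slide divides the window evenly).
    static long windowsPerEvent(Duration window, Duration slide) {
        return window.toMillis() / slide.toMillis();
    }

    public static void main(String[] args) {
        long ratio = windowsPerEvent(Duration.ofMinutes(3), Duration.ofSeconds(1));
        System.out.println(ratio); // 180: every input row appears in 180 windows
    }
}
```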

Example code:
Dataset:
+-----+-------------------+
|data |timestamp          |
+-----+-------------------+
|data1|2017-12-28 11:23:10|
|data1|2017-12-28 11:23:11|
|data1|2017-12-28 11:23:19|
|data2|2017-12-28 11:23:13|
|data2|2017-12-28 11:23:14|
+-----+-------------------+
There is also a third column, features, which is not shown here.

Code:

private static final String TIME_STEP_STRING = "1 seconds";
private static final String TIME_WINDOW_STRING = "3 minutes";

// Sliding window: 3-minute windows, starting every 1 second.
Column slidingWindow = functions.window(
        data.col("timestamp"), TIME_WINDOW_STRING, TIME_STEP_STRING);

Dataset<Row> data2 = data.withColumn("slide", slidingWindow);

// Group by (window, data) and collect the distinct feature values per group.
Dataset<Row> finalRes = data2
        .groupBy(slidingWindow, data2.col("data"))
        .agg(functions.collect_set("features").as("feature_set"))
        .cache();
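The join & filter alternative mentioned above avoids materializing one exploded row per (event, window) pair inside a single groupBy. Its core idea can be sketched in plain Java (the method name and signature below are hypothetical, just to illustrate the window-membership condition that the join's filter would express):

```java
import java.util.ArrayList;
import java.util.List;

public class JoinFilterSketch {
    // Enumerate candidate window starts (aligned to the slide step) and keep
    // those whose [start, start + window) range contains the event timestamp.
    // In the join & filter approach, this membership test becomes the join
    // condition between the events table and a table of window starts.
    static List<Long> windowStartsFor(long tsMillis, long windowMillis, long slideMillis) {
        List<Long> starts = new ArrayList<>();
        long lastStart = (tsMillis / slideMillis) * slideMillis; // latest window containing ts
        for (long s = lastStart - windowMillis + slideMillis; s <= lastStart; s += slideMillis) {
            if (s <= tsMillis && tsMillis < s + windowMillis) {
                starts.add(s);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // With window = 3 min and slide = 1 s, an event belongs to 180 windows.
        System.out.println(windowStartsFor(10_000L, 180_000L, 1_000L).size()); // 180
    }
}
```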


Am I using it wrong? The situation is so bad that I eventually get:
java.lang.OutOfMemoryError: Java heap space