Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/02/13 18:03:55 UTC

[GitHub] taosaildrone edited a comment on issue #23762: [SPARK-21492][SQL][WIP] Memory leak in SortMergeJoin

URL: https://github.com/apache/spark/pull/23762#issuecomment-463303175
 
 
   There's a task completion listener, but 1) we could hit an OOME before the task completes, or 2) we could hurt performance by holding unnecessary memory and causing a bunch of unneeded spills.
   It's holding onto memory that is no longer needed, reducing available memory to the point that the application fails when it tries to allocate memory that's actually needed. I would classify that as a memory leak.
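   
   To make the lifecycle concrete, here is a hypothetical, much-simplified sketch in plain Python (TaskMemoryPool, SorterLikeBuffer, and the listener list are illustrative stand-ins, not Spark's actual classes or APIs): when buffered memory is released only by a task-completion listener, it stays reserved for the rest of the task, so a later allocation in the same task can fail even though the join no longer needs the buffer.
   
```python
class TaskMemoryPool:
    """Tracks bytes reserved by operators within a single task (illustrative only)."""
    def __init__(self, limit):
        self.limit = limit
        self.reserved = 0

    def acquire(self, nbytes):
        if self.reserved + nbytes > self.limit:
            raise MemoryError("cannot acquire %d bytes" % nbytes)
        self.reserved += nbytes

    def release(self, nbytes):
        self.reserved -= nbytes


class SorterLikeBuffer:
    """Buffers one side of the join and holds its memory until cleanup() is called."""
    def __init__(self, pool):
        self.pool = pool
        self.held = 0

    def add_rows(self, nbytes):
        self.pool.acquire(nbytes)
        self.held += nbytes

    def cleanup(self):
        self.pool.release(self.held)
        self.held = 0


pool = TaskMemoryPool(limit=100)
buf = SorterLikeBuffer(pool)

# Today, cleanup is deferred to listeners that run only when the task ends
# (never reached in this sketch).
task_completion_listeners = [buf.cleanup]

buf.add_rows(80)   # the join buffers one side of its input

# The join is done with `buf`, but its 80 bytes stay reserved, so a later
# allocation in the same task fails -- the OOME seen in practice:
try:
    pool.acquire(40)
except MemoryError as err:
    print("allocation failed:", err)

# Releasing the buffer eagerly, as soon as the join no longer needs it,
# frees the memory and the later allocation succeeds:
buf.cleanup()
pool.acquire(40)
```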
   
   Something as simple as joining a DF with 1000 rows to another DF with just one or two rows (with a single overlapping id) will cause an OOME (both locally and on a cluster) and make the job fail:
   
   An example run locally with ./bin/pyspark --master local[10] (tested on 2.4.0 and the 3.0.0 master branch):
```python
from pyspark.sql.functions import rand, col

# Prefer sort-merge join and disable broadcast joins so SMJ is actually used
spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# r1: 1000 rows with ids 1..1000; r2: a single row with id 1000
r1 = spark.range(1, 1001).select(col("id").alias("timestamp1"))
r1 = r1.withColumn('value', rand())
r2 = spark.range(1000, 1001).select(col("id").alias("timestamp2"))
r2 = r2.withColumn('value2', rand())

# Inner join on the one overlapping id, coalesced to a single partition
joined = r1.join(r2, r1.timestamp1 == r2.timestamp2, "inner")
joined = joined.coalesce(1)
joined.explain()
joined.show()
```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org