Posted to issues@spark.apache.org by "Bram van den Akker (Jira)" <ji...@apache.org> on 2020/10/03 13:33:00 UTC

[jira] [Created] (SPARK-33060) approxSimilarityJoin in Structured Stream causes state to explode in size

Bram van den Akker created SPARK-33060:
------------------------------------------

             Summary: approxSimilarityJoin in Structured Stream causes state to explode in size
                 Key: SPARK-33060
                 URL: https://issues.apache.org/jira/browse/SPARK-33060
             Project: Spark
          Issue Type: Bug
          Components: ML, PySpark, Structured Streaming
    Affects Versions: 3.0.0
            Reporter: Bram van den Akker


I'm writing a PySpark application that joins a static and a streaming dataframe using the approxSimilarityJoin function from the ML package. Because of the high volume of data, we need to apply a watermark so that only a minimal amount of state is preserved. However, the [approxSimilarityJoin Scala code contains a `distinct` action|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala#L289] right after it joins the two datasets. This `distinct` introduces a stateful deduplication step that keeps state around to account for late-arriving data.

Watermarks set in the PySpark code are ignored, so the state keeps accumulating in size.

My suspicion is that the watermark is lost somewhere in the communication between Python and Scala.

I've created [this Stackoverflow question|https://stackoverflow.com/questions/64157104/stream-static-join-without-aggregation-still-results-in-accumulating-spark-state] earlier this week, but after more investigation this really seems like a bug rather than a user error.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org