Posted to issues@spark.apache.org by "Mathieu DESPRIEE (JIRA)" <ji...@apache.org> on 2018/01/25 16:35:00 UTC

[jira] [Created] (SPARK-23220) broadcast hint not applied in a streaming left anti join

Mathieu DESPRIEE created SPARK-23220:
----------------------------------------

             Summary: broadcast hint not applied in a streaming left anti join
                 Key: SPARK-23220
                 URL: https://issues.apache.org/jira/browse/SPARK-23220
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 2.2.1
            Reporter: Mathieu DESPRIEE
         Attachments: Screenshot from 2018-01-25 17-32-45.png

We have a Structured Streaming app doing a left anti join between a stream and a static dataframe. The static dataframe is quite small (a few hundred rows), but the query plan defaults to a sort-merge join.
 
We sometimes need to re-process historical data, so we feed the same app with a FileSource pointing to our S3 storage with all archives. In that situation, the first mini-batch is quite heavy (several hundred thousand input files), and the time spent in the sort-merge join is unacceptable.

I tried to switch to a broadcast join, but Spark still applies a sort-merge join.
{noformat}
ds.join(broadcast(hostnames), Seq("hostname"), "leftanti")
{noformat}

Looks like a bug. Is there another way to force the broadcast?
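
For anyone trying to reproduce this, one way to see which join strategy is chosen is to print the physical plan with explain(). The following is only a minimal sketch, not our production code: the "rate" source, the sample hostnames, and the object name are assumptions standing in for the real stream and the real static dataframe. It prints the plan for the streaming join and, for comparison, for the same join on a plain batch dataframe.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col, concat, lit}

// Hypothetical reproduction harness; names and data are illustrative only.
object BroadcastAntiJoinPlanCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("broadcast-anti-join-plan-check")
      .getOrCreate()
    import spark.implicits._

    // Assumed stand-in for the small static dataframe.
    val hostnames = Seq("host-0", "host-1", "host-2").toDF("hostname")

    // Assumed stand-in for the stream: the built-in "rate" source,
    // mapped to a single "hostname" column.
    val ds = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()
      .select(concat(lit("host-"), (col("value") % 10).cast("string")).as("hostname"))

    // Same join as in the report; print the plan Spark selects for it.
    val joined = ds.join(broadcast(hostnames), Seq("hostname"), "leftanti")
    joined.explain()

    // Same join on a batch dataframe, to compare the chosen strategy.
    val batch = (0 until 10).map(i => s"host-$i").toDF("hostname")
    batch.join(broadcast(hostnames), Seq("hostname"), "leftanti").explain()

    spark.stop()
  }
}
```

Comparing the two printed plans should show whether the hint is dropped only on the streaming path.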





