Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:23:13 UTC

[jira] [Updated] (SPARK-15272) DirectKafkaInputDStream doesn't work with window operation

     [ https://issues.apache.org/jira/browse/SPARK-15272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-15272:
---------------------------------
    Labels: bulk-closed  (was: )

> DirectKafkaInputDStream doesn't work with window operation
> ----------------------------------------------------------
>
>                 Key: SPARK-15272
>                 URL: https://issues.apache.org/jira/browse/SPARK-15272
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams
>    Affects Versions: 1.5.2
>            Reporter: Lubomir Nerad
>            Priority: Major
>              Labels: bulk-closed
>
> Using a Kafka direct {{DStream}} with a simple window operation like:
> {code:java}
> // 10 s window sliding every 1 s; print() is chained onto the windowed stream
> kafkaDStream.window(Durations.milliseconds(10000),
>                     Durations.milliseconds(1000))
>             .print();
> {code}
> with a 1s batch duration either freezes after several seconds or lags terribly (depending on cluster mode).
> This happens when the Kafka brokers are not part of the Spark cluster (they are on different nodes). The {{KafkaRDD}} still reports them as preferred locations. This doesn't seem to be a problem in non-window scenarios, but with a window it conflicts with the delay scheduling algorithm implemented in {{TaskSetManager}}. It either significantly delays (YARN mode) or completely drains (Spark mode) the resource offers with {{TaskLocality.ANY}} that are needed to process tasks with these Kafka-broker-aligned preferred locations. When the delay scheduling algorithm is switched off ({{spark.locality.wait=0}}), the example works correctly.
> I think that the {{KafkaRDD}} should either not report preferred locations when the brokers don't correspond to worker nodes, or allow the reporting of preferred locations to be switched off. Also, it would be good if the delay scheduling algorithm didn't drain / delay offers when tasks have unmatched preferred locations.
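
The interaction described above can be illustrated with a toy model of delay scheduling. This is only a sketch under stated assumptions, not Spark's {{TaskSetManager}}: the class DelaySchedulingSketch, the method accepts, and the host names are hypothetical, and the 3000 ms value is assumed to mirror the default of spark.locality.wait. It shows why a task whose preferred hosts never make an offer is held back until the wait expires, and why a wait of 0 lets it launch immediately.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Toy model of delay scheduling. Hypothetical; NOT Spark's TaskSetManager. */
public class DelaySchedulingSketch {

    /**
     * Decides whether a task may launch on the offered host.
     *
     * @param preferredHosts hosts the task would like to run on (e.g. Kafka brokers)
     * @param offeredHost    host of the executor making the resource offer
     * @param waitedMs       time already spent waiting for a local offer
     * @param localityWaitMs stand-in for spark.locality.wait, in milliseconds
     */
    public static boolean accepts(Set<String> preferredHosts, String offeredHost,
                                  long waitedMs, long localityWaitMs) {
        if (preferredHosts.isEmpty() || preferredHosts.contains(offeredHost)) {
            return true; // no preference, or a local offer: launch right away
        }
        // Non-local offer: accepted only once the locality wait has expired.
        return waitedMs >= localityWaitMs;
    }

    public static void main(String[] args) {
        // Hypothetical broker hosts that are not part of the cluster.
        Set<String> brokers = new HashSet<>(Arrays.asList("kafka-1", "kafka-2"));
        String worker = "worker-1"; // hypothetical executor host

        // Every offer is non-local because no executor runs on a broker host.
        System.out.println(accepts(brokers, worker, 500, 3000));  // false: still waiting
        System.out.println(accepts(brokers, worker, 3000, 3000)); // true: wait expired
        System.out.println(accepts(brokers, worker, 0, 0));       // true: wait disabled
    }
}
```

In this simplified model, a 1s batch interval combined with a multi-second wait per task set is enough for batches to queue up, which is consistent with the lag reported above; the real scheduler tracks several locality levels and is considerably more involved.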


