
Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/11/16 22:50:01 UTC

[jira] [Assigned] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

     [ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18475:
------------------------------------

    Assignee: Apache Spark

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-18475
>                 URL: https://issues.apache.org/jira/browse/SPARK-18475
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.0.2, 2.1.0
>            Reporter: Burak Yavuz
>            Assignee: Apache Spark
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we shouldn't be able to increase parallelism further, i.e. have multiple Spark tasks reading from the same Kafka TopicPartition.
> This means that we won't be able to use the "CachedKafkaConsumer" for its intended purpose (being cached) in this use case, but the extra overhead is worth it to handle data skew and increase parallelism, especially in ETL use cases.
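
To make the idea concrete, here is a minimal sketch of how one TopicPartition's offset range could be divided among several tasks. It is an illustration only, not Spark's implementation and not necessarily how this issue will be resolved. The names `OffsetRange`, `split_range`, and `max_records_per_task` are hypothetical and chosen for this example.

```python
from typing import List, NamedTuple


class OffsetRange(NamedTuple):
    """Hypothetical description of the work assigned to one task."""
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def split_range(r: OffsetRange, max_records_per_task: int) -> List[OffsetRange]:
    """Split one TopicPartition range into contiguous sub-ranges.

    Each sub-range covers at most `max_records_per_task` offsets, so a
    skewed partition yields several tasks instead of one.
    """
    if max_records_per_task <= 0:
        raise ValueError("max_records_per_task must be positive")

    size = r.until_offset - r.from_offset
    if size <= 0:
        return []  # nothing to read from this partition

    # Ceiling division: number of tasks needed for this partition.
    num_tasks = -(-size // max_records_per_task)

    pieces = []
    for i in range(num_tasks):
        # Integer arithmetic keeps the pieces contiguous and nearly equal.
        start = r.from_offset + (size * i) // num_tasks
        end = r.from_offset + (size * (i + 1)) // num_tasks
        pieces.append(OffsetRange(r.topic, r.partition, start, end))
    return pieces


if __name__ == "__main__":
    skewed = OffsetRange("events", 0, 0, 1000)
    for piece in split_range(skewed, 300):
        print(piece)
```

With a split like this, several tasks would read from the same TopicPartition at different starting offsets, so a consumer cached per partition could not simply be reused across them. That is the extra overhead the description refers to.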


