Posted to issues@spark.apache.org by "Karthik Palaniappan (JIRA)" <ji...@apache.org> on 2019/06/01 16:09:00 UTC

[jira] [Commented] (SPARK-24815) Structured Streaming should support dynamic allocation

    [ https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853745#comment-16853745 ] 

Karthik Palaniappan commented on SPARK-24815:
---------------------------------------------

Just to clarify: does Spark allow multiple tasks per Kafka partition? This doc implies the mapping is 1:1: [https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html]. Batch dynamic allocation gives you enough executors to process all tasks in the running stage in parallel. I assume that if you end up with too much data per partition, your only recourse is to increase the number of Kafka partitions.
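
To make the question concrete, here is a minimal Structured Streaming read from Kafka (a sketch only; the broker address and topic name are made up). My understanding is that, by default, each Kafka partition becomes one Spark partition, and hence one task, per micro-batch:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("kafka-task-mapping-sketch")   // hypothetical app name
      .getOrCreate()

    // Hypothetical broker and topic, for illustration only.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()

    // If the "events" topic has 8 partitions, I would expect the Kafka read
    // stage of each micro-batch to contain 8 tasks; getting more read
    // parallelism would mean adding Kafka partitions (or repartitioning
    // after the load).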

Also, does Spark not handle state rebalancing today? In other words, is SS already fault tolerant to node failures?

Do you have any good references (JIRAs, design docs) on how Spark stores streaming state internally? I see plenty of blogs and articles on how to do stateful processing, but not on how it works under the hood.

> Structured Streaming should support dynamic allocation
> ------------------------------------------------------
>
>                 Key: SPARK-24815
>                 URL: https://issues.apache.org/jira/browse/SPARK-24815
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler, Structured Streaming
>    Affects Versions: 2.3.1
>            Reporter: Karthik Palaniappan
>            Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing containers to match the actual workload. On multi-tenant clusters, it ensures that a Spark job is taking no more resources than necessary. In cloud environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured streaming job, the batch dynamic allocation algorithm kicks in. It requests more executors if the task backlog reaches a certain size, and removes executors if they sit idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a particular implementation in SparkContext.scala (this should be a separate JIRA).
> 2) We should make a structured streaming algorithm that's separate from the batch algorithm. Eventually, continuous processing might need its own algorithm.
> 3) Spark should print a warning if you run a structured streaming job when Core's dynamic allocation is enabled.
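
For reference, the batch algorithm described in the quoted issue is driven by a handful of spark.dynamicAllocation.* settings. A hedged sketch of how they might be set on a session, with purely illustrative values:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("dynamic-allocation-sketch")                      // hypothetical app name
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true")           // so shuffle files outlive removed executors
      // Scale up when tasks have been backlogged for this long:
      .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
      // Scale down when an executor has been idle for this long:
      .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
      .config("spark.dynamicAllocation.minExecutors", "2")
      .config("spark.dynamicAllocation.maxExecutors", "20")
      .getOrCreate()

These knobs are tuned around task backlog and executor idleness, which is exactly why they map poorly onto a structured streaming job's notion of keeping up with its source.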


