Posted to issues@spark.apache.org by "Stavros Kontopoulos (JIRA)" <ji...@apache.org> on 2019/06/02 11:41:00 UTC

[jira] [Comment Edited] (SPARK-24815) Structured Streaming should support dynamic allocation

    [ https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853955#comment-16853955 ] 

Stavros Kontopoulos edited comment on SPARK-24815 at 6/2/19 11:40 AM:
----------------------------------------------------------------------

Yes, it is 1-1. What I am trying to say is: if you have N Spark tasks/partitions processed in parallel and these N tasks get "fatter", so you want to do dynamic re-partitioning (on the Spark side only; _repartitioning_ is an _expensive_ operation on the Kafka side, afaik), then you need the processing_time/batch_duration ratio metric, because your backlog task count is zero anyway and otherwise you have no idea what is happening (unless you count the number of offsets per partition). After re-partitioning is done on the Spark side you can fall back to the backlog metric (since now you will have tasks queued), but there is no reason to; you could just keep using the processing_time/batch_duration ratio.
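
A minimal sketch (Scala) of that ratio metric, computed from the public StreamingQueryProgress API; the 10s trigger interval, the 0.9 threshold, and the triggerRepartition() hook are hypothetical, not part of Spark:

    import org.apache.spark.sql.streaming.StreamingQuery

    // Hypothetical hook: whatever mechanism re-partitions / requests executors.
    def triggerRepartition(): Unit = ???

    // Ratio of the last micro-batch's wall-clock time to the trigger interval.
    // "triggerExecution" in durationMs covers the whole micro-batch.
    def processingRatio(query: StreamingQuery, triggerIntervalMs: Long): Option[Double] =
      Option(query.lastProgress).flatMap { p =>
        Option(p.durationMs.get("triggerExecution")).map(_.doubleValue / triggerIntervalMs)
      }

    // The backlog stays at zero while batches still fit inside the interval, so
    // only the ratio reveals that the N tasks got "fatter". 10s is an assumed
    // trigger interval; 0.9 is an arbitrary saturation threshold.
    def maybeScale(query: StreamingQuery): Unit =
      processingRatio(query, triggerIntervalMs = 10000L).foreach { r =>
        if (r > 0.9) triggerRepartition()
      }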

Regarding references, there was a discussion last year: [http://apache-spark-developers-list.1001551.n3.nabble.com/Discussion-Clarification-regarding-Stateful-Aggregations-over-Structured-Streaming-td25941.html#a25942]

There is a project that shows how the APIs can be extended: [https://github.com/chermenin/spark-states/tree/master/src/main/scala/ru/chermenin/spark/sql/execution/streaming/state].

In general, the source of truth is the code, as you know. The related JIRA for the state backend is https://issues.apache.org/jira/browse/SPARK-13809 and the related design doc is here: [https://docs.google.com/document/d/1-ncawFx8JS5Zyfq1HAEGBx56RDet9wfVp_hDM8ZL254]
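
A minimal sketch (Scala) of wiring in such a custom state backend; it assumes the RocksDB-backed provider class from the chermenin/spark-states repository above is on the classpath:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("custom-state-backend")
      // StateStoreProvider implementations are pluggable via this SQL conf
      .config("spark.sql.streaming.stateStore.providerClass",
        "ru.chermenin.spark.sql.execution.streaming.state.RocksDbStateStoreProvider")
      .getOrCreate()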


> Structured Streaming should support dynamic allocation
> ------------------------------------------------------
>
>                 Key: SPARK-24815
>                 URL: https://issues.apache.org/jira/browse/SPARK-24815
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler, Structured Streaming
>    Affects Versions: 2.3.1
>            Reporter: Karthik Palaniappan
>            Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing containers to match the actual workload. On multi-tenant clusters, it ensures that a Spark job is taking no more resources than necessary. In cloud environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured streaming job, the batch dynamic allocation algorithm kicks in. It requests more executors if the task backlog reaches a certain size, and removes executors if they are idle for a certain period of time (the relevant settings are sketched after the list below).
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a particular implementation in SparkContext.scala (this should be a separate JIRA).
> 2) We should make a structured streaming algorithm that's separate from the batch algorithm. Eventually, continuous processing might need its own algorithm.
> 3) Spark should print a warning if you run a structured streaming job when Core's dynamic allocation is enabled.
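
For reference, a minimal sketch (Scala) of the batch-oriented knobs the description above refers to; the values are illustrative, not recommendations:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.dynamicAllocation.enabled", "true")
      // dynamic allocation also requires the external shuffle service
      .config("spark.shuffle.service.enabled", "true")
      // scale up once tasks have been backlogged for this long
      .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
      // remove executors that have been idle for this long
      .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
      .getOrCreate()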



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org