Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 03:59:40 UTC
[jira] [Updated] (SPARK-22046) Streaming State cannot be scalable
[ https://issues.apache.org/jira/browse/SPARK-22046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-22046:
---------------------------------
Labels: bulk-closed (was: )
> Streaming State cannot be scalable
> ----------------------------------
>
> Key: SPARK-22046
> URL: https://issues.apache.org/jira/browse/SPARK-22046
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 2.2.0
> Environment: OS: Amazon Linux
> Streaming Source: Kafka 0.10
> VM: AWS EC2
> Cluster resources: 16 GB per worker, single executor per worker, 8 cores per executor
> Storage: HDFS
> Reporter: danny mor
> Priority: Minor
> Labels: bulk-closed
>
> State cannot be distributed across the cluster.
> When {color:#59afe1}StateStoreRDD{color}'s {color:#59afe1}getPreferredLocations{color} is called, it
> creates a {color:#59afe1}StateStoreId(checkpointLocation, operatorId, partition.index){color} and
> sends it to the {color:#59afe1}StateStoreCoordinator{color}, which holds a hash map of {color:#59afe1}StateStoreId{color} to {color:#59afe1}ExecutorCacheTaskLocation{color} and returns the executorId if the store is cached.
> The operatorId is generated once every batch in the {color:#59afe1}IncrementalExecution{color} instance,
> but it is almost always 0 because {color:#59afe1}IncrementalExecution{color} is instantiated anew for each batch.
> The partition index is bounded by the configured value of {color:#14892c}"spark.sql.shuffle.partitions"{color} (in my case the default, 200),
> so the cache is limited to 200 entries, regardless of the keys themselves.
> When new executors are added to the cluster and new keys appear in the streaming data, the distribution of state is not affected, because the {color:#59afe1}StateStoreId{color} does not take either of those variables into account.
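
The behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark's actual code: the Coordinator class, the run_batches helper, the round-robin first placement, and the checkpoint path are all hypothetical, and the model assumes the scheduler always honours the cached location. It shows that when the lookup key consists only of the checkpoint location, operator id and partition index, the location map never grows past the number of shuffle partitions, and executors added later never receive state.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class StateStoreId:
    # Only the three fields named in the report; no data key is part of the id.
    checkpoint_location: str
    operator_id: int
    partition_id: int


class Coordinator:
    """Hypothetical stand-in for a StateStoreId -> executor location map."""

    def __init__(self) -> None:
        self.locations: Dict[StateStoreId, str] = {}

    def report_active(self, store_id: StateStoreId, executor_id: str) -> None:
        self.locations[store_id] = executor_id

    def get_location(self, store_id: StateStoreId) -> Optional[str]:
        return self.locations.get(store_id)


def run_batches(num_batches: int, shuffle_partitions: int,
                executors: List[str], coordinator: Coordinator) -> None:
    for _ in range(num_batches):
        for partition in range(shuffle_partitions):
            # operator_id is fixed at 0, as the report observes in practice.
            store_id = StateStoreId("/checkpoint/state", 0, partition)
            if coordinator.get_location(store_id) is None:
                # Assumption: first placement is round-robin over the
                # executors that exist at that moment.
                chosen = executors[partition % len(executors)]
                coordinator.report_active(store_id, chosen)
            # Otherwise the cached executor is reused (assumed always honoured).


coordinator = Coordinator()
run_batches(3, 200, ["exec-0", "exec-1"], coordinator)
# Two more executors join the cluster, then more batches run.
run_batches(3, 200, ["exec-0", "exec-1", "exec-2", "exec-3"], coordinator)

print(len(coordinator.locations))                  # 200
print(sorted(set(coordinator.locations.values()))) # ['exec-0', 'exec-1']
```

In this model the number of entries equals the number of shuffle partitions no matter how many distinct keys flow through, and the two executors added later hold no state. A real cluster may behave differently, for example when an executor is lost or a locality preference is not honoured.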