Posted to user@spark.apache.org by tleilaxu <tl...@gmail.com> on 2020/05/11 21:18:27 UTC

GroupState limits

Hi,
I am tracking states in my Spark streaming application with
MapGroupsWithStateFunction described here:
https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/streaming/GroupState.html
What are the limiting factors on the number of states a job can track at
the same time? Is it memory? Could it be a bounded data structure in the
internal implementation? Anything else?
Any input on this would be valuable while I am setting this up and testing
it.

Thanks,
Arnold

Re: GroupState limits

Posted by Srinivas V <sr...@gmail.com>.

If you are asking about the total number of objects the state can hold, that
depends on the executor memory available on your cluster, beyond the
memory required for processing. With the default state store, the state is
kept in executor memory and checkpointed to HDFS, from where it is restored
when needed while processing the next events.
If you maintain a million objects of 20 bytes each, that is 20 MB,
which is pretty reasonable to maintain in an executor allocated a few GB of
memory. But if you need to store heavy objects, you need to do the math.
There is also a cost in transferring this data back and forth to
the HDFS checkpoint location.
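
As a rough sketch of "doing the math", the estimate above can be written
out in a few lines of Python. This is only back-of-the-envelope arithmetic,
not anything Spark provides: the function names, the overhead_factor
parameter (a guess at JVM object and map overhead on top of the raw payload)
and the assumption that keys spread evenly across executors are all
hypothetical, and the 2000-byte "heavy" object is a made-up figure.

```python
MB = 1_000_000  # decimal megabytes, to match the 20 MB figure above


def estimate_state_mb(num_keys: int, bytes_per_state: int,
                      overhead_factor: float = 1.0) -> float:
    """Total size of all tracked state objects, in MB.

    overhead_factor is a hypothetical multiplier for per-object overhead;
    1.0 means "raw payload only", as in the estimate above.
    """
    return num_keys * bytes_per_state * overhead_factor / MB


def per_executor_mb(total_mb: float, num_executors: int) -> float:
    """Share of the state held by each executor, assuming an even key spread."""
    return total_mb / num_executors


# The case from the message: a million objects of 20 bytes each.
light_total = estimate_state_mb(1_000_000, 20)

# A made-up heavier case: 2000-byte objects with an assumed 3x overhead,
# spread over 4 executors.
heavy_total = estimate_state_mb(1_000_000, 2_000, overhead_factor=3.0)
heavy_share = per_executor_mb(heavy_total, 4)

print(light_total, heavy_total, heavy_share)
```

Comparing the per-executor share with the memory actually left over after
processing gives a first idea of whether the state will fit. Measuring on a
test run is still the more reliable check.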

Regards
Srini
