Posted to user@spark.apache.org by Jan Algermissen <al...@icloud.com> on 2016/01/07 00:13:29 UTC

Problems with too many checkpoint files with Spark Streaming

Hi,

we are running a streaming job that processes about 500 events per 20s batch and uses updateStateByKey to accumulate web sessions (with a 30-minute lifetime).

The checkpoint interval is set to 20 × the batch interval, i.e. 400s.

Cluster size is 8 nodes.
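
For reference, here is a minimal sketch of roughly how the job is wired up (the source, key extraction, and session logic below are illustrative placeholders, not our actual code):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("session-accumulator")
val ssc  = new StreamingContext(conf, Seconds(20))     // 20s batches
ssc.checkpoint("file:///spark-shared/bsna-prod")       // shared GlusterFS mount

val events = ssc.socketTextStream("localhost", 9999)   // placeholder source
val byUser = events.map(line => (line.split(",")(0), line))

// Keep events per key; drop the key after 30 minutes of inactivity.
val sessions = byUser.updateStateByKey[(Long, Seq[String])] {
  (newEvents: Seq[String], state: Option[(Long, Seq[String])]) =>
    val now = System.currentTimeMillis()
    val (lastSeen, acc) = state.getOrElse((now, Seq.empty[String]))
    if (newEvents.nonEmpty) Some((now, acc ++ newEvents))
    else if (now - lastSeen > 30 * 60 * 1000) None     // expire the session
    else Some((lastSeen, acc))
}

// Checkpoint the state DStream every 20 batches (20 x 20s = 400s).
sessions.checkpoint(Seconds(400))

sessions.print()
ssc.start()
ssc.awaitTermination()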

We are having trouble with the number of files and directories created on the shared file system (GlusterFS) - about 100 new directories per second.

Is that the expected order of magnitude for created directories, or should we expect something different?

What might we be doing wrong? Can anyone share a pointer to material that explains the details of checkpointing?

The checkpoint directories have UUIDs as names - is that correct?

Jan

Re: Problems with too many checkpoint files with Spark Streaming

Posted by Jan Algermissen <al...@icloud.com>.
The "UUID" folders contain the following:

ls -al /spark-shared/bsna-prod/792be037-d47a-41f8-ace3-2ee9ed715a1f/rdd-21522/

total 53169
drwxr-xr-x   2 deploy deploy    8192 Jan  7 00:01 .
drwxr-xr-x 148 deploy deploy    4096 Jan  7 00:05 ..
-rw-r--r--   1 root   root      3340 Jan  7 00:01 .part-00000.crc
-rw-r--r--   1 root   root      1972 Jan  7 00:01 .part-00001.crc
-rw-r--r--   1 root   root      4392 Jan  7 00:01 .part-00002.crc
.....
-rw-r--r--   1 root   root      5708 Jan  7 00:01 .part-00098.crc
-rw-r--r--   1 root   root      2416 Jan  7 00:01 .part-00099.crc
-rw-r--r--   1 root   root    426019 Jan  7 00:01 part-00000
-rw-r--r--   1 root   root    251305 Jan  7 00:01 part-00001
-rw-r--r--   1 root   root    560819 Jan  7 00:01 part-00002
-rw-r--r--   1 root   root    279837 Jan  7 00:01 part-00003
....
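
If I read the layout correctly, these look like checkpointed RDD partitions - one part-NNNNN file per partition, plus a .crc checksum file for each. A small sketch that reproduces this layout (paths and sizes are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ckpt-layout").setMaster("local[2]"))
sc.setCheckpointDir("/tmp/ckpt-demo")                   // our real root is the GlusterFS mount
val rdd = sc.parallelize(1 to 100000, numSlices = 100)  // 100 partitions
rdd.checkpoint()
rdd.count()                                             // materializes the RDD and writes the checkpoint
// /tmp/ckpt-demo/<uuid>/rdd-<id>/ now holds part-00000 .. part-00099 (+ .crc files)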

Jan


Re: Problems with too many checkpoint files with Spark Streaming

Posted by Jan Algermissen <al...@icloud.com>.
> On 07 Jan 2016, at 01:36, Tathagata Das <td...@databricks.com> wrote:
> 
> Could you show a sample of the file names? There are multiple things that use UUIDs, so it would be good to see what the hundreds of directories being generated every second are.
> If you are checkpointing every 400s, then there shouldn't be checkpoint directories written every second. They should be written in big bunches every 400s.
> 

This is how it looks today:

drwxr-xr-x 148 deploy deploy  4096 Jan  7 00:05 792be037-d47a-41f8-ace3-2ee9ed715a1f
drwxr-xr-x   4 deploy deploy    36 Jan  7 00:50 63df39fe-86ce-446d-869b-16107e6b2684
drwxr-xr-x   4 deploy deploy    38 Jan  7 15:38 ca188fcb-50d1-4969-be46-4e649c7909bc
drwxr-xr-x   4 deploy deploy    36 Jan  7 16:39 d10e7dc0-ebb7-487b-8d4e-a9626f409abc
-rw-r--r--   1 deploy deploy 15869 Jan  7 17:33 checkpoint-1452184400000
-rw-r--r--   1 deploy deploy 15881 Jan  7 17:33 checkpoint-1452184420000.bk
-rw-r--r--   1 deploy deploy 15873 Jan  7 17:33 checkpoint-1452184420000
-rw-r--r--   1 deploy deploy 15882 Jan  7 17:34 checkpoint-1452184440000.bk
-rw-r--r--   1 deploy deploy 15872 Jan  7 17:34 checkpoint-1452184440000
-rw-r--r--   1 deploy deploy 15882 Jan  7 17:34 checkpoint-1452184460000.bk
-rw-r--r--   1 deploy deploy 15869 Jan  7 17:34 checkpoint-1452184460000
-rw-r--r--   1 deploy deploy 15881 Jan  7 17:34 checkpoint-1452184480000.bk
-rw-r--r--   1 deploy deploy 15872 Jan  7 17:34 checkpoint-1452184480000
drwxr-xr-x   4 deploy deploy    36 Jan  7 17:34 3410c739-6793-4350-b08a-b1d4aae8d412
-rw-r--r--   1 deploy deploy 15881 Jan  7 17:35 checkpoint-1452184500000
drwxr-xr-x   2 deploy deploy  4096 Jan  7 17:35 receivedBlockMetadata

It seems the problem isn't present at the moment.

What is different today is that Spark is keeping up with the input. In recent days we experienced a lot of catch-up processing, and hence delays much longer than the 20s batch interval.

Maybe the delayed batches cause these files?


Also, I now seem to understand that in reprocessing situations the session TTL of 30 minutes (we keep sessions until there has been 30 minutes of inactivity) makes the problem exponentially worse (more batches => more sessions to keep in updateStateByKey => less capacity to process batches => more deferred batches ...).

Does that sound about right?
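
A toy model of that feedback loop (all numbers are made up, just to show the dynamics):

// Toy model: per-batch processing time grows with the amount of session
// state, so once a batch overruns the interval, state stops expiring and
// the delay compounds.
var stateSize = 35000.0      // sessions held by updateStateByKey
var backlog   = 0.0          // accumulated delay in seconds
val batchInterval = 20.0
for (batch <- 1 to 30) {
  val processingTime = 5.0 + stateSize / 2000.0         // grows with state size
  backlog = math.max(0.0, backlog + processingTime - batchInterval)
  if (backlog > 0) stateSize += 2000.0                  // behind => sessions pile up
  println(f"batch $batch%2d  state=$stateSize%8.0f  backlog=$backlog%7.1fs")
}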

Jan

Re: Problems with too many checkpoint files with Spark Streaming

Posted by Tathagata Das <td...@databricks.com>.
Could you show a sample of the file names? There are multiple things that
use UUIDs, so it would be good to see what the hundreds of directories
being generated every second are.
If you are checkpointing every 400s, then there shouldn't be checkpoint
directories written every second. They should be written in big bunches
every 400s.

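If you want to measure the cadence, a rough sketch (using modification times; adjust the path to your checkpoint root):

import java.io.File

val root = new File("/spark-shared/bsna-prod")          // assumed checkpoint root
val dirs = Option(root.listFiles()).getOrElse(Array.empty[File]).filter(_.isDirectory)
dirs.groupBy(_.lastModified() / 1000)                   // bucket by second
  .toSeq.sortBy(_._1)
  .foreach { case (sec, ds) => println(s"$sec -> ${ds.length} dirs") }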