Posted to user@flink.apache.org by sidhartha saurav <si...@gmail.com> on 2019/07/29 22:13:54 UTC

StreamingFileSink part file count reset

Hi,

We are using StreamingFileSink with a custom BucketAssigner and
DefaultRollingPolicy. The custom BucketAssigner is simply a date bucket
assigner. The StreamingFileSink creates part files named
"part-<subtask_number>-<count_of_the_bucket_created_by_that_subtask>". The
count is an integer that increments on each rollover. My questions
are:

1. When does this count reset to 0?
2. Is there a way I can reset this count programmatically? Since we are
using day buckets, we would like the count to reset every day.

We are using Flink 1.8

Thanks
Sidhartha
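
The behaviour described in the question can be sketched with a small toy model. This is plain Python, not Flink code: the class name SubtaskSinkModel, the method roll_new_part and the "%Y-%m-%d" bucket format are assumptions made only for illustration. It models one sink subtask with a day bucket and a single counter that is incremented on every rollover and never reset.

```python
from datetime import datetime, timezone


class SubtaskSinkModel:
    """Toy model of one sink subtask with a single, never-reset part counter."""

    def __init__(self, subtask_index):
        self.subtask_index = subtask_index
        # One counter shared by all buckets of this subtask (assumption of the model).
        self.part_counter = 0

    def roll_new_part(self, event_time):
        """Return the path of the next part file for the day bucket of event_time."""
        bucket_id = event_time.strftime("%Y-%m-%d")  # hypothetical day bucket id
        path = f"{bucket_id}/part-{self.subtask_index}-{self.part_counter}"
        self.part_counter += 1  # incremented on every rollover, never reset
        return path


sink = SubtaskSinkModel(subtask_index=0)
paths = [
    sink.roll_new_part(datetime(2019, 7, 29, 10, 0, tzinfo=timezone.utc)),
    sink.roll_new_part(datetime(2019, 7, 29, 22, 0, tzinfo=timezone.utc)),
    sink.roll_new_part(datetime(2019, 7, 30, 1, 0, tzinfo=timezone.utc)),
]
print(paths)
```

In this model the first file of 2019-07-30 is part-0-2 rather than part-0-0, which matches what the question observes: the count keeps growing across day buckets.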

Re: StreamingFileSink part file count reset

Posted by Biao Liu <mm...@gmail.com>.
Hi Sidhartha,

I don't think you need to worry about this.

Currently the `StreamingFileSink` uses a long to keep this counter. The
maximum value of a long is 9,223,372,036,854,775,807. The counter could only
wrap around if the number of files reached that value, and I don't think that
will happen in practice. As for the max filename length, Linux, for example,
allows 255 characters on most file systems [1]. That is far more than the
19 digits of the largest long value.

[1] https://unix.stackexchange.com/questions/32795/what-is-the-maximum-allowed-filename-and-folder-size-with-ecryptfs

Thanks,
Biao /'bɪ.aʊ/
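
The arithmetic behind this answer can be checked with a short sketch. It is plain Python rather than Flink code, and the subtask index 99999 and the rate of one million part files per second are made-up, deliberately pessimistic assumptions; only the 64-bit maximum and the 255-character limit come from the message above.

```python
LONG_MAX = 2**63 - 1  # largest value of a signed 64-bit long
NAME_MAX = 255        # per-filename limit on most Linux file systems, as cited above


def part_file_name(subtask_index, counter):
    """Build a name following the part-<subtask>-<count> pattern from the thread."""
    return f"part-{subtask_index}-{counter}"


# Longest name for a hypothetical, very large subtask index and the largest counter.
longest = part_file_name(subtask_index=99999, counter=LONG_MAX)

# Years needed to exhaust the counter at a hypothetical 1,000,000 part files per second.
years_to_overflow = LONG_MAX / 1_000_000 / (365 * 24 * 3600)

print(len(str(LONG_MAX)), len(longest), round(years_to_overflow))
```

Even under these assumptions the name stays at a few dozen characters, and the counter would take several hundred thousand years to run out.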



On Fri, Aug 2, 2019 at 12:24 AM sidhartha saurav <si...@gmail.com>
wrote:

> Thank you for the clarification, Haibo and Andrey.
>
> Is there any limit after which the global counter will reset? I mean,
> do we have to worry that the counter may get too long and the part file
> name may exceed the max filename length set by the OS, or is that handled
> by Flink?
>
> Thanks
> Sidhartha
>

Re: StreamingFileSink part file count reset

Posted by sidhartha saurav <si...@gmail.com>.
Thank you for the clarification, Haibo and Andrey.

Is there any limit after which the global counter will reset? I mean,
do we have to worry that the counter may get too long and the part file
name may exceed the max filename length set by the OS, or is that handled
by Flink?

Thanks
Sidhartha

On Tue, Jul 30, 2019, 10:10 AM Andrey Zagrebin <an...@ververica.com> wrote:

> Hi Sidhartha,
>
> This is a general limitation for now, because Flink does not keep a counter
> for each bucket, only a global one.
> Flink assumes that the sink can write to any bucket at any time, so the
> counter is never reset; otherwise a previously written part file number 0
> could be overwritten.
>
> Best,
> Andrey
>

Re: StreamingFileSink part file count reset

Posted by Andrey Zagrebin <an...@ververica.com>.
Hi Sidhartha,

This is a general limitation for now, because Flink does not keep a counter
for each bucket, only a global one.
Flink assumes that the sink can write to any bucket at any time, so the
counter is never reset; otherwise a previously written part file number 0
could be overwritten.

Best,
Andrey
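
The overwrite risk can be illustrated with a toy comparison. This is a plain Python sketch, not Flink code: the function write_parts and its reset_on_bucket_change flag are hypothetical, and each element of the input stands for one rollover in the given bucket. The reset policy shown is exactly the thing Flink does not offer; it is modelled here only to show what could go wrong.

```python
def write_parts(bucket_ids, reset_on_bucket_change):
    """Return the part paths one subtask would write, one per rollover."""
    counter = 0
    last_bucket = None
    paths = []
    for bucket_id in bucket_ids:
        if reset_on_bucket_change and bucket_id != last_bucket:
            counter = 0  # hypothetical reset policy, not a Flink feature
        paths.append(f"{bucket_id}/part-0-{counter}")
        counter += 1
        last_bucket = bucket_id
    return paths


# A late record for 2019-07-29 arrives after the sink has moved on to 2019-07-30.
bucket_ids = ["2019-07-29", "2019-07-30", "2019-07-29"]

with_reset = write_parts(bucket_ids, reset_on_bucket_change=True)
without_reset = write_parts(bucket_ids, reset_on_bucket_change=False)

# Number of paths that would be written more than once under each policy.
collisions_with_reset = len(with_reset) - len(set(with_reset))
collisions_without_reset = len(without_reset) - len(set(without_reset))
print(collisions_with_reset, collisions_without_reset)
```

With the reset, the late write lands on 2019-07-29/part-0-0 a second time; with a counter that only ever grows, every path in this model is distinct.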

On Tue, Jul 30, 2019 at 7:01 AM Haibo Sun <su...@163.com> wrote:

> Hi Sidhartha,
>
> Currently, the part counter is never reset to 0, and customizing the part
> filename is not supported either. So I don't think there's any way to reset
> it right now. I guess the reason it can't be reset to 0 is the concern that
> previous parts would be overwritten. Although the bucket id is part of the
> part file path, StreamingFileSink does not know when the bucket id will
> change when a custom BucketAssigner is used.
>
> Best,
> Haibo
>

Re: StreamingFileSink part file count reset

Posted by Haibo Sun <su...@163.com>.
Hi Sidhartha,


Currently, the part counter is never reset to 0, and customizing the part filename is not supported either. So I don't think there's any way to reset it right now. I guess the reason it can't be reset to 0 is the concern that previous parts would be overwritten. Although the bucket id is part of the part file path, StreamingFileSink does not know when the bucket id will change when a custom BucketAssigner is used.


Best,
Haibo
