You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@beam.apache.org by Marco Costantini <ma...@rocketbnk.com> on 2021/09/24 01:00:48 UTC

FileIO bug?

Hi team,
I'm experiencing, what I think is, a bug with FileIO. I've got a very
minimal pipeline which reproduces it:

```
    pipeline
 .apply(FileIO.`match`.filepattern(String.format("%s/**/*",location)).continuously(Duration.standardSeconds(100),
Watch.Growth.never() ))
      .apply(FileIO.readMatches)
      .getPipeline
      .run
      .waitUntilFinish()
```

What happens is that, in my logging, it polls far more often than
requested. The logs look like this:

```
2021-09-23 23:08:09,570 INFO  org.apache.beam.sdk.transforms.Watch
                [] - s3://bucket/raw-events//**/* - will resume polling in
100000 ms.
                                                 2021-09-23 23:08:09,633
INFO  org.apache.beam.sdk.transforms.Watch                         [] -
s3://bucket/raw-events//**/* -
 current round of polling took 62 ms and returned 0 results, of which 0
were new.
2021-09-23 23:08:09,634 INFO  org.apache.beam.sdk.transforms.Watch
                [] - s3://bucket/raw-events//**/* - will resume polling in
100000 ms.
                                                 2021-09-23 23:08:09,701
INFO  org.apache.beam.sdk.transforms.Watch                         [] -
s3://bucket/raw-events//**/* -
 current round of polling took 66 ms and returned 0 results, of which 0
were new.
```

Rather than once every 100 seconds, it polls many times PER second.
Regardless of what I change the duration too, it seems to poll at this same
frequency.

Some technical aspects of my project:
- Beam 2.32
- Flink 1.12.2
- FlinkRunner 1-12
- building on java 8

Any ideas or help is greatly appreciated. Please and thank you,
Marco.

Re: FileIO bug?

Posted by Marco Costantini <ma...@rocketbnk.com>.
I have made some recent findings.

To answer your question, it does not happen when using a DirectRunner. So I
focused more on our FlinkRunner's config.

I found that adding this config alleviates the problem:
optionsWithFlinkOptions.setCheckpointingInterval(100)

This configuration was not set at all. If someone understands why this
configuration affects the `FileIO.continuous` duration, I'd love to know.

Please and thank you,
Marco.

On Mon, Sep 27, 2021 at 1:52 PM Chamikara Jayalath <ch...@google.com>
wrote:

> This does sound like a bug, more specifically regarding timer invocations
> of the Flink runner.
> Do you run into the same issue with other runners ?
>
> On Thu, Sep 23, 2021 at 6:01 PM Marco Costantini <
> marco.costantini@rocketbnk.com> wrote:
>
>> Hi team,
>> I'm experiencing, what I think is, a bug with FileIO. I've got a very
>> minimal pipeline which reproduces it:
>>
>> ```
>>     pipeline
>>  .apply(FileIO.`match`.filepattern(String.format("%s/**/*",location)).continuously(Duration.standardSeconds(100),
>> Watch.Growth.never() ))
>>       .apply(FileIO.readMatches)
>>       .getPipeline
>>       .run
>>       .waitUntilFinish()
>> ```
>>
>> What happens is that, in my logging, it polls far more often than
>> requested. The logs look like this:
>>
>> ```
>> 2021-09-23 23:08:09,570 INFO  org.apache.beam.sdk.transforms.Watch
>>                   [] - s3://bucket/raw-events//**/* - will resume polling
>> in 100000 ms.
>>                                                    2021-09-23 23:08:09,633
>> INFO  org.apache.beam.sdk.transforms.Watch                         [] -
>> s3://bucket/raw-events//**/* -
>>  current round of polling took 62 ms and returned 0 results, of which 0
>> were new.
>> 2021-09-23 23:08:09,634 INFO  org.apache.beam.sdk.transforms.Watch
>>                 [] - s3://bucket/raw-events//**/* - will resume polling in
>> 100000 ms.
>>                                                  2021-09-23 23:08:09,701
>> INFO  org.apache.beam.sdk.transforms.Watch                         [] -
>> s3://bucket/raw-events//**/* -
>>  current round of polling took 66 ms and returned 0 results, of which 0
>> were new.
>> ```
>>
>> Rather than once every 100 seconds, it polls many times PER second.
>> Regardless of what I change the duration too, it seems to poll at this same
>> frequency.
>>
>> Some technical aspects of my project:
>> - Beam 2.32
>> - Flink 1.12.2
>> - FlinkRunner 1-12
>> - building on java 8
>>
>> Any ideas or help is greatly appreciated. Please and thank you,
>> Marco.
>>
>

Re: FileIO bug?

Posted by Chamikara Jayalath <ch...@google.com>.
This does sound like a bug, more specifically regarding timer invocations
of the Flink runner.
Do you run into the same issue with other runners ?

On Thu, Sep 23, 2021 at 6:01 PM Marco Costantini <
marco.costantini@rocketbnk.com> wrote:

> Hi team,
> I'm experiencing, what I think is, a bug with FileIO. I've got a very
> minimal pipeline which reproduces it:
>
> ```
>     pipeline
>  .apply(FileIO.`match`.filepattern(String.format("%s/**/*",location)).continuously(Duration.standardSeconds(100),
> Watch.Growth.never() ))
>       .apply(FileIO.readMatches)
>       .getPipeline
>       .run
>       .waitUntilFinish()
> ```
>
> What happens is that, in my logging, it polls far more often than
> requested. The logs look like this:
>
> ```
> 2021-09-23 23:08:09,570 INFO  org.apache.beam.sdk.transforms.Watch
>                 [] - s3://bucket/raw-events//**/* - will resume polling in
> 100000 ms.
>                                                  2021-09-23 23:08:09,633
> INFO  org.apache.beam.sdk.transforms.Watch                         [] -
> s3://bucket/raw-events//**/* -
>  current round of polling took 62 ms and returned 0 results, of which 0
> were new.
> 2021-09-23 23:08:09,634 INFO  org.apache.beam.sdk.transforms.Watch
>                 [] - s3://bucket/raw-events//**/* - will resume polling in
> 100000 ms.
>                                                  2021-09-23 23:08:09,701
> INFO  org.apache.beam.sdk.transforms.Watch                         [] -
> s3://bucket/raw-events//**/* -
>  current round of polling took 66 ms and returned 0 results, of which 0
> were new.
> ```
>
> Rather than once every 100 seconds, it polls many times PER second.
> Regardless of what I change the duration too, it seems to poll at this same
> frequency.
>
> Some technical aspects of my project:
> - Beam 2.32
> - Flink 1.12.2
> - FlinkRunner 1-12
> - building on java 8
>
> Any ideas or help is greatly appreciated. Please and thank you,
> Marco.
>