Posted to user@flink.apache.org by Piyush Narang <p....@criteo.com> on 2020/03/06 22:15:18 UTC

Understanding n LIST calls as part of checkpointing

Hi folks,

I was debugging a job that was taking 20-30s to checkpoint data to Azure FS (compared to typically < 5s), and while doing so I noticed something that I'd like to understand a bit better.
Our checkpoint path is as follows: my_user/featureflow/foo-datacenter/cluster_name/my_flink_job/checkpoint/chk-1234

What I noticed was that while taking checkpoints (incremental, using RocksDB) we make a number of List calls to Azure:
my_user/featureflow/foo-datacenter/cluster_name/my_flink_job/checkpoint
my_user/featureflow/foo-datacenter/cluster_name/my_flink_job
my_user/featureflow/foo-datacenter/cluster_name
my_user/featureflow/foo-datacenter
my_user/featureflow
my_user

Each of these calls takes a few seconds, and together they seem to add up to make our checkpoints slow. What I was hoping to understand on the Flink side is whether making these List calls for each parent ‘directory’ / blob all the way to the top is normal / expected.
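
To make the pattern above concrete, here is a minimal Python sketch that enumerates every parent 'directory' of the checkpoint path and estimates the combined latency. It only reproduces the observed sequence of prefixes; it is not how Flink or the Azure filesystem client issues the calls. The helper name and the 3-second per-call figure are assumptions for illustration (the post only says "a few seconds").

```python
from pathlib import PurePosixPath
from typing import List


def ancestor_prefixes(checkpoint_path: str) -> List[str]:
    """Return every parent 'directory' of checkpoint_path, deepest first."""
    # PurePosixPath.parents ends with "." for a relative path; drop it.
    return [str(p) for p in PurePosixPath(checkpoint_path).parents if str(p) != "."]


CHECKPOINT_PATH = (
    "my_user/featureflow/foo-datacenter/cluster_name/"
    "my_flink_job/checkpoint/chk-1234"
)

# Assumption: the post only says each call takes "a few seconds".
ASSUMED_SECONDS_PER_LIST = 3.0

prefixes = ancestor_prefixes(CHECKPOINT_PATH)
for prefix in prefixes:
    print(prefix)

estimated_seconds = len(prefixes) * ASSUMED_SECONDS_PER_LIST
print(f"{len(prefixes)} List calls, roughly {estimated_seconds:.0f}s in total")
```

With six parent prefixes and the assumed per-call latency, the estimate comes out in the same range as the extra checkpoint time described above.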

We are exploring a couple of other angles on our end (potentially flattening the directory / blob structure to reduce the number of these calls, and checking whether the latency on the Azure side is expected), but along with this I was hoping to understand whether this behavior on the Flink side is expected and whether there’s something we could optimize there as well.
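
As a rough illustration of the flattening idea, the sketch below compares how many parent prefixes sit above the checkpoint directory for the current layout and for a flattened one. The flattened path is hypothetical, and the sketch assumes one List call per parent prefix, as observed above. Whether flattening actually helps depends on where the calls originate.

```python
def parent_prefix_count(path: str) -> int:
    """Count the parent 'directories' above the last segment of path."""
    segments = [s for s in path.split("/") if s]
    return max(len(segments) - 1, 0)


nested_path = (
    "my_user/featureflow/foo-datacenter/cluster_name/"
    "my_flink_job/checkpoint/chk-1234"
)
# Hypothetical flattened layout, not taken from the original post.
flattened_path = (
    "my_user/featureflow-foo-datacenter-cluster_name-my_flink_job/"
    "checkpoint/chk-1234"
)

nested_calls = parent_prefix_count(nested_path)
flattened_calls = parent_prefix_count(flattened_path)
print(f"nested: {nested_calls} parent prefixes, flattened: {flattened_calls}")
```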

Thanks,

-- Piyush


Re: Understanding n LIST calls as part of checkpointing

Posted by Piyush Narang <p....@criteo.com>.
Hi Yun,

Thanks for getting back. We’re on a fork of Flink 1.9 (basically 1.9 with some backported fixes from 1.10 and a couple of minor patches) - https://github.com/criteo-forks/flink/tree/criteo-1.9
I’ll check the jira + fix and see if there’s something that was potentially missed.

-- Piyush





Re: Understanding n LIST calls as part of checkpointing

Posted by Yun Tang <my...@live.com>.
Hi Piyush

Which version of Flink are you using? Since Flink 1.5, with FLINK-8540 [1], Flink no longer calls any "List" operation on the checkpoint side. The only remaining "List" operation is used when reading files in the file input format. In a nutshell, these "List" calls should not come from Flink if you're using Flink 1.5+.


[1] https://issues.apache.org/jira/browse/FLINK-8540

Best
Yun Tang
