Posted to user@flink.apache.org by Paul Lam <pa...@gmail.com> on 2019/03/27 11:06:11 UTC

RocksDB local snapshot silently disappears and causes checkpoint to fail

Hi,

I’m using Flink 1.6.4 and recently ran into a weird issue with the RocksDB state backend. A job that ran fine on one YARN cluster keeps failing on checkpoints after being migrated to a new cluster (almost everything is the same, but with better machines), and even a clean restart doesn’t help.

The root cause is an IllegalStateException with no error message. The stack trace shows that when the RocksDB state backend runs the async part of the snapshot (runSnapshot), it finds that the local snapshot directory created by RocksDB earlier (takeSnapshot) does not exist.

I added more logging to RocksDBKeyedStateBackend (see attachment) and found that the local snapshot was performed as expected and the .sst files were written, but by the time the async task accessed the directory, the whole snapshot directory was gone.

What could possibly be the cause? Thanks a lot.

Best,
Paul Lam

Re: RocksDB local snapshot silently disappears and causes checkpoint to fail

Posted by Yu Li <li...@apache.org>.

Ok, much clearer now. Thanks.

Best Regards,
Yu


On Thu, 28 Mar 2019 at 15:59, Paul Lam <pa...@gmail.com> wrote:

> Hi Yu,
>
> I’ve set `fs.default-scheme` to hdfs, and it's mainly used for simplifying
> checkpoint / savepoint / HA paths.
>
> And I leave the rocksdb local dir empty, so the local snapshot still goes
> to YARN local cache dirs.
>
> Hope that answers your question.
>
> Best,
> Paul Lam

Re: RocksDB local snapshot silently disappears and causes checkpoint to fail

Posted by Paul Lam <pa...@gmail.com>.

Hi Yu,

I’ve set `fs.default-scheme` to hdfs, and it's mainly used for simplifying checkpoint / savepoint / HA paths.

And I leave the rocksdb local dir empty, so the local snapshot still goes to YARN local cache dirs.

Hope that answers your question.

Best,
Paul Lam

> On 28 Mar 2019, at 15:34, Yu Li <li...@apache.org> wrote:
> 
> Hi Paul,
> 
> Regarding "mistakenly uses the default filesystem scheme, which is specified to hdfs in the new cluster in my case", could you further clarify the configuration property and value you're using? Do you mean you're using an HDFS directory to store the local snapshot data? Thanks.
> 
> Best Regards,
> Yu

Re: RocksDB local snapshot silently disappears and causes checkpoint to fail

Posted by Yu Li <li...@apache.org>.

Hi Paul,

Regarding "mistakenly uses the default filesystem scheme, which is
specified to hdfs in the new cluster in my case", could you further clarify
the configuration property and value you're using? Do you mean you're using
an HDFS directory to store the local snapshot data? Thanks.

Best Regards,
Yu


On Thu, 28 Mar 2019 at 14:34, Paul Lam <pa...@gmail.com> wrote:

> Hi,
>
> It turns out that under certain circumstances rocksdb statebackend
> mistakenly uses the default filesystem scheme, which is specified to hdfs
> in the new cluster in my case.
>
> I’ve filed a Jira to track this[1].
>
> [1] https://issues.apache.org/jira/browse/FLINK-12042
>
> Best,
> Paul Lam

Re: RocksDB local snapshot silently disappears and causes checkpoint to fail

Posted by Paul Lam <pa...@gmail.com>.

Hi,

It turns out that under certain circumstances rocksdb statebackend mistakenly uses the default filesystem scheme, which is specified to hdfs in the new cluster in my case.

I’ve filed a Jira to track this[1]. 

[1] https://issues.apache.org/jira/browse/FLINK-12042
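
To illustrate the kind of mismatch involved, here is a minimal sketch. It is not Flink's actual code: it only assumes that a path written without a scheme is resolved against whatever default scheme is configured. The class `SchemeResolution`, the method `effectiveScheme`, and the directory path are hypothetical names made up for this example; only `java.net.URI` from the JDK is used.

```java
import java.net.URI;

public class SchemeResolution {

    /**
     * Returns the scheme a path would be resolved against: the path's own
     * scheme if it has one, otherwise the configured default scheme.
     * (Hypothetical helper, not part of Flink.)
     */
    public static String effectiveScheme(String path, String defaultScheme) {
        String scheme = URI.create(path).getScheme();
        return scheme != null ? scheme : defaultScheme;
    }

    public static void main(String[] args) {
        // Hypothetical local snapshot directory, written without a scheme.
        String localSnapshotDir = "/data/yarn/local/chk-1";

        // Default scheme is the local filesystem: the path stays local.
        System.out.println(effectiveScheme(localSnapshotDir, "file"));

        // Default scheme is hdfs: the same path is now looked up on HDFS.
        System.out.println(effectiveScheme(localSnapshotDir, "hdfs"));

        // A path with an explicit scheme is not affected by the default.
        System.out.println(effectiveScheme("file:///data/yarn/local/chk-1", "hdfs"));
    }
}
```

Under that assumption, a scheme-less local directory would be checked on HDFS once the default scheme is hdfs. It does not exist there, which would be consistent with the local snapshot directory appearing to be gone.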

Best,
Paul Lam

> On 27 Mar 2019, at 19:06, Paul Lam <pa...@gmail.com> wrote:
> 
> Hi,
> 
> I’m using Flink 1.6.4 and recently ran into a weird issue with the RocksDB state backend. A job that ran fine on one YARN cluster keeps failing on checkpoints after being migrated to a new cluster (almost everything is the same, but with better machines), and even a clean restart doesn’t help.
> 
> The root cause is an IllegalStateException with no error message. The stack trace shows that when the RocksDB state backend runs the async part of the snapshot (runSnapshot), it finds that the local snapshot directory created by RocksDB earlier (takeSnapshot) does not exist.
> 
> I added more logging to RocksDBKeyedStateBackend (see attachment) and found that the local snapshot was performed as expected and the .sst files were written, but by the time the async task accessed the directory, the whole snapshot directory was gone.
> 
> What could possibly be the cause? Thanks a lot.
> 
> Best,
> Paul Lam
> 
> <rocksdb_illegal_state.log.md>
>