Posted to dev@asterixdb.apache.org by Chen Luo <cl...@uci.edu> on 2017/11/29 21:54:34 UTC

About the system behavior when the checkpoint is corrupted

Hi devs,

Recently I ran into a very annoying issue with recovery. The
checkpoint file of my dataset was somehow corrupted (and I don't know
why). When I restarted AsterixDB, it failed to read the
checkpoint file and started recovering from a clean state. This is highly
undesirable in that it silently cleaned up all of my experiment datasets,
roughly 100GB, and it will take me days to re-ingest the data to
resume my experiments.

I think the behavior of cleaning up all data when some small thing goes
wrong is undesirable and dangerous. When AsterixDB fails to restart and
finds the data directory non-empty, I think it should notify the user and
let the user make the decision. For example, it could refuse to restart at
this point, and the user could clean up the directory manually, try to use a
backup checkpoint file, or add a flag to force a restart. In any case, blindly
cleaning up all files seems to be a dangerous solution.
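For concreteness, the refuse-to-start safeguard being proposed could be sketched roughly as follows. The class name, method name, and force flag here are all hypothetical, not AsterixDB's actual code:

```java
import java.nio.file.*;

// Hypothetical sketch: refuse to proceed with a fresh bootstrap when the
// data directory is non-empty but no usable checkpoint exists, unless the
// operator explicitly passes a force flag.
public class DataDirGuard {

    /** Returns true when startup may proceed (normal recovery or fresh start). */
    public static boolean canProceed(Path dataDir, boolean checkpointUsable,
                                     boolean forceStart) {
        boolean dirHasData = Files.exists(dataDir) && isNonEmpty(dataDir);
        if (checkpointUsable || !dirHasData) {
            return true; // normal recovery, or a genuinely fresh start
        }
        // Data exists but the checkpoint is missing/corrupt: only proceed
        // (and potentially wipe storage) if the operator explicitly forced it.
        return forceStart;
    }

    private static boolean isNonEmpty(Path dir) {
        try (DirectoryStream<Path> s = Files.newDirectoryStream(dir)) {
            return s.iterator().hasNext();
        } catch (Exception e) {
            return false;
        }
    }
}
```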

Any thoughts on this?

Best regards,
Chen Luo

Re: About the system behavior when the checkpoint is corrupted

Posted by Mike Carey <dt...@gmail.com>.
IMO missing checkpoints should not be taken as the indicator of 
first-time bootstrap - the invocation path for a node should be explicit 
about that.  I.e., a node should be started explicitly either in 
first-time mode or not, and should thus only delete data if it was told 
explicitly that it's in first-time mode.
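A minimal sketch of that explicit-mode idea, with a hypothetical flag name (AsterixDB's real startup path differs): storage is formatted only when the node is explicitly told it is a first-time start, never inferred from a missing checkpoint.

```java
// Hypothetical sketch of an explicit first-time-mode startup flag.
public class NodeStartMode {

    public enum Mode { FIRST_TIME, RESTART }

    /** Parse the startup mode from command-line args; default is RESTART. */
    public static Mode fromArgs(String[] args) {
        for (String a : args) {
            if (a.equals("--first-time")) {
                return Mode.FIRST_TIME;
            }
        }
        return Mode.RESTART; // default: never wipe existing data
    }

    /** Storage is only formatted when first-time mode was requested explicitly. */
    public static boolean shouldFormatStorage(Mode mode) {
        return mode == Mode.FIRST_TIME;
    }
}
```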


On 11/29/17 2:52 PM, Murtadha Hubail wrote:
> Just to clarify a couple of things.
>
> If all existing checkpoints are corrupted, nothing is deleted and recovery is performed from the beginning of the transaction log. In this
> case, most likely your checkpoint file was completely missing. You can check the logs to confirm whether you still have them.
>
> Regarding deleting storage on missing checkpoints: at one point, many developers were hitting issues during development or testing because
> some test case would leave storage data behind, and the next run would fail due to the leftover storage artifacts. Hence, the
> decision was made to delete existing storage on missing checkpoints, since they are our current indicator of a node’s first-time bootstrap
> versus recovery after restart.
>
> Having said that, I agree that deleting all existing data isn’t the right thing to do.
>
> Cheers,
> Murtadha


Re: About the system behavior when the checkpoint is corrupted

Posted by Murtadha Hubail <hu...@gmail.com>.
Just to clarify a couple of things.

If all existing checkpoints are corrupted, nothing is deleted and recovery is performed from the beginning of the transaction log. In this
case, most likely your checkpoint file was completely missing. You can check the logs to confirm whether you still have them.

Regarding deleting storage on missing checkpoints: at one point, many developers were hitting issues during development or testing because
some test case would leave storage data behind, and the next run would fail due to the leftover storage artifacts. Hence, the
decision was made to delete existing storage on missing checkpoints, since they are our current indicator of a node’s first-time bootstrap
versus recovery after restart.
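The distinction described above, a corrupted checkpoint versus a missing one, could be sketched as follows (class, enum, and method names are illustrative only, not AsterixDB's actual code):

```java
// Hypothetical sketch of the current recovery decision: a corrupted
// checkpoint triggers replay from the start of the transaction log,
// while a missing checkpoint is taken as a first-time bootstrap,
// which deletes any existing storage.
public class CheckpointPolicy {

    public enum Action { RECOVER_FROM_CHECKPOINT, REPLAY_FULL_LOG, TREAT_AS_FIRST_BOOT }

    public static Action decide(boolean checkpointExists, boolean checkpointValid) {
        if (!checkpointExists) {
            return Action.TREAT_AS_FIRST_BOOT; // the dangerous case under discussion
        }
        return checkpointValid ? Action.RECOVER_FROM_CHECKPOINT
                               : Action.REPLAY_FULL_LOG;
    }
}
```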

Having said that, I agree that deleting all existing data isn’t the right thing to do.

Cheers,
Murtadha

On 11/30/2017, 1:26 AM, "Chen Luo" <cl...@uci.edu> wrote:

    I'm not sure how the checkpoint file was corrupted. For my experiments, I
    have several versions of AsterixDB sharing the same storage directory (so
    that I can evaluate the performance after making some changes). Recently I
    synced my branch with master, and maybe this caused some problem with the
    checkpoint file (e.g., different versions of the codebase?).

    However, I think cleaning up the entire data directory is dangerous. The
    user (such as me) can back up the checkpoint file because it's small, but it
    would be cumbersome to back up the entire data directory. When there is
    indeed something wrong with the checkpoint file, it's better that the user
    be made aware of it and can make the decision himself.
    
    Best regards,
    Chen Luo
    



Re: About the system behavior when the checkpoint is corrupted

Posted by Chen Luo <cl...@uci.edu>.
I'm not sure how the checkpoint file was corrupted. For my experiments, I
have several versions of AsterixDB sharing the same storage directory (so
that I can evaluate the performance after making some changes). Recently I
synced my branch with master, and maybe this caused some problem with the
checkpoint file (e.g., different versions of the codebase?).

However, I think cleaning up the entire data directory is dangerous. The
user (such as me) can back up the checkpoint file because it's small, but it
would be cumbersome to back up the entire data directory. When there is
indeed something wrong with the checkpoint file, it's better that the user
be made aware of it and can make the decision himself.

Best regards,
Chen Luo

On Wed, Nov 29, 2017 at 2:11 PM, abdullah alamoudi <ba...@gmail.com>
wrote:

> I wonder how it got to that state.
>
> The first thing an instance does after initialization is create the
> snapshot file.
> This will only be deleted after a new (uncorrupted) snapshot file is
> created.
>
> I understand your point, but I wonder how it got to this state. Bug!?
>
> Cheers,
> Abdullah.

Re: About the system behavior when the checkpoint is corrupted

Posted by Ian Maxon <im...@uci.edu>.
To be more precise, what I saw was that the checkpoint file was
actually there but zero-length, if memory serves (and hence corrupt).
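A zero-length file like that usually points to a crash between creating or truncating the checkpoint and writing its contents. One standard remedy, sketched here under hypothetical names rather than as AsterixDB's actual code, is to write to a temp file, force it to disk, and atomically rename it over the old checkpoint, so a valid copy exists at every instant:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.*;

// Hypothetical sketch: crash-safe checkpoint writing via write-temp,
// fsync, then atomic rename.
public class AtomicCheckpointWriter {

    public static void write(Path checkpoint, byte[] contents) throws IOException {
        Path tmp = checkpoint.resolveSibling(checkpoint.getFileName() + ".tmp");
        Files.write(tmp, contents);
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            ch.force(true); // ensure the bytes hit disk before the rename
        }
        // Atomic rename: readers see either the old or the new checkpoint,
        // never a partial or zero-length one.
        Files.move(tmp, checkpoint, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

A crash before the rename leaves the old checkpoint untouched; a crash after it leaves the new one fully written.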

On Wed, Nov 29, 2017 at 2:11 PM, abdullah alamoudi <ba...@gmail.com> wrote:
> I wonder how it got to that state.
>
> The first thing an instance does after initialization is create the snapshot file.
> This will only be deleted after a new (uncorrupted) snapshot file is created.
>
> I understand your point, but I wonder how it got to this state. Bug!?
>
> Cheers,
> Abdullah.

Re: About the system behavior when the checkpoint is corrupted

Posted by abdullah alamoudi <ba...@gmail.com>.
I wonder how it got to that state.

The first thing an instance does after initialization is create the snapshot file.
This will only be deleted after a new (uncorrupted) snapshot file is created.

I understand your point, but I wonder how it got to this state. Bug!?

Cheers,
Abdullah.

> On Nov 29, 2017, at 1:54 PM, Chen Luo <cl...@uci.edu> wrote:
> 
> Hi devs,
> 
> Recently I ran into a very annoying issue with recovery. The
> checkpoint file of my dataset was somehow corrupted (and I don't know
> why). When I restarted AsterixDB, it failed to read the
> checkpoint file and started recovering from a clean state. This is highly
> undesirable in that it silently cleaned up all of my experiment datasets,
> roughly 100GB, and it will take me days to re-ingest the data to
> resume my experiments.
> 
> I think the behavior of cleaning up all data when some small thing goes
> wrong is undesirable and dangerous. When AsterixDB fails to restart and
> finds the data directory non-empty, I think it should notify the user and
> let the user make the decision. For example, it could refuse to restart at
> this point, and the user could clean up the directory manually, try to use a
> backup checkpoint file, or add a flag to force a restart. In any case, blindly
> cleaning up all files seems to be a dangerous solution.
> 
> Any thoughts on this?
> 
> Best regards,
> Chen Luo


Re: About the system behavior when the checkpoint is corrupted

Posted by Mike Carey <dt...@gmail.com>.
+1 for not proceeding by simply removing the data in this (ideally 
unreachable) state....


On 11/29/17 2:15 PM, Ian Maxon wrote:
> I too have seen this issue, but I couldn't reproduce it or surmise how it
> might happen just from inspecting the code. How did it appear for you?
> I would disagree, however, that a checkpoint file not appearing is a small
> thing. It is more or less the most important artifact for recovery;
> it's not something that should ever have an issue like this.


Re: About the system behavior when the checkpoint is corrupted

Posted by Ian Maxon <im...@uci.edu>.
I too have seen this issue, but I couldn't reproduce it or surmise how it
might happen just from inspecting the code. How did it appear for you?
I would disagree, however, that a checkpoint file not appearing is a small
thing. It is more or less the most important artifact for recovery;
it's not something that should ever have an issue like this.

On Wed, Nov 29, 2017 at 1:54 PM, Chen Luo <cl...@uci.edu> wrote:
> Hi devs,
>
> Recently I ran into a very annoying issue with recovery. The
> checkpoint file of my dataset was somehow corrupted (and I don't know
> why). When I restarted AsterixDB, it failed to read the
> checkpoint file and started recovering from a clean state. This is highly
> undesirable in that it silently cleaned up all of my experiment datasets,
> roughly 100GB, and it will take me days to re-ingest the data to
> resume my experiments.
>
> I think the behavior of cleaning up all data when some small thing goes
> wrong is undesirable and dangerous. When AsterixDB fails to restart and
> finds the data directory non-empty, I think it should notify the user and
> let the user make the decision. For example, it could refuse to restart at
> this point, and the user could clean up the directory manually, try to use a
> backup checkpoint file, or add a flag to force a restart. In any case, blindly
> cleaning up all files seems to be a dangerous solution.
>
> Any thoughts on this?
>
> Best regards,
> Chen Luo