You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@asterixdb.apache.org by Chen Luo <cl...@uci.edu> on 2019/10/27 17:26:13 UTC

Checkpoint file not forced to disk

Hi devs,

I noticed the checkpoint file is not forced to disk after completion, but
we still proceed to truncate logs and older checkpoint files [1]. This
seems to be a bug to me. Also, from my understanding, reading the
checkpoint file without forcing to disk will still succeed because the file
can be read from the OS write cache. Is there any other considerations for
not forcing checkpoint files?


[1]
https://github.com/apache/asterixdb/blob/2a76a0fe83fc5534c00923cd0f09f8477eac713a/asterixdb/asterix-transactions/src/main/java/org/apache/asterix/transaction/management/service/recovery/AbstractCheckpointManager.java#L176

Re: Checkpoint file not forced to disk

Posted by Mike Carey <dt...@gmail.com>.

Got it - but this does mean log reclamation has to be dependent on 
knowing that a successful checkpoint flush has occurred.

On 10/28/19 5:22 AM, Murtadha Hubail wrote:
> @Mike Carey,
>
> For this specific system checkpoint file, we have a test that ensures even if the file is corrupted, the node will continue its startup and we will just force recovery to start from the beginning of the transaction log.
>   
> @Chen Luo,
>
> Thanks for checking and highlighting the issue. Looks like forcing a file on closing its stream is OS dependent. Apparently, we also need FileDescriptor.sync() for non-local storage to guarantee a file is forced. We have other metadata files that we need to do this for. If you don't mind, I will submit a patch to cover all of these and I will include the changes from your patch [1].
>
> Cheers,
> Murtadha
>
> [1] https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/3923
>
> On 10/28/2019, 9:56 AM, "Mike Carey" <dt...@gmail.com> wrote:
>
>      Sounds right ... Explicit call needed!  Better crash testing also needed;
>      this one should not be impossible to test for.  Do we have any crash
>      testing that we can extend?
>      
>      On Sun, Oct 27, 2019, 7:15 PM Chen Luo <cl...@uci.edu> wrote:
>      
>      > HI Murtadha,
>      >
>      > I don't think closing a file will automatically force it to disk. It only
>      > forces the buffer from JVM to OS. I did some search online and also checked
>      > the source code of JDK 8. When a file channel is closed, only the close
>      > function (provided by Linux) is called [1]. The only place that calls fsync
>      > is inside the FileChannel.force() method [2]. For correctness, I believe
>      > force should be called explicitly, as we did when an LSM merge is
>      > completed.
>      >
>      > [1]
>      >
>      > https://github.com/frohoff/jdk8u-jdk/blob/da0da73ab82ed714dc5be94acd2f0d00fbdfe2e9/src/solaris/native/sun/nio/ch/FileDispatcherImpl.c#L273
>      > [2]
>      >
>      > https://github.com/frohoff/jdk8u-jdk/blob/da0da73ab82ed714dc5be94acd2f0d00fbdfe2e9/src/solaris/native/sun/nio/ch/FileDispatcherImpl.c#L145
>      >
>      > Best regards,
>      > Chen Luo
>      >
>      > On Sun, Oct 27, 2019 at 5:24 PM Murtadha Hubail <hu...@gmail.com>
>      > wrote:
>      >
>      > > Hi Chen,
>      > >
>      > > If I'm not mistaken, Files#write ensures that the file is closed after
>      > > writing the bytes which should flush the file. If not, then we probably
>      > > should add an explicit flush there.
>      > >
>      > > Cheers,
>      > > Murtadha
>      > >
>      > > On 10/27/2019, 8:26 PM, "Chen Luo" <cl...@uci.edu> wrote:
>      > >
>      > >     Hi devs,
>      > >
>      > >     I noticed the checkpoint file is not forced to disk after completion,
>      > > but
>      > >     we still proceed to truncate logs and older checkpoint files [1].
>      > This
>      > >     seems to be a bug to me. Also, from my understanding, reading the
>      > >     checkpoint file without forcing to disk will still succeed because
>      > the
>      > > file
>      > >     can be read from the OS write cache. Is there any other
>      > considerations
>      > > for
>      > >     not forcing checkpoint files?
>      > >
>      > >
>      > >     [1]
>      > >
>      > >
>      > https://github.com/apache/asterixdb/blob/2a76a0fe83fc5534c00923cd0f09f8477eac713a/asterixdb/asterix-transactions/src/main/java/org/apache/asterix/transaction/management/service/recovery/AbstractCheckpointManager.java#L176
>      > >
>      > >
>      > >
>      > >
>      >
>      
>
>

Re: Checkpoint file not forced to disk

Posted by Murtadha Hubail <hu...@gmail.com>.

@Mike Carey,

For this specific system checkpoint file, we have a test that ensures even if the file is corrupted, the node will continue its startup and we will just force recovery to start from the beginning of the transaction log.
 
@Chen Luo,

Thanks for checking and highlighting the issue. Looks like forcing a file on closing its stream is OS dependent. Apparently, we also need FileDescriptor.sync() for non-local storage to guarantee a file is forced. We have other metadata files that we need to do this for. If you don't mind, I will submit a patch to cover all of these and I will include the changes from your patch [1].

Cheers,
Murtadha

[1] https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/3923

On 10/28/2019, 9:56 AM, "Mike Carey" <dt...@gmail.com> wrote:

    Sounds right ... Explicit call needed!  Better crash testing also needed;
    this one should not be impossible to test for.  Do we have any crash
    testing that we can extend?
    
    On Sun, Oct 27, 2019, 7:15 PM Chen Luo <cl...@uci.edu> wrote:
    
    > HI Murtadha,
    >
    > I don't think closing a file will automatically force it to disk. It only
    > forces the buffer from JVM to OS. I did some search online and also checked
    > the source code of JDK 8. When a file channel is closed, only the close
    > function (provided by Linux) is called [1]. The only place that calls fsync
    > is inside the FileChannel.force() method [2]. For correctness, I believe
    > force should be called explicitly, as we did when an LSM merge is
    > completed.
    >
    > [1]
    >
    > https://github.com/frohoff/jdk8u-jdk/blob/da0da73ab82ed714dc5be94acd2f0d00fbdfe2e9/src/solaris/native/sun/nio/ch/FileDispatcherImpl.c#L273
    > [2]
    >
    > https://github.com/frohoff/jdk8u-jdk/blob/da0da73ab82ed714dc5be94acd2f0d00fbdfe2e9/src/solaris/native/sun/nio/ch/FileDispatcherImpl.c#L145
    >
    > Best regards,
    > Chen Luo
    >
    > On Sun, Oct 27, 2019 at 5:24 PM Murtadha Hubail <hu...@gmail.com>
    > wrote:
    >
    > > Hi Chen,
    > >
    > > If I'm not mistaken, Files#write ensures that the file is closed after
    > > writing the bytes which should flush the file. If not, then we probably
    > > should add an explicit flush there.
    > >
    > > Cheers,
    > > Murtadha
    > >
    > > On 10/27/2019, 8:26 PM, "Chen Luo" <cl...@uci.edu> wrote:
    > >
    > >     Hi devs,
    > >
    > >     I noticed the checkpoint file is not forced to disk after completion,
    > > but
    > >     we still proceed to truncate logs and older checkpoint files [1].
    > This
    > >     seems to be a bug to me. Also, from my understanding, reading the
    > >     checkpoint file without forcing to disk will still succeed because
    > the
    > > file
    > >     can be read from the OS write cache. Is there any other
    > considerations
    > > for
    > >     not forcing checkpoint files?
    > >
    > >
    > >     [1]
    > >
    > >
    > https://github.com/apache/asterixdb/blob/2a76a0fe83fc5534c00923cd0f09f8477eac713a/asterixdb/asterix-transactions/src/main/java/org/apache/asterix/transaction/management/service/recovery/AbstractCheckpointManager.java#L176
    > >
    > >
    > >
    > >
    >

Re: Checkpoint file not forced to disk

Posted by Chen Luo <cl...@uci.edu>.

Hi Murtadha,

Feel free to abandon my submitted patch. PS - the current transaction log
and LSM disk components also use FileChannel.force() to force data to disk.

Best regards,
Chen Luo

On Mon, Oct 28, 2019 at 6:25 AM Murtadha Hubail <hu...@gmail.com> wrote:

> P.S.
> Looks like FileChannel.force() is enough according to [1] and we don't
> need FileDescriptor.sync().
>
> [1]
> https://stackoverflow.com/questions/5650327/are-filechannel-force-and-filedescriptor-sync-both-needed
>
>
> On 10/28/2019, 3:22 PM, "Murtadha Hubail" <hu...@gmail.com> wrote:
>
>     @Mike Carey,
>
>     For this specific system checkpoint file, we have a test that ensures
> even if the file is corrupted, the node will continue its startup and we
> will just force recovery to start from the beginning of the transaction log.
>
>     @Chen Luo,
>
>     Thanks for checking and highlighting the issue. Looks like forcing a
> file on closing its stream is OS dependent. Apparently, we also need
> FileDescriptor.sync() for non-local storage to guarantee a file is forced.
> We have other metadata files that we need to do this for. If you don't
> mind, I will submit a patch to cover all of these and I will include the
> changes from your patch [1].
>
>     Cheers,
>     Murtadha
>
>     [1] https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/3923
>
>     On 10/28/2019, 9:56 AM, "Mike Carey" <dt...@gmail.com> wrote:
>
>         Sounds right ... Explicit call needed!  Better crash testing also
> needed;
>         this one should not be impossible to test for.  Do we have any
> crash
>         testing that we can extend?
>
>         On Sun, Oct 27, 2019, 7:15 PM Chen Luo <cl...@uci.edu> wrote:
>
>         > HI Murtadha,
>         >
>         > I don't think closing a file will automatically force it to
> disk. It only
>         > forces the buffer from JVM to OS. I did some search online and
> also checked
>         > the source code of JDK 8. When a file channel is closed, only
> the close
>         > function (provided by Linux) is called [1]. The only place that
> calls fsync
>         > is inside the FileChannel.force() method [2]. For correctness, I
> believe
>         > force should be called explicitly, as we did when an LSM merge is
>         > completed.
>         >
>         > [1]
>         >
>         >
> https://github.com/frohoff/jdk8u-jdk/blob/da0da73ab82ed714dc5be94acd2f0d00fbdfe2e9/src/solaris/native/sun/nio/ch/FileDispatcherImpl.c#L273
>         > [2]
>         >
>         >
> https://github.com/frohoff/jdk8u-jdk/blob/da0da73ab82ed714dc5be94acd2f0d00fbdfe2e9/src/solaris/native/sun/nio/ch/FileDispatcherImpl.c#L145
>         >
>         > Best regards,
>         > Chen Luo
>         >
>         > On Sun, Oct 27, 2019 at 5:24 PM Murtadha Hubail <
> hubailmor@gmail.com>
>         > wrote:
>         >
>         > > Hi Chen,
>         > >
>         > > If I'm not mistaken, Files#write ensures that the file is
> closed after
>         > > writing the bytes which should flush the file. If not, then we
> probably
>         > > should add an explicit flush there.
>         > >
>         > > Cheers,
>         > > Murtadha
>         > >
>         > > On 10/27/2019, 8:26 PM, "Chen Luo" <cl...@uci.edu> wrote:
>         > >
>         > >     Hi devs,
>         > >
>         > >     I noticed the checkpoint file is not forced to disk after
> completion,
>         > > but
>         > >     we still proceed to truncate logs and older checkpoint
> files [1].
>         > This
>         > >     seems to be a bug to me. Also, from my understanding,
> reading the
>         > >     checkpoint file without forcing to disk will still succeed
> because
>         > the
>         > > file
>         > >     can be read from the OS write cache. Is there any other
>         > considerations
>         > > for
>         > >     not forcing checkpoint files?
>         > >
>         > >
>         > >     [1]
>         > >
>         > >
>         >
> https://github.com/apache/asterixdb/blob/2a76a0fe83fc5534c00923cd0f09f8477eac713a/asterixdb/asterix-transactions/src/main/java/org/apache/asterix/transaction/management/service/recovery/AbstractCheckpointManager.java#L176
>         > >
>         > >
>         > >
>         > >
>         >
>
>
>
>
>

Re: Checkpoint file not forced to disk

Posted by Murtadha Hubail <hu...@gmail.com>.

P.S.
Looks like FileChannel.force() is enough according to [1] and we don't need FileDescriptor.sync().

[1] https://stackoverflow.com/questions/5650327/are-filechannel-force-and-filedescriptor-sync-both-needed


On 10/28/2019, 3:22 PM, "Murtadha Hubail" <hu...@gmail.com> wrote:

    @Mike Carey,
    
    For this specific system checkpoint file, we have a test that ensures even if the file is corrupted, the node will continue its startup and we will just force recovery to start from the beginning of the transaction log.
     
    @Chen Luo,
    
    Thanks for checking and highlighting the issue. Looks like forcing a file on closing its stream is OS dependent. Apparently, we also need FileDescriptor.sync() for non-local storage to guarantee a file is forced. We have other metadata files that we need to do this for. If you don't mind, I will submit a patch to cover all of these and I will include the changes from your patch [1].
    
    Cheers,
    Murtadha
    
    [1] https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/3923
    
    On 10/28/2019, 9:56 AM, "Mike Carey" <dt...@gmail.com> wrote:
    
        Sounds right ... Explicit call needed!  Better crash testing also needed;
        this one should not be impossible to test for.  Do we have any crash
        testing that we can extend?
        
        On Sun, Oct 27, 2019, 7:15 PM Chen Luo <cl...@uci.edu> wrote:
        
        > HI Murtadha,
        >
        > I don't think closing a file will automatically force it to disk. It only
        > forces the buffer from JVM to OS. I did some search online and also checked
        > the source code of JDK 8. When a file channel is closed, only the close
        > function (provided by Linux) is called [1]. The only place that calls fsync
        > is inside the FileChannel.force() method [2]. For correctness, I believe
        > force should be called explicitly, as we did when an LSM merge is
        > completed.
        >
        > [1]
        >
        > https://github.com/frohoff/jdk8u-jdk/blob/da0da73ab82ed714dc5be94acd2f0d00fbdfe2e9/src/solaris/native/sun/nio/ch/FileDispatcherImpl.c#L273
        > [2]
        >
        > https://github.com/frohoff/jdk8u-jdk/blob/da0da73ab82ed714dc5be94acd2f0d00fbdfe2e9/src/solaris/native/sun/nio/ch/FileDispatcherImpl.c#L145
        >
        > Best regards,
        > Chen Luo
        >
        > On Sun, Oct 27, 2019 at 5:24 PM Murtadha Hubail <hu...@gmail.com>
        > wrote:
        >
        > > Hi Chen,
        > >
        > > If I'm not mistaken, Files#write ensures that the file is closed after
        > > writing the bytes which should flush the file. If not, then we probably
        > > should add an explicit flush there.
        > >
        > > Cheers,
        > > Murtadha
        > >
        > > On 10/27/2019, 8:26 PM, "Chen Luo" <cl...@uci.edu> wrote:
        > >
        > >     Hi devs,
        > >
        > >     I noticed the checkpoint file is not forced to disk after completion,
        > > but
        > >     we still proceed to truncate logs and older checkpoint files [1].
        > This
        > >     seems to be a bug to me. Also, from my understanding, reading the
        > >     checkpoint file without forcing to disk will still succeed because
        > the
        > > file
        > >     can be read from the OS write cache. Is there any other
        > considerations
        > > for
        > >     not forcing checkpoint files?
        > >
        > >
        > >     [1]
        > >
        > >
        > https://github.com/apache/asterixdb/blob/2a76a0fe83fc5534c00923cd0f09f8477eac713a/asterixdb/asterix-transactions/src/main/java/org/apache/asterix/transaction/management/service/recovery/AbstractCheckpointManager.java#L176
        > >
        > >
        > >
        > >
        >

Re: Checkpoint file not forced to disk

Posted by Mike Carey <dt...@gmail.com>.

Sounds right ... Explicit call needed!  Better crash testing also needed;
this one should not be impossible to test for.  Do we have any crash
testing that we can extend?

On Sun, Oct 27, 2019, 7:15 PM Chen Luo <cl...@uci.edu> wrote:

> HI Murtadha,
>
> I don't think closing a file will automatically force it to disk. It only
> forces the buffer from JVM to OS. I did some search online and also checked
> the source code of JDK 8. When a file channel is closed, only the close
> function (provided by Linux) is called [1]. The only place that calls fsync
> is inside the FileChannel.force() method [2]. For correctness, I believe
> force should be called explicitly, as we did when an LSM merge is
> completed.
>
> [1]
>
> https://github.com/frohoff/jdk8u-jdk/blob/da0da73ab82ed714dc5be94acd2f0d00fbdfe2e9/src/solaris/native/sun/nio/ch/FileDispatcherImpl.c#L273
> [2]
>
> https://github.com/frohoff/jdk8u-jdk/blob/da0da73ab82ed714dc5be94acd2f0d00fbdfe2e9/src/solaris/native/sun/nio/ch/FileDispatcherImpl.c#L145
>
> Best regards,
> Chen Luo
>
> On Sun, Oct 27, 2019 at 5:24 PM Murtadha Hubail <hu...@gmail.com>
> wrote:
>
> > Hi Chen,
> >
> > If I'm not mistaken, Files#write ensures that the file is closed after
> > writing the bytes which should flush the file. If not, then we probably
> > should add an explicit flush there.
> >
> > Cheers,
> > Murtadha
> >
> > On 10/27/2019, 8:26 PM, "Chen Luo" <cl...@uci.edu> wrote:
> >
> >     Hi devs,
> >
> >     I noticed the checkpoint file is not forced to disk after completion,
> > but
> >     we still proceed to truncate logs and older checkpoint files [1].
> This
> >     seems to be a bug to me. Also, from my understanding, reading the
> >     checkpoint file without forcing to disk will still succeed because
> the
> > file
> >     can be read from the OS write cache. Is there any other
> considerations
> > for
> >     not forcing checkpoint files?
> >
> >
> >     [1]
> >
> >
> https://github.com/apache/asterixdb/blob/2a76a0fe83fc5534c00923cd0f09f8477eac713a/asterixdb/asterix-transactions/src/main/java/org/apache/asterix/transaction/management/service/recovery/AbstractCheckpointManager.java#L176
> >
> >
> >
> >
>

Re: Checkpoint file not forced to disk

Posted by Chen Luo <cl...@uci.edu>.

HI Murtadha,

I don't think closing a file will automatically force it to disk. It only
forces the buffer from JVM to OS. I did some search online and also checked
the source code of JDK 8. When a file channel is closed, only the close
function (provided by Linux) is called [1]. The only place that calls fsync
is inside the FileChannel.force() method [2]. For correctness, I believe
force should be called explicitly, as we did when an LSM merge is completed.

[1]
https://github.com/frohoff/jdk8u-jdk/blob/da0da73ab82ed714dc5be94acd2f0d00fbdfe2e9/src/solaris/native/sun/nio/ch/FileDispatcherImpl.c#L273
[2]
https://github.com/frohoff/jdk8u-jdk/blob/da0da73ab82ed714dc5be94acd2f0d00fbdfe2e9/src/solaris/native/sun/nio/ch/FileDispatcherImpl.c#L145

Best regards,
Chen Luo

On Sun, Oct 27, 2019 at 5:24 PM Murtadha Hubail <hu...@gmail.com> wrote:

> Hi Chen,
>
> If I'm not mistaken, Files#write ensures that the file is closed after
> writing the bytes which should flush the file. If not, then we probably
> should add an explicit flush there.
>
> Cheers,
> Murtadha
>
> On 10/27/2019, 8:26 PM, "Chen Luo" <cl...@uci.edu> wrote:
>
>     Hi devs,
>
>     I noticed the checkpoint file is not forced to disk after completion,
> but
>     we still proceed to truncate logs and older checkpoint files [1]. This
>     seems to be a bug to me. Also, from my understanding, reading the
>     checkpoint file without forcing to disk will still succeed because the
> file
>     can be read from the OS write cache. Is there any other considerations
> for
>     not forcing checkpoint files?
>
>
>     [1]
>
> https://github.com/apache/asterixdb/blob/2a76a0fe83fc5534c00923cd0f09f8477eac713a/asterixdb/asterix-transactions/src/main/java/org/apache/asterix/transaction/management/service/recovery/AbstractCheckpointManager.java#L176
>
>
>
>

Re: Checkpoint file not forced to disk

Posted by Murtadha Hubail <hu...@gmail.com>.

Hi Chen,

If I'm not mistaken, Files#write ensures that the file is closed after writing the bytes which should flush the file. If not, then we probably should add an explicit flush there.

Cheers,
Murtadha

On 10/27/2019, 8:26 PM, "Chen Luo" <cl...@uci.edu> wrote:

    Hi devs,
    
    I noticed the checkpoint file is not forced to disk after completion, but
    we still proceed to truncate logs and older checkpoint files [1]. This
    seems to be a bug to me. Also, from my understanding, reading the
    checkpoint file without forcing to disk will still succeed because the file
    can be read from the OS write cache. Is there any other considerations for
    not forcing checkpoint files?
    
    
    [1]
    https://github.com/apache/asterixdb/blob/2a76a0fe83fc5534c00923cd0f09f8477eac713a/asterixdb/asterix-transactions/src/main/java/org/apache/asterix/transaction/management/service/recovery/AbstractCheckpointManager.java#L176