Posted to dev@flume.apache.org by Roshan Naik <ro...@hortonworks.com> on 2013/05/31 23:53:44 UTC

File Channel issue - recovering from BadCheckpoint exception

In the EventQueueBackingStoreFileV3 constructor, if it detects that the
checkpoint and meta files have differing logWriteOrderIds, it throws a
BadCheckpointException. Control goes back to the exception handler in
Log.replay(), which attempts to delete all the files in the checkpoint
directory and start fresh. The same file names are reused when starting fresh.

Unfortunately this does not work on Windows, since the deletion of
the checkpoint file in the checkpointDir fails. The failure is due to the
fact that the checkpoint file is memory mapped. Unless it is unmapped, the
deletion will not succeed, and unfortunately Java does not have unmap
support (a MappedByteBuffer stays mapped until it is garbage collected).
Windows does not permit deletion (or renaming) of files in use.
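
To make the failure mode concrete, here is a minimal standalone sketch (not
Flume code; the class and method names are made up for illustration). It maps
a temp file, closes the channel, and then tries to delete the file while the
buffer is still reachable. The expectation described above is that the delete
returns false on Windows and true on POSIX platforms.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of Flume.
class MappedDeleteDemo {

    /** Maps a temp file, then tries to delete it while the mapping is still live. */
    static boolean deleteWhileMapped() {
        try {
            Path file = Files.createTempFile("checkpoint", ".tmp");
            MappedByteBuffer buffer;
            try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw");
                 FileChannel channel = raf.getChannel()) {
                buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, 1024);
            }
            // The channel is closed, but the mapping remains valid until
            // 'buffer' is garbage collected; there is no public unmap call.
            boolean deleted = file.toFile().delete();
            buffer.put(0, (byte) 1); // keeps the buffer reachable past the delete
            if (!deleted) {
                file.toFile().deleteOnExit();
            }
            return deleted;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("deleted while mapped: " + deleteWhileMapped());
    }
}
```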

The obvious thought I am having is that when starting fresh we delete
whatever we can and invent a new file name for the ones we can't (I think
that is the checkpoint file only).
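
A rough sketch of that idea, assuming a simple numeric suffix is an
acceptable way to "invent" a name. CheckpointReset, resetCheckpointDir and
firstFreeName are hypothetical names for illustration; this is not the
existing Log.replay() code.

```java
import java.io.File;

// Hypothetical helper, not part of Flume.
class CheckpointReset {

    /**
     * Deletes every regular file it can in dir, then returns the file that a
     * fresh checkpoint could use: baseName if that name is free, otherwise
     * baseName.1, baseName.2, and so on.
     */
    static File resetCheckpointDir(File dir, String baseName) {
        File[] files = dir.listFiles();
        if (files != null) {
            for (File f : files) {
                // delete() returns false instead of throwing, e.g. for a
                // file that is still memory mapped on Windows.
                if (f.isFile() && !f.delete()) {
                    System.err.println("Could not delete " + f + ", leaving it in place");
                }
            }
        }
        return firstFreeName(dir, baseName);
    }

    /** Returns the first candidate name under dir that does not exist yet. */
    static File firstFreeName(File dir, String baseName) {
        File candidate = new File(dir, baseName);
        for (int i = 1; candidate.exists(); i++) {
            candidate = new File(dir, baseName + "." + i);
        }
        return candidate;
    }
}
```

The caller would also have to remember which name is current, since the
checkpoint can no longer be assumed to live under one fixed file name.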

Thoughts?

-roshan

Re: File Channel issue - recovering from BadCheckpoint exception

Posted by Brock Noland <br...@cloudera.com>.
I think we could add JUnit Assume statements to any tests which depend on
this value, since it will be auto-disabled on Windows.
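
Something along these lines could back the Assume call. This is only a
sketch: OsCheck is a made-up helper name, and the JUnit line in the comment
assumes JUnit 4 is on the test classpath.

```java
import java.util.Locale;

// Hypothetical helper for tests, not part of Flume.
class OsCheck {

    /** True if the given os.name value looks like a Windows platform. */
    static boolean isWindows(String osName) {
        return osName != null
            && osName.toLowerCase(Locale.ROOT).startsWith("windows");
    }

    /** True if the JVM is currently running on Windows. */
    static boolean isWindows() {
        return isWindows(System.getProperty("os.name"));
    }
}

// In a JUnit 4 test that relies on the auto-deletion:
//   org.junit.Assume.assumeTrue(!OsCheck.isWindows());
// A failed assumption marks the test as skipped rather than failed.
```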


On Fri, May 31, 2013 at 6:15 PM, Hari Shreedharan <hshreedharan@cloudera.com
> wrote:

> I am not sure how this is handled generally by Windows developers, but I'd
> assume there is a way to do that. I am fairly sure this is a known issue. I
> think the only thing we can do for now is to disable those unit tests if
> the build is on windows or have an if-else that tests the expected behavior
> on Windows. I don't really like having different behavior on Windows and
> posix platforms, but if the platform itself behaves in a specific way, I
> doubt there is anything we can do.
>
> In case of the dual checkpoints, we might be ok - because we actually
> don't open the files. We just create them and then copy the content and
> then close them.
>
>
> Cheers,
> Hari
>


-- 
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org

Re: File Channel issue - recovering from BadCheckpoint exception

Posted by Hari Shreedharan <hs...@cloudera.com>.
I am not sure how this is handled generally by Windows developers, but I'd assume there is a way to do that. I am fairly sure this is a known issue. I think the only thing we can do for now is to disable those unit tests if the build is on Windows or have an if-else that tests the expected behavior on Windows. I don't really like having different behavior on Windows and POSIX platforms, but if the platform itself behaves in a specific way, I doubt there is anything we can do.

In case of the dual checkpoints, we might be ok - because we actually don't open the files. We just create them and then copy the content and then close them. 


Cheers,
Hari


On Friday, May 31, 2013 at 4:01 PM, Roshan Naik wrote:

> i am concerned several unit tests might be dependent on the auto-deletion.


Re: File Channel issue - recovering from BadCheckpoint exception

Posted by Roshan Naik <ro...@hortonworks.com>.
I am concerned that several unit tests might depend on the auto-deletion.


On Fri, May 31, 2013 at 3:57 PM, Hari Shreedharan <hshreedharan@cloudera.com
> wrote:

> Roshan,
>
> No, that would break all config files from Flume 1.3.0 and Flume 1.3.1. We
> should probably have some code that specifically disables this on Windows
> and clearly document that.
>
>
> Cheers,
> Hari
>

Re: File Channel issue - recovering from BadCheckpoint exception

Posted by Hari Shreedharan <hs...@cloudera.com>.
Roshan,  

No, that would break all config files from Flume 1.3.0 and Flume 1.3.1. We should probably have some code that specifically disables this on Windows and clearly document that.  
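
One way this could look, as a sketch only: the setting keeps defaulting to
"delete" so that existing 1.3.0 and 1.3.1 configs behave as before, and it is
forced off on Windows. The class name and the property key below are
hypothetical, not actual Flume configuration names.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not part of Flume.
class AutoDeleteSetting {

    // Made-up property key for illustration only.
    static final String KEY = "autoDeleteBadCheckpoint";

    /**
     * Returns whether a bad checkpoint should be deleted automatically.
     * Defaults to true when the key is absent; always false on Windows.
     */
    static boolean shouldAutoDelete(Map<String, String> props, String osName) {
        boolean configured = Boolean.parseBoolean(props.getOrDefault(KEY, "true"));
        boolean windows = osName != null
            && osName.toLowerCase(Locale.ROOT).startsWith("windows");
        return configured && !windows;
    }
}
```

A caller would pass System.getProperty("os.name") as osName; taking it as a
parameter just makes the Windows branch testable on any platform.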


Cheers,
Hari


On Friday, May 31, 2013 at 3:51 PM, Roshan Naik wrote:

> Would it make sense for default config setting for the auto-deletion to be
> set to 'false' then ?


Re: File Channel issue - recovering from BadCheckpoint exception

Posted by Roshan Naik <ro...@hortonworks.com>.
Would it make sense for the default config setting for the auto-deletion to
be set to 'false' then?


On Fri, May 31, 2013 at 3:16 PM, Hari Shreedharan <hshreedharan@cloudera.com
> wrote:

> For now, how about making the auto-deletion configurable? If it is
> configured not to delete, then don't even try to startup the channel. This
> will bring in the pre-1.3.0 behavior where the channel's recovery is
> manual? I suspect you are going to hit many more issues when you enable
> dual checkpoints - and fixing that is going to be non-trivial.
>
> Cheers,
> Hari

Re: File Channel issue - recovering from BadCheckpoint exception

Posted by Hari Shreedharan <hs...@cloudera.com>.
For now, how about making the auto-deletion configurable? If it is configured not to delete, then don't even try to start up the channel. This will bring back the pre-1.3.0 behavior, where the channel's recovery is manual. I suspect you are going to hit many more issues when you enable dual checkpoints, and fixing that is going to be non-trivial.

Cheers,
Hari


On Friday, May 31, 2013 at 2:53 PM, Roshan Naik wrote:

> In EventQueueBackingStoreFileV3 constructor, if it detects that the
> checkpoint and meta files have differing logWriteOrderIds, it throws a
> BadCheckpointException. Controls goes back to the exception handler in
> Log.replay() which attempts to delete all the files in checkpoint directory
> and start fresh. The same file names are reused when starting fresh.
> 
> Unfortunately this does not work on Windows since the deletion of
> the checkpoint file in the checkpointDir fails. The failure is due to the
> fact that the checkpoint file is memory mapped. Unless it is unmapped the
> deletion will not succeed... and unfortunately Java does not have unmap
> support. Windows does not permit deletion (or renaming) of files in use.
> 
> The obvious thought i am having is that when starting fresh we delete
> whatever we can and invent a new file name for the ones we cant (i think
> for checkpoint file only)
> 
> thoughts ?
> 
> -roshan