Posted to common-dev@hadoop.apache.org by "dhruba borthakur (JIRA)" <ji...@apache.org> on 2008/03/27 23:34:25 UTC

[jira] Created: (HADOOP-3113) Provide a configurable way for DFSOutputStream.flush() to flush data to real block file on DataNode.

Provide a configurable way for DFSOutputStream.flush() to flush data to real block file on DataNode.
---------------------------------------------------------------------------------------------------

                 Key: HADOOP-3113
                 URL: https://issues.apache.org/jira/browse/HADOOP-3113
             Project: Hadoop Core
          Issue Type: Bug
          Components: dfs
            Reporter: dhruba borthakur


DFSOutputStream has a method called flush() that persists block locations on the namenode and sends all outstanding data to all datanodes in the pipeline. However, this data goes to a tmp file on the datanode(s). When the block is closed, the tmp file is renamed to be the real block file. If the datanode(s) die before the block is complete, the entire block is lost. This behaviour will be fixed in HADOOP-1700.

However, in the short term, a configuration parameter can be used to allow datanodes to write to the real block file directly, thereby avoiding writing to the tmp file. This means that data that is flushed successfully by a client does not get lost even if the datanode(s) or the client dies.

The namenode already has code to pick the largest replica (if multiple datanodes have different sizes of this block). Also, the namenode has code to not trigger a replication request if the file is still being written to.

The only caveat that I can think of is that the block report periodicity should be much, much smaller than the lease timeout period. A block report adds the being-written-to blocks to the blocksMap, thereby avoiding any cleanup that lease expiry processing might otherwise have done.

Not all requirements specified by HADOOP-1700 are supported by this approach, but it could still be helpful (in the short term) for a wide range of applications.
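
For illustration, here is a minimal sketch of the client-side pattern this issue is concerned with, assuming a running HDFS cluster; the path, record contents, and class name are hypothetical and the code is not taken from any attached patch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FlushExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // hypothetical path
        FSDataOutputStream out = fs.create(new Path("/logs/redo.log"));
        out.write("record-1\n".getBytes("UTF-8"));
        // flush() pushes outstanding bytes to every datanode in the pipeline and
        // persists the block locations on the namenode; without this change the bytes
        // still live only in a tmp file on each datanode until the block is closed.
        out.flush();
        // if the client or a datanode dies here, the flushed bytes should not be lost
        out.close();
      }
    }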





-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3113) DFSOutputStream.flush() should flush data to real block file on DataNode.

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-3113:
-------------------------------------

    Attachment: tmpFile.patch

Incorporated Nicholas' code review comments.



[jira] Commented: (HADOOP-3113) DFSOutputStream.flush() should flush data to real block file on DataNode.

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602125#action_12602125 ] 

Hadoop QA commented on HADOOP-3113:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12383325/tmpFile.patch
  against trunk revision 662913.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2557/console

This message is automatically generated.



[jira] Updated: (HADOOP-3113) DFSOutputStream.flush() should flush data to real block file on DataNode.

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-3113:
-------------------------------------

    Summary: DFSOutputStream.flush() should flush data to real block file on DataNode.  (was: Provide a configurable way for DFSOutputStream.flush() to flush data to real block file on DataNode.)



[jira] Updated: (HADOOP-3113) DFSOutputStream.flush() should flush data to real block file on DataNode.

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-3113:
-------------------------------------

    Attachment: tmpFile.patch

This patch moves all tmp files to the real data directory on a datanode restart.



[jira] Commented: (HADOOP-3113) DFSOutputStream.flush() should flush data to real block file on DataNode.

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602095#action_12602095 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3113:
------------------------------------------------

Patch looks good.  Some minor comments:

- In INodeFileUnderConstruction.setTargets(...), add "this.primaryNodeIndex = -1;"

- Add private to INodeFileUnderConstruction.targets.  Then, call pendingFile.setTargets(...) in FSNamesystem.getAdditionalBlock(...).
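
A rough, hypothetical illustration of what these two suggestions might look like; the field types are simplified and this is not the actual Hadoop source:

    class INodeFileUnderConstruction {
      private DatanodeDescriptor[] targets;     // suggestion 2: make the field private
      private int primaryNodeIndex = -1;

      void setTargets(DatanodeDescriptor[] targets) {
        this.targets = targets;
        this.primaryNodeIndex = -1;             // suggestion 1: reset the index whenever targets change
      }
    }

    // Suggestion 2 (continued): FSNamesystem.getAdditionalBlock(...) would then call
    // pendingFile.setTargets(...) instead of assigning the field directly.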




[jira] Updated: (HADOOP-3113) DFSOutputStream.flush() should flush data to real block file on DataNode.

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-3113:
-------------------------------------

    Attachment: noTmpFile.patch

This is the first version of the patch. When the client encounters an error while writing to a block, it invokes generation-stamp recovery on the primary datanode.



[jira] Updated: (HADOOP-3113) DFSOutputStream.flush() should flush data to real block file on DataNode.

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-3113:
-------------------------------------

    Attachment: tmpFile.patch

Fixed a unit test, TestInterDatanodeProtocol, that was trying to finalize a block even though the file was already closed.



[jira] Updated: (HADOOP-3113) DFSOutputStream.flush() should flush data to real block file on DataNode.

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-3113:
------------------------------------

    Release Note: Added a sync() method to FSDataOutputStream to really, really persist data in HDFS. InterDatanodeProtocol was changed to implement this feature.  (was: An application can invoke sync on the FSDataOutputStream to really, really persist data in HDFS! This is an incompatible change because it required changes to InterDatanodeProtocol.)
    Hadoop Flags: [Incompatible change, Reviewed]  (was: [Reviewed, Incompatible change])
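
A hedged usage sketch of the sync() call described in the release note; the helper method name, path, and record contents are hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncExample {
      static void persistRecord(FileSystem fs, byte[] record) throws IOException {
        // hypothetical path
        FSDataOutputStream out = fs.create(new Path("/tmp/example.log"));
        try {
          out.write(record);
          out.sync();  // really, really persist what has been written so far
        } finally {
          out.close();
        }
      }
    }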



[jira] Commented: (HADOOP-3113) Provide a configurable way for DFSOutputStream.flush() to flush data to real block file on DataNode.

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583413#action_12583413 ] 

dhruba borthakur commented on HADOOP-3113:
------------------------------------------

I can make this configurable per file, but maybe it makes more sense to make it applicable to the entire system. The reason is that datanodes do not know much about the name of the HDFS file that a block belongs to. Making this configurable "per file" would need a lot of protocol changes.

The default block report interval is 1 hour, and the default lease timeout is 1 hour too. This needs to be changed so that block reports are sent every 30 minutes.

If a client dies while writing to the last block of that file, that block is not yet part of the blocksMap in the namenode. (A block gets inserted into the blocksMap when a complete block is received by the datanode and it sends a blockReceived message to the namenode.) If the lease for this file on the namenode expires before the block report from the datanode arrives, then the namenode will erroneously think that no datanodes have a copy of that block. As part of lease recovery, the namenode will delete the last block of the file because it has no entry in the blocksMap. To prevent this from occurring, the block report periodicity should be set to 30 minutes.



[jira] Updated: (HADOOP-3113) Provide a configurable way for DFSOutputStream.flush() to flush data to real block file on DataNode.

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-3113:
-------------------------------------

    Attachment: noTmpFile.patch

This patch prevents the datanodes from creating temporary files for blocks that are being written to.

The configuration parameter dfs.blockreport.intervalMsec should be changed from a default value of 3600000 to 1800000.
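
A small sketch of applying the suggested value programmatically, assuming the datanode reads the parameter from the standard Configuration object; most deployments would set it in the site configuration file instead:

    import org.apache.hadoop.conf.Configuration;

    public class BlockReportInterval {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // 30 minutes (1800000 ms) instead of the default 60 minutes (3600000 ms),
        // so a block report always arrives well before the lease timeout can expire.
        conf.setLong("dfs.blockreport.intervalMsec", 30L * 60 * 1000);
        System.out.println(conf.getLong("dfs.blockreport.intervalMsec", 0));
      }
    }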




[jira] Commented: (HADOOP-3113) Provide a configurable way for DFSOutputStream.flush() to flush data to real block file on DataNode.

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583201#action_12583201 ] 

stack commented on HADOOP-3113:
-------------------------------

I volunteer to exercise any patch posted (Smile)

Can the configuration be specified on a per-file basis?  If not, this approach has less value since HBase is rarely the only user of an HDFS installation.

What's the default config for block report periodicity vs. lease timeout currently?  (Is this dfs.blockreport.intervalMsec -- 1 hr -- vs. which config?)  This config could not be done per file, right? It'd be a global config?

Thanks Dhruba



[jira] Commented: (HADOOP-3113) DFSOutputStream.flush() should flush data to real block file on DataNode.

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602222#action_12602222 ] 

Hadoop QA commented on HADOOP-3113:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12383353/tmpFile.patch
  against trunk revision 662976.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2566/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2566/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2566/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2566/console

This message is automatically generated.



[jira] Commented: (HADOOP-3113) DFSOutputStream.flush() should flush data to real block file on DataNode.

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602051#action_12602051 ] 

dhruba borthakur commented on HADOOP-3113:
------------------------------------------

The latest patch keeps blocks that are being written to in the tmp directory. However, when a block is finalized, it moves into the real block directory. Also, at datanode restart, all blocks from the tmp directory are moved to the real block directory. It is prudent to keep the blocks in the tmp directory while they are being updated, because some datanode-local processing (e.g. CRC validation) might need to occur when they are moved from the tmp dir to the real block dir (at datanode restart).
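
A simplified, hypothetical sketch of the restart step described above; directory names and the validation hook are placeholders, and the actual patch does this inside the datanode's dataset code:

    import java.io.File;

    public class PromoteTmpBlocks {
      static void promote(File tmpDir, File dataDir) {
        File[] pending = tmpDir.listFiles();
        if (pending == null) {
          return;
        }
        for (File blockFile : pending) {
          // placeholder for datanode-local checks such as CRC validation
          File target = new File(dataDir, blockFile.getName());
          if (!blockFile.renameTo(target)) {
            System.err.println("could not promote " + blockFile);
          }
        }
      }
    }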



[jira] Updated: (HADOOP-3113) DFSOutputStream.flush() should flush data to real block file on DataNode.

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3113:
-------------------------------------------

    Hadoop Flags: [Reviewed]

+1 code looks good



[jira] Commented: (HADOOP-3113) Provide a configurable way for DFSOutputStream.flush() to flush data to real block file on DataNode.

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586262#action_12586262 ] 

dhruba borthakur commented on HADOOP-3113:
------------------------------------------

Hi Jim, thanks for your comments. I am still trying to figure out a way to address your earlier requirement: if the writer dies, how to make the file available for writing without waiting for the next block report. Let me see if I can come up with something to address this issue.



[jira] Updated: (HADOOP-3113) DFSOutputStream.flush() should flush data to real block file on DataNode.

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-3113:
-------------------------------------

    Attachment: tmpFile.patch



[jira] Commented: (HADOOP-3113) DFSOutputStream.flush() should flush data to real block file on DataNode.

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602459#action_12602459 ] 

stack commented on HADOOP-3113:
-------------------------------

> An application can invoke sync on the FSDataOutputStream to really, really persist data in HDFS!

Hot dog!



[jira] Updated: (HADOOP-3113) DFSOutputStream.flush() should flush data to real block file on DataNode.

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-3113:
-------------------------------------

    Status: Open  (was: Patch Available)



[jira] Commented: (HADOOP-3113) Provide a configurable way for DFSOutputStream.flush() to flush data to real block file on DataNode.

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583460#action_12583460 ] 

Jim Kellerman commented on HADOOP-3113:
---------------------------------------

> dhruba borthakur - 29/Mar/08 11:03 PM
> I can make the configurable to be per file, but maybe it makes more sense to make it applicable to the
> entire system. The reason being that datanodes do not know much about the name of the HDFS file
> that a block belongs to. To make this configurable "per file" would need lots of protocol change.

I don't think it needs to be per file. Aside from our redo log, other files are written and then immediately
closed and re-opened for read.

> If a client dies while writing to the last block of that file, that block is not yet part of the blocksmap in the
> namenode. (A block gets inserted in the blocksmap when a complete block is received by the datanode
> and it sends a blockReceived message to the namenode). If the lease for this file on the namenode
> expires before the block report from the datanode arrives, then the namenode will erroneously think
> that no datanodes have a copy of that block. As part of lease recovery, the namenode will delete the
> last block of the file because it has no entry in the blocksMap. To prevent this from occuring, the block
> report periodicity should be set to 30 minutes.

I think this is ok, but let me give a scenario to verify that my understanding is correct.

We open our redo log and flush it either every N seconds or after M records have been written.
If the process writing the log crashes, we will notice much sooner than the file lease timeout.
At that point another process should be able to open the file for read, and all flushed data
will be visible; unflushed data will not. Since the amount of unflushed data should be small,
the amount of data lost should be minimal. Once the redo log has been read and processed,
the file will be deleted by the process reading the file.

If this is how this patch works, +1.
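
A hedged sketch of the redo-log writer in the scenario above; the class name, flush policy, and threshold are hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;

    public class RedoLogWriter {
      private final FSDataOutputStream out;
      private final int maxRecords;   // flush after this many appended records
      private int pending = 0;

      RedoLogWriter(FSDataOutputStream out, int maxRecords) {
        this.out = out;
        this.maxRecords = maxRecords;
      }

      synchronized void append(byte[] record) throws IOException {
        out.write(record);
        if (++pending >= maxRecords) {
          out.flush();  // flushed records should survive a writer crash under this change
          pending = 0;
        }
      }
    }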



[jira] Updated: (HADOOP-3113) DFSOutputStream.flush() should flush data to real block file on DataNode.

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-3113:
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.18.0
     Release Note: An application can invoke sync on the FSDataOutputStream to really, really persist data in HDFS! This is an incompatible change because it required changes to InterDatanodeProtocol.
     Hadoop Flags: [Incompatible change, Reviewed]  (was: [Reviewed])
           Status: Resolved  (was: Patch Available)

I just committed this.
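
For reference, the call that the release note refers to looks roughly like this (a minimal sketch, assuming the 0.18 FSDataOutputStream exposes sync(); the path and payload are made up):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncExample {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path("/tmp/synced.dat"));
    out.write("must survive a crash\n".getBytes("UTF-8"));
    // Persist everything written so far to the datanodes' block files, so a
    // crash of the client or a datanode does not lose it.
    out.sync();
    out.close();
  }
}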

> DFSOututStream.flush() should flush data to real block file on DataNode.
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-3113
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3113
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>             Fix For: 0.18.0
>
>         Attachments: noTmpFile.patch, noTmpFile.patch, tmpFile.patch, tmpFile.patch, tmpFile.patch, tmpFile.patch


[jira] Assigned: (HADOOP-3113) Provide a configurable way for DFSOututStream.flush() to flush data to real block file on DataNode.

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur reassigned HADOOP-3113:
----------------------------------------

    Assignee: dhruba borthakur

> Provide a configurable way for DFSOututStream.flush() to flush data to real block file on DataNode.
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3113
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3113
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur


[jira] Updated: (HADOOP-3113) DFSOututStream.flush() should flush data to real block file on DataNode.

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-3113:
-------------------------------------

    Status: Patch Available  (was: Open)

> DFSOututStream.flush() should flush data to real block file on DataNode.
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-3113
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3113
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: noTmpFile.patch, noTmpFile.patch, tmpFile.patch, tmpFile.patch, tmpFile.patch


[jira] Commented: (HADOOP-3113) Provide a configurable way for DFSOututStream.flush() to flush data to real block file on DataNode.

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12584353#action_12584353 ] 

stack commented on HADOOP-3113:
-------------------------------

> I can make this configurable per file, but maybe it makes more sense to make it applicable to the entire system. 

When the setting is not per file:

1. HBase is often the new kid on the block and but one of many users of an HDFS install.  HBase installers may find it difficult to convince HDFS admins that they need to make the change.
2. If it is per-file, HBase can manage the configuration itself.  Otherwise it's a two-step process, and HBase installers may plain forget.
3. Minor point: HBase needs append only for its Write-Ahead Log, nowhere else.

If it's an 'entire system' setting, will it require an HDFS restart to take effect?

Sounds like we should set the blocksize for our Write-Ahead Log to be a good deal smaller than the default.
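
For concreteness, creating the Write-Ahead Log with a smaller block size (and, if the setting ends up being honored per client rather than cluster-wide, enabling the new behaviour at the same time) could look roughly like the sketch below. The dfs.datanode.skipTmpFile name comes from the attached patch; treating it as a client-side, per-file knob is exactly the open question above, so this is illustrative only.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WalCreator {
  public static FSDataOutputStream createWal(Path walPath) throws IOException {
    Configuration conf = new Configuration();
    // Flag name taken from the patch under review; setting it here only
    // matters if the datanode behaviour can be driven from the client side.
    conf.setBoolean("dfs.datanode.skipTmpFile", true);
    FileSystem fs = FileSystem.get(conf);
    // A block size well below the 64MB default, so the namenode learns
    // about completed WAL blocks sooner.
    long blockSize = 8 * 1024 * 1024L;
    return fs.create(walPath, true,
        conf.getInt("io.file.buffer.size", 4096),
        fs.getDefaultReplication(), blockSize);
  }
}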

I took a quick look at the patch.

The hadoop-default.xml entry description is all on one line; you might want to break it up.  Also, is there a downside to setting the dfs.datanode.skipTmpFile flag? (Reading the description, I'm thinking there must be; otherwise why bother making it configurable at all?)

Otherwise, patch looks good to me. 

Do you have a suggestion for a test I might run to exercise this new functionality?

Thanks Dhruba



> Provide a configurable way for DFSOututStream.flush() to flush data to real block file on DataNode.
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3113
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3113
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: noTmpFile.patch


[jira] Commented: (HADOOP-3113) DFSOututStream.flush() should flush data to real block file on DataNode.

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592905#action_12592905 ] 

dhruba borthakur commented on HADOOP-3113:
------------------------------------------

There isn't any shortcut for this one. We need lease recovery (HADOOP-3310) as a prerequisite.

> DFSOututStream.flush() should flush data to real block file on DataNode.
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-3113
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3113
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: noTmpFile.patch


[jira] Updated: (HADOOP-3113) DFSOututStream.flush() should flush data to real block file on DataNode.

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-3113:
-------------------------------------

    Status: Patch Available  (was: Open)

> DFSOututStream.flush() should flush data to real block file on DataNode.
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-3113
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3113
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: noTmpFile.patch, noTmpFile.patch, tmpFile.patch, tmpFile.patch, tmpFile.patch, tmpFile.patch


[jira] Issue Comment Edited: (HADOOP-3113) Provide a configurable way for DFSOututStream.flush() to flush data to real block file on DataNode.

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586262#action_12586262 ] 

dhruba edited comment on HADOOP-3113 at 4/7/08 12:58 AM:
-------------------------------------------------------------------

Hi Stack, thanks for your comments. I am still trying to figure out a way to address Jim's earlier requirement: if the writer dies, how to make the file available for writing without waiting for the next block report. Let me see if I can come up with something to address this issue.

      was (Author: dhruba):
    Hi Jim, thanks for your comments. I am still trying to figure out a way to address Jim's earlier requirement.... if the writer dies, how to make the file available for writing without waiting for the next block report. Let me see if I can come up with something to address this issue.
  
> Provide a configurable way for DFSOututStream.flush() to flush data to real block file on DataNode.
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3113
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3113
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: noTmpFile.patch


[jira] Commented: (HADOOP-3113) Provide a configurable way for DFSOututStream.flush() to flush data to real block file on DataNode.

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583752#action_12583752 ] 

dhruba borthakur commented on HADOOP-3113:
------------------------------------------

Hi Jim, there is one issue with the scenario you described. The patch, as it currently stands, will not behave the way you want. If the process writing the log dies, the last block is not fully written to the datanode(s). This means that the namenode does not (yet) know the block locations of the last block; they become available to the namenode only at the next block report from the datanode(s). If you happen to reopen the file before the next block report arrives, you will not see the last block. Let me see if I can come up with something for this one.
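
To make the window concrete, here is an illustrative check of what a reader would observe (the path is made up; getFileStatus() reports only what the namenode currently knows, which is the whole point of the caveat):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LastBlockCheck {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    Path log = new Path("/logs/redo.log");   // hypothetical path

    // Length as known to the namenode. Until the datanodes' next block
    // report arrives, this excludes the crashed writer's last block, even
    // though its bytes already sit in the real block files on disk.
    FileStatus status = fs.getFileStatus(log);
    System.out.println("visible length = " + status.getLen());
  }
}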

> Provide a configurable way for DFSOututStream.flush() to flush data to real block file on DataNode.
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3113
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3113
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: noTmpFile.patch