You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-issues@hadoop.apache.org by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org> on 2009/10/10 00:32:31 UTC

[jira] Created: (HADOOP-6307) Support reading on un-closed SequenceFile

Support reading on un-closed SequenceFile
-----------------------------------------

                 Key: HADOOP-6307
                 URL: https://issues.apache.org/jira/browse/HADOOP-6307
             Project: Hadoop Common
          Issue Type: Improvement
          Components: io
            Reporter: Tsz Wo (Nicholas), SZE


When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen().  However, fs.getFileStatus(file).getLen() does not return the hflushed length for un-closed file since the Namenode does not know the hflushed length.  DFSClient have to ask a datanode for the length last block which is being written; see also HDFS-570.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6307) Support reading on un-closed SequenceFile

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-6307:
-------------------------------------------

    Attachment: c6307_20091130.patch

c6307_20091130.patch: adds a new constructor with FSDataInputStream and length so that if the file is already opened, it won't be opened again inside the SequenceFile.Reader constructor.

> Support reading on un-closed SequenceFile
> -----------------------------------------
>
>                 Key: HADOOP-6307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6307
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tsz Wo (Nicholas), SZE
>         Attachments: c6307_20091124.patch, c6307_20091130.patch
>
>
> When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen().  However, fs.getFileStatus(file).getLen() does not return the hflushed length for un-closed file since the Namenode does not know the hflushed length.  DFSClient have to ask a datanode for the length last block which is being written; see also HDFS-570.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6307) Support reading on un-closed SequenceFile

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784494#action_12784494 ] 

Arun C Murthy commented on HADOOP-6307:
---------------------------------------

+1

> Support reading on un-closed SequenceFile
> -----------------------------------------
>
>                 Key: HADOOP-6307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6307
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>         Attachments: c6307_20091124.patch, c6307_20091130.patch, c6307_20091201.patch
>
>
> When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen().  However, fs.getFileStatus(file).getLen() does not return the hflushed length for un-closed file since the Namenode does not know the hflushed length.  DFSClient have to ask a datanode for the length last block which is being written; see also HDFS-570.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6307) Support reading on un-closed SequenceFile

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780335#action_12780335 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-6307:
------------------------------------------------

SequenceFile.Reader actually do not need the file length.  It compares the current position with the file length to check eof.  Instead, we may try-catch EOFException.

> Support reading on un-closed SequenceFile
> -----------------------------------------
>
>                 Key: HADOOP-6307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6307
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tsz Wo (Nicholas), SZE
>
> When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen().  However, fs.getFileStatus(file).getLen() does not return the hflushed length for un-closed file since the Namenode does not know the hflushed length.  DFSClient have to ask a datanode for the length last block which is being written; see also HDFS-570.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6307) Support reading on un-closed SequenceFile

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764276#action_12764276 ] 

Hong Tang commented on HADOOP-6307:
-----------------------------------

Not sure why this issue only hits SequenceFile. The problem applies equally to TFile (although this was pushed to the caller).

I have previously asked around why we cannot do "seek(Long.MAX_VALUE); return getPos();" to determine the length of the file, and I was told that you could actually seek beyond the end of the file. TRUE or FALSE?

Another question is why we cannot support seek like POSIX does, where we can seek in relation to the {begin,current,end} position of the file?

> Support reading on un-closed SequenceFile
> -----------------------------------------
>
>                 Key: HADOOP-6307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6307
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tsz Wo (Nicholas), SZE
>
> When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen().  However, fs.getFileStatus(file).getLen() does not return the hflushed length for un-closed file since the Namenode does not know the hflushed length.  DFSClient have to ask a datanode for the length last block which is being written; see also HDFS-570.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6307) Support reading on un-closed SequenceFile

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783943#action_12783943 ] 

Hadoop QA commented on HADOOP-6307:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12426467/c6307_20091130.patch
  against trunk revision 885534.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/153/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/153/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/153/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/153/console

This message is automatically generated.

> Support reading on un-closed SequenceFile
> -----------------------------------------
>
>                 Key: HADOOP-6307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6307
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>         Attachments: c6307_20091124.patch, c6307_20091130.patch
>
>
> When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen().  However, fs.getFileStatus(file).getLen() does not return the hflushed length for un-closed file since the Namenode does not know the hflushed length.  DFSClient have to ask a datanode for the length last block which is being written; see also HDFS-570.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6307) Support reading on un-closed SequenceFile

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784768#action_12784768 ] 

Hudson commented on HADOOP-6307:
--------------------------------

Integrated in Hadoop-Common-trunk #175 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/175/])
    . Add a new SequenceFile.Reader constructor in order to support reading on un-closed file.


> Support reading on un-closed SequenceFile
> -----------------------------------------
>
>                 Key: HADOOP-6307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6307
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: c6307_20091124.patch, c6307_20091130.patch, c6307_20091201.patch
>
>
> When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen().  However, fs.getFileStatus(file).getLen() does not return the hflushed length for un-closed file since the Namenode does not know the hflushed length.  DFSClient have to ask a datanode for the length last block which is being written; see also HDFS-570.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6307) Support reading on un-closed SequenceFile

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784534#action_12784534 ] 

Hudson commented on HADOOP-6307:
--------------------------------

Integrated in Hadoop-Common-trunk-Commit #95 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/95/])
    . Add a new SequenceFile.Reader constructor in order to support reading on un-closed file.


> Support reading on un-closed SequenceFile
> -----------------------------------------
>
>                 Key: HADOOP-6307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6307
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: c6307_20091124.patch, c6307_20091130.patch, c6307_20091201.patch
>
>
> When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen().  However, fs.getFileStatus(file).getLen() does not return the hflushed length for un-closed file since the Namenode does not know the hflushed length.  DFSClient have to ask a datanode for the length last block which is being written; see also HDFS-570.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6307) Support reading on un-closed SequenceFile

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-6307:
-------------------------------------------

    Attachment: c6307_20091124.patch

c6307_20091124.patch: introduces a new public constructor.

> Support reading on un-closed SequenceFile
> -----------------------------------------
>
>                 Key: HADOOP-6307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6307
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tsz Wo (Nicholas), SZE
>         Attachments: c6307_20091124.patch
>
>
> When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen().  However, fs.getFileStatus(file).getLen() does not return the hflushed length for un-closed file since the Namenode does not know the hflushed length.  DFSClient have to ask a datanode for the length last block which is being written; see also HDFS-570.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6307) Support reading on un-closed SequenceFile

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-6307:
-------------------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I have committed this to 0.21 and above.

> Support reading on un-closed SequenceFile
> -----------------------------------------
>
>                 Key: HADOOP-6307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6307
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: c6307_20091124.patch, c6307_20091130.patch, c6307_20091201.patch
>
>
> When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen().  However, fs.getFileStatus(file).getLen() does not return the hflushed length for un-closed file since the Namenode does not know the hflushed length.  DFSClient have to ask a datanode for the length last block which is being written; see also HDFS-570.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6307) Support reading on un-closed SequenceFile

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-6307:
-------------------------------------------

    Attachment: c6307_20091201.patch

Thanks, Arun!

c6307_20091201.patch: updated with the review comment.

> Support reading on un-closed SequenceFile
> -----------------------------------------
>
>                 Key: HADOOP-6307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6307
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>         Attachments: c6307_20091124.patch, c6307_20091130.patch, c6307_20091201.patch
>
>
> When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen().  However, fs.getFileStatus(file).getLen() does not return the hflushed length for un-closed file since the Namenode does not know the hflushed length.  DFSClient have to ask a datanode for the length last block which is being written; see also HDFS-570.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6307) Support reading on un-closed SequenceFile

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-6307:
-------------------------------------------

    Assignee: Tsz Wo (Nicholas), SZE
      Status: Patch Available  (was: Open)

> Support reading on un-closed SequenceFile
> -----------------------------------------
>
>                 Key: HADOOP-6307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6307
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>         Attachments: c6307_20091124.patch, c6307_20091130.patch
>
>
> When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen().  However, fs.getFileStatus(file).getLen() does not return the hflushed length for un-closed file since the Namenode does not know the hflushed length.  DFSClient have to ask a datanode for the length last block which is being written; see also HDFS-570.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6307) Support reading on un-closed SequenceFile

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-6307:
----------------------------------

    Status: Open  (was: Patch Available)

Only nit:

+    public Reader(Path file, FSDataInputStream in, int buffersize,
+        long start, long length, Configuration conf) throws IOException

should be

+    public Reader(FSDataInputStream in, int buffersize,
+        long start, long length, Configuration conf) throws IOException

Having the 'file' there is useless since we do not use it in the constructor, people might get confused about usage, or worse assume that we will open the file again. The proposed alternative will force people to think in the right direction i.e. they open the file and hand us the input-stream and the start/length.

> Support reading on un-closed SequenceFile
> -----------------------------------------
>
>                 Key: HADOOP-6307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6307
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>         Attachments: c6307_20091124.patch, c6307_20091130.patch
>
>
> When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen().  However, fs.getFileStatus(file).getLen() does not return the hflushed length for un-closed file since the Namenode does not know the hflushed length.  DFSClient have to ask a datanode for the length last block which is being written; see also HDFS-570.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6307) Support reading on un-closed SequenceFile

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782220#action_12782220 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-6307:
------------------------------------------------

> SequenceFile.Reader actually do not need the file length. ...

Thanks Chris and Arun pointing out that the file length (i.e. SequenceFile.Reader.end) cannot be removed.  Otherwise, SequenceFile.Sorter won't work.

I guess we have to introduce a new public constructor, which takes length as a parameter.  So, that user applications could possibly pass the correct length when creating a new Reader.

> Support reading on un-closed SequenceFile
> -----------------------------------------
>
>                 Key: HADOOP-6307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6307
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tsz Wo (Nicholas), SZE
>
> When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen().  However, fs.getFileStatus(file).getLen() does not return the hflushed length for un-closed file since the Namenode does not know the hflushed length.  DFSClient have to ask a datanode for the length last block which is being written; see also HDFS-570.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6307) Support reading on un-closed SequenceFile

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-6307:
-------------------------------------------

    Fix Version/s: 0.22.0
                   0.21.0
     Hadoop Flags: [Reviewed]
           Status: Patch Available  (was: Open)

> Support reading on un-closed SequenceFile
> -----------------------------------------
>
>                 Key: HADOOP-6307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6307
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: c6307_20091124.patch, c6307_20091130.patch, c6307_20091201.patch
>
>
> When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen().  However, fs.getFileStatus(file).getLen() does not return the hflushed length for un-closed file since the Namenode does not know the hflushed length.  DFSClient have to ask a datanode for the length last block which is being written; see also HDFS-570.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6307) Support reading on un-closed SequenceFile

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784512#action_12784512 ] 

Hadoop QA commented on HADOOP-6307:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12426592/c6307_20091201.patch
  against trunk revision 885888.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/157/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/157/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/157/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/157/console

This message is automatically generated.

> Support reading on un-closed SequenceFile
> -----------------------------------------
>
>                 Key: HADOOP-6307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6307
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: c6307_20091124.patch, c6307_20091130.patch, c6307_20091201.patch
>
>
> When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen().  However, fs.getFileStatus(file).getLen() does not return the hflushed length for un-closed file since the Namenode does not know the hflushed length.  DFSClient have to ask a datanode for the length last block which is being written; see also HDFS-570.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6307) Support reading on un-closed SequenceFile

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764758#action_12764758 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-6307:
------------------------------------------------

> Not sure why this issue only hits SequenceFile. The problem applies equally to TFile (although this was pushed to the caller).

This problem applies to any implementation which gets the un-closed file length by calling fs.getFileStatus(file).getLen().  (By "problem", I mean that the reader may not see all hflushed bytes.  It sees some part of the file.  This is the same behavior before append.)  I did not check TFile before.  TFile does not have this problem if the caller manage to get the correct length and pass it to the TFile.Reader constructor.

> Support reading on un-closed SequenceFile
> -----------------------------------------
>
>                 Key: HADOOP-6307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6307
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tsz Wo (Nicholas), SZE
>
> When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen().  However, fs.getFileStatus(file).getLen() does not return the hflushed length for un-closed file since the Namenode does not know the hflushed length.  DFSClient have to ask a datanode for the length last block which is being written; see also HDFS-570.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6307) Support reading on un-closed SequenceFile

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764752#action_12764752 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-6307:
------------------------------------------------

> Isn't it true that fs.getFileStatus(file).getLen(). requires read access on the parent directory whereas fs.open(file).available() required read access on the file itself?

Actually, fs.getFileStatus(file).getLen() requires only "x" access on the parent directory but not "r".

SequenceFile.Reader opens the file for read.  So we must have (and already have) "InputStream in =  fs.open(file);" in the codes.  My previous suggestion was to call "in.available()" to get the number of available bytes.  If we replace "fs.open(file).available()" by "in.available()", it indeed reduces an RPC to the NameNode and does not introduce any additional overhead.  (However, it currently does not work because of HDFS-691.)

FYR, copied the related code segment which includes all SequenceFile.Reader constructors below.
{code}
//line 1438, SequenceFile.java
    /** Open the named file. */
    public Reader(FileSystem fs, Path file, Configuration conf)
      throws IOException {
      this(fs, file, conf.getInt("io.file.buffer.size", 4096), conf, false);
    }

    private Reader(FileSystem fs, Path file, int bufferSize,
                   Configuration conf, boolean tempReader) throws IOException {
      this(fs, file, bufferSize, 0, fs.getFileStatus(file).getLen(), conf, tempReader);
    }
    
    private Reader(FileSystem fs, Path file, int bufferSize, long start,
                   long length, Configuration conf, boolean tempReader) 
    throws IOException {
      this.file = file;
      this.in = openFile(fs, file, bufferSize, length);
      this.conf = conf;
      boolean succeeded = false;
      try {
        seek(start);
        this.end = in.getPos() + length;
        init(tempReader);
        succeeded = true;
      } finally {
        if (!succeeded) {
          IOUtils.cleanup(LOG, in);
        }
      }
    }

    /**
     * Override this method to specialize the type of
     * {@link FSDataInputStream} returned.
     */
    protected FSDataInputStream openFile(FileSystem fs, Path file,
        int bufferSize, long length) throws IOException {
      return fs.open(file, bufferSize);
    }
{code}

> Support reading on un-closed SequenceFile
> -----------------------------------------
>
>                 Key: HADOOP-6307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6307
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tsz Wo (Nicholas), SZE
>
> When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen().  However, fs.getFileStatus(file).getLen() does not return the hflushed length for un-closed file since the Namenode does not know the hflushed length.  DFSClient have to ask a datanode for the length last block which is being written; see also HDFS-570.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6307) Support reading on un-closed SequenceFile

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764444#action_12764444 ] 

dhruba borthakur commented on HADOOP-6307:
------------------------------------------

Isn't it true that fs.getFileStatus(file).getLen(). requires read access on the parent directory whereas fs.open(file).available() required read access on the file itself?

Many map-reduce programs use SequenceFiles to store data. And they do not need the facility to process files that are currently being written to. In this case, isn't the additional overhead of fetching block locations via fs.open(file)  kinda wasteful? 



> Support reading on un-closed SequenceFile
> -----------------------------------------
>
>                 Key: HADOOP-6307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6307
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Tsz Wo (Nicholas), SZE
>
> When a SequenceFile.Reader is constructed, it calls fs.getFileStatus(file).getLen().  However, fs.getFileStatus(file).getLen() does not return the hflushed length for un-closed file since the Namenode does not know the hflushed length.  DFSClient have to ask a datanode for the length last block which is being written; see also HDFS-570.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.