Posted to dev@accumulo.apache.org by "Ivan Bella (Created) (JIRA)" <ji...@apache.org> on 2012/02/18 18:45:59 UTC

[jira] [Created] (ACCUMULO-418) Make RFiles splittable

Make RFiles splittable
----------------------

                 Key: ACCUMULO-418
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-418
             Project: Accumulo
          Issue Type: New Feature
          Components: master, tserver
    Affects Versions: 1.3.5, 1.4.0, 1.5.0, 1.3.6, 1.4.1, 1.5.0-SNAPSHOT
         Environment: All
            Reporter: Ivan Bella
            Assignee: Eric Newton
             Fix For: 1.3.5, 1.4.0, 1.5.0, 1.3.6, 1.4.1, 1.5.0-SNAPSHOT


There are times when iterating over RFiles is useful in map-reduce jobs.  I know that RFiles can logically be split on block boundaries; however, there is currently no easy way to do this, as no RFile RecordReader or InputFormat is provided.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (ACCUMULO-418) Make RFiles splittable

Posted by "John Vines (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/ACCUMULO-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Vines updated ACCUMULO-418:
--------------------------------

    Affects Version/s:     (was: 1.4.1)
                           (was: 1.3.6)
                           (was: 1.5.0)
        Fix Version/s:     (was: 1.5.0-SNAPSHOT)
                           (was: 1.3.5)
                           (was: 1.4.0)

Similar yet different. Both tickets can be implemented independently, but they benefit from one another.  So you're right on, Ivan.
                
> Make RFiles splittable
> ----------------------
>
>                 Key: ACCUMULO-418
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-418
>             Project: Accumulo
>          Issue Type: New Feature
>          Components: master, tserver
>    Affects Versions: 1.3.5, 1.4.0, 1.5.0-SNAPSHOT
>         Environment: All
>            Reporter: Ivan Bella
>            Assignee: Eric Newton
>              Labels: RFile, hadoop, mapreduce
>             Fix For: 1.5.0, 1.3.6, 1.4.1
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> There are times when iterating over RFiles is useful in map-reduce jobs.  I know that RFiles can logically be split on block boundaries; however, there is currently no easy way to do this, as no RFile RecordReader or InputFormat is provided.

[jira] [Commented] (ACCUMULO-418) Make RFiles splittable

Posted by "Ivan Bella (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ACCUMULO-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211385#comment-13211385 ] 

Ivan Bella commented on ACCUMULO-418:
-------------------------------------

It was not clear to me whether this is fully covered by ACCUMULO-387 (https://issues.apache.org/jira/browse/ACCUMULO-387).
                
> Make RFiles splittable
> ----------------------
>
>                 Key: ACCUMULO-418
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-418
>             Project: Accumulo
>          Issue Type: New Feature
>          Components: master, tserver
>    Affects Versions: 1.3.5, 1.4.0, 1.5.0, 1.3.6, 1.4.1, 1.5.0-SNAPSHOT
>         Environment: All
>            Reporter: Ivan Bella
>            Assignee: Eric Newton
>              Labels: RFile, hadoop, mapreduce
>             Fix For: 1.3.5, 1.4.0, 1.5.0, 1.3.6, 1.4.1, 1.5.0-SNAPSHOT
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> There are times when iterating over RFiles is useful in map-reduce jobs.  I know that RFiles can logically be split on block boundaries; however, there is currently no easy way to do this, as no RFile RecordReader or InputFormat is provided.

[jira] [Updated] (ACCUMULO-418) Make RFiles splittable

Posted by "Keith Turner (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/ACCUMULO-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Turner updated ACCUMULO-418:
----------------------------------

    Fix Version/s:     (was: 1.4.1)
                       (was: 1.3.6)
         Assignee: Keith Turner  (was: Eric Newton)

I suspect this will require modifications to the file format to make it more of a local operation.  With the current file format, the index (at the end of the file and possibly in a remote block) must be read.  
                
> Make RFiles splittable
> ----------------------
>
>                 Key: ACCUMULO-418
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-418
>             Project: Accumulo
>          Issue Type: New Feature
>          Components: master, tserver
>    Affects Versions: 1.3.5, 1.4.0, 1.5.0-SNAPSHOT
>         Environment: All
>            Reporter: Ivan Bella
>            Assignee: Keith Turner
>              Labels: RFile, hadoop, mapreduce
>             Fix For: 1.5.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> There are times when iterating over RFiles is useful in map-reduce jobs.  I know that RFiles can logically be split on block boundaries; however, there is currently no easy way to do this, as no RFile RecordReader or InputFormat is provided.

[jira] [Commented] (ACCUMULO-418) Make RFiles splittable

Posted by "Keith Turner (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ACCUMULO-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212583#comment-13212583 ] 

Keith Turner commented on ACCUMULO-418:
---------------------------------------

Would you like to see this in 1.4.x? If so, then we could possibly create a reader that uses the current file format and reads the index at the end of the file.  For 1.5 we can change the file format to avoid reading the index.
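
As a rough illustration of the index-based approach for the current file format, the sketch below groups data-block start offsets by the HDFS block that contains each block's first byte. It assumes the offsets have already been read from the index at the end of the file; `plan_splits`, the offsets, and the sizes are hypothetical and are not part of any Accumulo or Hadoop API.

```python
from bisect import bisect_left


def plan_splits(data_block_offsets, file_length, hdfs_block_size):
    """Group data-block offsets by the HDFS block holding their first byte.

    data_block_offsets is a hypothetical stand-in for what a reader would
    learn from the RFile index; nothing here calls Accumulo or Hadoop.
    Returns a list of (split_start, split_end, offsets_in_split) tuples.
    """
    offsets = sorted(data_block_offsets)
    splits = []
    for start in range(0, file_length, hdfs_block_size):
        end = min(start + hdfs_block_size, file_length)
        # Offsets in [start, end) belong to this split.
        lo = bisect_left(offsets, start)
        hi = bisect_left(offsets, end)
        if lo < hi:
            splits.append((start, end, offsets[lo:hi]))
    return splits


if __name__ == "__main__":
    # Made-up numbers: a 300-byte file with 100-byte HDFS blocks.
    for split in plan_splits([0, 40, 90, 130, 250], 300, 100):
        print(split)
```

A data block that starts in one HDFS block and ends in the next is assigned to the split where it starts, so each data block is read by exactly one task, at the cost of a short remote read for the tail.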
                
[jira] [Commented] (ACCUMULO-418) Make RFiles splittable

Posted by "Keith Turner (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ACCUMULO-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212677#comment-13212677 ] 

Keith Turner commented on ACCUMULO-418:
---------------------------------------

I took a look at how sequence file and sequence file input format handle this.  The writer puts a 128-bit sync marker between records in the file.  The sync marker is unique per file.  The reader reads the beginning of the file to get the sync marker and other info.  Then, for the HDFS block you are interested in, it starts scanning for the 128-bit sync marker.  Once it finds the marker, it keeps reading records until it passes the block boundary.  So sequence file does some non-local reads for the beginning of the file.
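
The scan described above can be sketched roughly as follows. This is a toy in-memory format, not the real sequence file layout: the header holds only the marker, records are length-prefixed, and a length of -1 is assumed to signal that a marker follows. The names `build_file` and `read_split` are made up for the illustration.

```python
import os
import struct

SYNC_SIZE = 16  # 128-bit sync marker


def build_file(records, marker, sync_interval=2):
    """Toy layout: marker header, then length-prefixed records, with
    (escape length -1, marker) written every sync_interval records."""
    buf = bytearray(marker)
    for i, rec in enumerate(records):
        if i > 0 and i % sync_interval == 0:
            buf += struct.pack(">i", -1) + marker
        buf += struct.pack(">i", len(rec)) + rec
    return bytes(buf)


def read_split(data, start, end):
    """Return the records that follow sync markers located in [start, end)."""
    marker = data[:SYNC_SIZE]        # the non-local read of the file header
    p = data.find(marker, start)     # scan forward for the first marker
    if p == -1 or p >= end:
        return []                    # no marker starts inside this split
    pos = p + SYNC_SIZE
    out = []
    while pos < len(data):
        (length,) = struct.unpack_from(">i", data, pos)
        if length == -1:             # escape: a sync marker follows
            if pos + 4 >= end:       # marker belongs to the next split
                break
            pos += 4 + SYNC_SIZE
            continue
        out.append(data[pos + 4:pos + 4 + length])
        pos += 4 + length
    return out


if __name__ == "__main__":
    recs = [b"rec%d" % i for i in range(10)]
    data = build_file(recs, os.urandom(SYNC_SIZE))
    for s in range(0, len(data), 25):
        print(s, read_split(data, s, min(s + 25, len(data))))
```

A split reads past its own end until it reaches the next marker, and the following split skips ahead to that same marker, so every record is read exactly once.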

In 1.4, with the multilevel RFile index, reading the index may involve more than reading the end of the file.  If the index is more than one level, then the nodes of the tree are sprinkled throughout the file.
                
[jira] [Commented] (ACCUMULO-418) Make RFiles splittable

Posted by "Ivan Bella (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ACCUMULO-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212946#comment-13212946 ] 

Ivan Bella commented on ACCUMULO-418:
-------------------------------------

I think waiting until 1.5 would be fine in this case.  The need is not urgent enough (at least for me) to require additional work to pound this into 1.4.
                