Posted to common-dev@hadoop.apache.org by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org> on 2008/03/02 07:46:50 UTC

[jira] Created: (HADOOP-2921) align map splits on sorted files with key boundaries

align map splits on sorted files with key boundaries
----------------------------------------------------

                 Key: HADOOP-2921
                 URL: https://issues.apache.org/jira/browse/HADOOP-2921
             Project: Hadoop Core
          Issue Type: New Feature
    Affects Versions: 0.16.0
            Reporter: Joydeep Sen Sarma


(this is something that we have implemented in the application layer - may be useful to have in hadoop itself).

long term log storage systems often keep data sorted (by some sort-key). future computations on such files can often benefit from this sort order. if the job requires grouping by the sort-key - then it should be possible to do reduction in the map stage itself.
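A minimal sketch of what "reduction in the map stage" could look like once equal sort keys are adjacent and never cut by a split. The class and method names are hypothetical, and the code uses plain Java collections rather than any Hadoop API. It counts records per key in a single pass and emits a result each time the key changes.

```java
import java.util.ArrayList;
import java.util.List;

final class SortedRunCounter {
    /**
     * Counts consecutive records per key. The result is only a true
     * per-key count when all records of a key are adjacent in the input.
     */
    static List<String> countRuns(List<String> sortedKeys) {
        List<String> out = new ArrayList<>();
        String current = null;
        long count = 0;
        for (String key : sortedKeys) {
            if (current != null && !key.equals(current)) {
                // Key changed: the previous group is complete, emit it.
                out.add(current + "=" + count);
                count = 0;
            }
            current = key;
            count++;
        }
        if (current != null) {
            out.add(current + "=" + count); // emit the final group
        }
        return out;
    }
}
```

No buffering of more than one group is needed, which is the benefit of the sort order; if a key's records were spread over two map tasks, each task would emit a partial count and a reduce stage would still be required.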

this is not natively supported by hadoop (except in the degenerate case of one file per map task) since a split boundary can fall in the middle of a run of records that share a sort-key. however, aligning the data read by the map task to sort-key boundaries is straightforward - and this would be a useful capability to have in hadoop.

the definition of the sort key should be left up to the application (it's not necessarily the key field in a Sequencefile) through a generic interface - but otherwise - the sequencefile and text file readers can use the extracted sort key to align map task data with key boundaries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-2921) align map splits on sorted files with key boundaries

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574693#action_12574693 ] 

Doug Cutting commented on HADOOP-2921:
--------------------------------------

We could implement this by adding an abstract method to SequenceFileRecordReader that's called when it is first opened and that could scan forward to a key boundary, right?  Then one could define a subclass of SequenceFileInputFormat that uses this RecordReader.  Similarly for TextInputFormat.
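One way to picture the hook suggested here, as a sketch only: `BoundaryAwareReader` and `TabKeyReader` are hypothetical stand-ins that read from an in-memory list, not the real SequenceFileRecordReader, and the choice of "text before the first tab" as the sort key is an assumption made for the example.

```java
import java.util.List;

/** Hypothetical stand-in for a record reader over an in-memory list (not the Hadoop class). */
abstract class BoundaryAwareReader {
    protected final List<String> records;
    protected int pos;

    BoundaryAwareReader(List<String> records, int start) {
        this.records = records;
        this.pos = start;
    }

    /** Called once before reading; a reader at offset 0 is already on a boundary. */
    final void open() {
        if (pos > 0 && pos < records.size()) {
            seekToKeyBoundary();
        }
    }

    /** Subclasses advance pos to the first record of the next key group. */
    protected abstract void seekToKeyBoundary();

    /** Returns the next record, or null at end of input. */
    final String next() {
        return pos < records.size() ? records.get(pos++) : null;
    }
}

/** Example subclass: the sort key is assumed to be the text before the first tab. */
final class TabKeyReader extends BoundaryAwareReader {
    TabKeyReader(List<String> records, int start) {
        super(records, start);
    }

    private static String keyOf(String record) {
        int tab = record.indexOf('\t');
        return tab < 0 ? record : record.substring(0, tab);
    }

    @Override
    protected void seekToKeyBoundary() {
        // Skip every record that shares the key of the record at the start position.
        String first = keyOf(records.get(pos));
        while (pos < records.size() && keyOf(records.get(pos)).equals(first)) {
            pos++;
        }
    }
}
```

The hook is invoked from an explicit `open()` rather than the constructor so that a subclass is fully initialized before its override runs.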

> the definition of the sort key should be left up to the application (it's not necessarily the key field in a Sequencefile)
[ ... ]
> we don't use the key at all - the sort field is embedded in the value itself.

Side note: wouldn't it make more sense to not use the value and to just sort on part of the key?  Then you could pass a Comparator to SequenceFile and the definition of the sort key is the same.  We already have a generic means for specifying sort keys.  I don't see the need for a new one.  Why do you prefer using values to keys?




[jira] Commented: (HADOOP-2921) align map splits on sorted files with key boundaries

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574246#action_12574246 ] 

Joydeep Sen Sarma commented on HADOOP-2921:
-------------------------------------------

as i mentioned in a different jira - we don't use the key at all - the sort field is embedded in the value itself. an interface like the partitioner interface - one that takes both the key and the value and returns an object - would do the job for us. (the reader can invoke the equals method on the returned objects to determine the boundaries.)
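A sketch of the kind of interface described above. `SortKeyExtractor` and `BoundaryDetector` are hypothetical names invented for illustration; nothing here is an existing Hadoop type. The extractor sees both key and value and returns an arbitrary object, and boundaries are found by comparing consecutive extracted objects with `equals`.

```java
import java.util.Objects;

/** Hypothetical interface: sees both key and value, returns the sort key as an object. */
interface SortKeyExtractor<K, V> {
    Object extract(K key, V value);
}

/** Tracks the previously extracted sort key and reports where a new group begins. */
final class BoundaryDetector<K, V> {
    private final SortKeyExtractor<K, V> extractor;
    private Object previous;
    private boolean started;

    BoundaryDetector(SortKeyExtractor<K, V> extractor) {
        this.extractor = extractor;
    }

    /** Returns true when this record starts a new sort-key group. */
    boolean startsNewGroup(K key, V value) {
        Object current = extractor.extract(key, value);
        // The first record always starts a group; afterwards rely on equals().
        boolean boundary = !started || !Objects.equals(previous, current);
        previous = current;
        started = true;
        return boundary;
    }
}
```

For rows stored entirely in the value, an extractor could be as small as `(k, v) -> v.split(",")[0]`, assuming (for the example only) comma-separated rows whose first field is the sort field.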

yeah - we shouldn't change the default semantics of the current reader - either have an option that alters the semantics or a new reader.

what about text files? we don't use them much directly (always embed in sequencefiles) - but i imagine other folks do and the same considerations can apply ..


[jira] Commented: (HADOOP-2921) align map splits on sorted files with key boundaries

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12575085#action_12575085 ] 

Joydeep Sen Sarma commented on HADOOP-2921:
-------------------------------------------

btw - this also becomes a stepping stone toward optimizing joins based on the sort column (turning off map sorts and aligning maps with sort boundaries allows the map-reduce to become a pure merge-join in the reducer). I left some thoughts on https://issues.apache.org/jira/browse/HADOOP-2085
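For readers unfamiliar with the term, a generic merge-join over two inputs that are already sorted by the join key can be sketched as below. This is a textbook illustration in plain Java with hypothetical names, not code from Hadoop or from the linked issue; rows are modeled as `{key, value}` string pairs.

```java
import java.util.ArrayList;
import java.util.List;

final class MergeJoin {
    /** Inner-joins two lists of {key, value} pairs that are both sorted by key. */
    static List<String> join(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) {
                i++;            // left key is smaller: no match possible, advance left
            } else if (cmp > 0) {
                j++;            // right key is smaller: advance right
            } else {
                String key = left.get(i)[0];
                int runStart = j;
                // Pair every left row of this key with every right row of this key.
                while (i < left.size() && left.get(i)[0].equals(key)) {
                    for (j = runStart; j < right.size() && right.get(j)[0].equals(key); j++) {
                        out.add(key + ":" + left.get(i)[1] + "," + right.get(j)[1]);
                    }
                    i++;
                }
                // j now sits just past the right-hand run for this key.
            }
        }
        return out;
    }
}
```

Neither input is re-sorted or hashed; each side is read once from front to back, which is why preserving the existing sort order matters.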





[jira] Commented: (HADOOP-2921) align map splits on sorted files with key boundaries

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574225#action_12574225 ] 

Runping Qi commented on HADOOP-2921:
------------------------------------

I see.

+1 to making the current SequenceFileRecordReader do the key boundary check,
or implementing a new record reader just for that.




[jira] Commented: (HADOOP-2921) align map splits on sorted files with key boundaries

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574220#action_12574220 ] 

Runping Qi commented on HADOOP-2921:
------------------------------------


Did you preserve key boundaries by subclassing SequenceFileInputFormat and overriding the getSplits() method?




[jira] Commented: (HADOOP-2921) align map splits on sorted files with key boundaries

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12575093#action_12575093 ] 

Doug Cutting commented on HADOOP-2921:
--------------------------------------

> where the value encodes a row. The sort field is embedded inside the row

What I'm suggesting is that, rather than a null key, you use a null value and put the row in the key.  Why doesn't that work for you?  Then you can use existing key-oriented tools for sorting.



[jira] Commented: (HADOOP-2921) align map splits on sorted files with key boundaries

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574245#action_12574245 ] 

Owen O'Malley commented on HADOOP-2921:
---------------------------------------

I don't think changing the semantics of the current sequence file record reader to do this is a good idea. In the degenerate case, you could end up with a lot of your maps having no inputs.

Joydeep, would a grouping comparator like the one we use to group the reduce inputs work here? I assume it is the case that you'd want to group on a subset of the fields in the keys, since that controls the sort.
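To illustrate the idea of grouping on a subset of the key fields, here is a sketch using plain `java.util.Comparator` as a stand-in for a Hadoop comparator. The key layout (`{userId, timestamp}` pairs) and all names are assumptions made for the example.

```java
import java.util.Comparator;
import java.util.List;

final class GroupingSketch {
    /** Full sort order over {userId, timestamp} keys: user first, then timestamp. */
    static final Comparator<String[]> SORT = (a, b) -> {
        int c = a[0].compareTo(b[0]);
        return c != 0 ? c : a[1].compareTo(b[1]);
    };

    /** Grouping order: looks only at the user id, so timestamps do not split a group. */
    static final Comparator<String[]> GROUP = (a, b) -> a[0].compareTo(b[0]);

    /** Counts groups in keys already sorted by SORT, using GROUP to find boundaries. */
    static int countGroups(List<String[]> sortedKeys) {
        int groups = 0;
        String[] previous = null;
        for (String[] key : sortedKeys) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                groups++;   // GROUP says the keys differ: a new group starts here
            }
            previous = key;
        }
        return groups;
    }
}
```

A reader aligning splits could treat any position where `GROUP` reports a difference as a legal split point, while the file itself stays ordered by the finer-grained `SORT`.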



[jira] Commented: (HADOOP-2921) align map splits on sorted files with key boundaries

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574223#action_12574223 ] 

Joydeep Sen Sarma commented on HADOOP-2921:
-------------------------------------------

no - didn't override getSplits(). i have an inputformat that opens sequencefile readers for two splits. one is the split handed down from the map task. the other is a split that contains the rest of the file (positioned after the map split).

we skip the first set of records in the map split, i.e. the leading records that share one sort key (unless starting at offset 0), and we process the first set of records in the next split. (same as how sequencefiles work with sync markers - using sort key boundaries as sync positions instead)
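The skip-then-read-ahead rule above can be sketched as follows. This is an illustration under simplifying assumptions, not the implementation described in the comment: records are addressed by list index rather than byte offset, sort keys are plain strings, and `SplitAligner` is a hypothetical name.

```java
import java.util.List;

final class SplitAligner {
    /**
     * Returns {from, to}: the half-open range of record indexes that the
     * reader for the nominal split [start, end) should actually process.
     */
    static int[] align(List<String> keys, int start, int end) {
        end = Math.min(end, keys.size());
        // Unless at offset 0, skip the leading run: the previous reader owns it.
        int from = (start == 0) ? 0 : runEnd(keys, start);
        if (from > end) {
            // One run of equal keys covered the whole split; nothing left to read.
            return new int[] {from, from};
        }
        // Read on past the split end to finish the first run of the next split.
        int to = runEnd(keys, end);
        return new int[] {from, to};
    }

    /** Index just past the run of equal keys that begins at i (i itself at end of input). */
    static int runEnd(List<String> keys, int i) {
        int j = i;
        while (j < keys.size() && keys.get(j).equals(keys.get(i))) {
            j++;
        }
        return j;
    }
}
```

With keys `a a b b b c d d` and nominal splits [0,3), [3,6), [6,8), the aligned ranges come out as [0,5), [5,8) and an empty [8,8): every record is read exactly once and no key is divided between readers, but a split can end up with no input when the previous reader consumed its only run.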


[jira] Commented: (HADOOP-2921) align map splits on sorted files with key boundaries

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574224#action_12574224 ] 

Joydeep Sen Sarma commented on HADOOP-2921:
-------------------------------------------

oh btw - the reason for doing it like this was that i wouldn't have been able to do this by subclassing sequencefileinputformat itself. most of the important variables are private - and i didn't want to change the core code. so tried to keep it in the app layer.

but obviously - would be more efficient to implement in the sequencefile code itself.


[jira] Commented: (HADOOP-2921) align map splits on sorted files with key boundaries

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12575073#action_12575073 ] 

Joydeep Sen Sarma commented on HADOOP-2921:
-------------------------------------------

> Why do you prefer using values to keys?

we don't use keys at all. We are using Hadoop as a row-oriented database - where the value encodes a row. The sort field is embedded inside the row (i.e. the value) itself and it would be redundant to store it in the key. So we save space and don't put it there. JAQL (and i believe Cascading) also do the same. I am not sure about Pig.

The Partitioner interface also allows partitioning based on key and value - so there seems to be a precedent here. 
