You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by "Dick King (JIRA)" <ji...@apache.org> on 2009/04/18 03:21:14 UTC

[jira] Created: (HADOOP-5705) Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.

Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.
----------------------------------------------------------------------

                 Key: HADOOP-5705
                 URL: https://issues.apache.org/jira/browse/HADOOP-5705
             Project: Hadoop Core
          Issue Type: Improvement
          Components: mapred
    Affects Versions: 0.19.1
            Reporter: Dick King


With the old technology, if a particular node in the trie has many children that contain no split points, TotalOrderPartitioner creates a separate empty leaf node for each one.  This takes a lot of space, which in turn limits the depth to which we can grow these trees to avoid making too many sparse nodes, so there is a parameter that defaults to 2 to control the depth.  With this patch, I can guarantee that each split point will create only three trie nodes in the trie: an empty one, a leaf that contains only one split point, and the internal nodes as needed, for a total space of perhaps 50-200 bytes per split point, even if the entire trie is elaborated.  There are pathological cases that can cause a recursion overflow during creation, so i have set the default tree max depth to 200, but I expect almost all tries to be fully elaborated, which means in turn that each byte of the sought key gets touched at most twice.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5705) Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702764#action_12702764 ] 

Hudson commented on HADOOP-5705:
--------------------------------

Integrated in Hadoop-trunk #817 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/817/])
    . Improve TotalOrderPartitioner efficiency by updating the trie construction. Contributed by Dick King.


> Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-5705
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5705
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.19.1
>            Reporter: Dick King
>            Assignee: Dick King
>             Fix For: 0.21.0
>
>         Attachments: hadoop-5705-no-names.patch, hadoop-5705.patch
>
>
> With the old technology, if a particular node in the trie has many children that contain no split points, TotalOrderPartitioner creates a separate empty leaf node for each one.  This takes a lot of space, which in turn limits the depth to which we can grow these trees to avoid making too many sparse nodes, so there is a parameter that defaults to 2 to control the depth.  With this patch, I can guarantee that each split point will create only three trie nodes in the trie: an empty one, a leaf that contains only one split point, and the internal nodes as needed, for a total space of perhaps 50-200 bytes per split point, even if the entire trie is elaborated.  There are pathological cases that can cause a recursion overflow during creation, so i have set the default tree max depth to 200, but I expect almost all tries to be fully elaborated, which means in turn that each byte of the sought key gets touched at most twice.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5705) Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-5705:
----------------------------------

    Assignee: Dick King
      Status: Patch Available  (was: Open)

> Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-5705
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5705
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.19.1
>            Reporter: Dick King
>            Assignee: Dick King
>         Attachments: hadoop-5705.patch
>
>
> With the old technology, if a particular node in the trie has many children that contain no split points, TotalOrderPartitioner creates a separate empty leaf node for each one.  This takes a lot of space, which in turn limits the depth to which we can grow these trees to avoid making too many sparse nodes, so there is a parameter that defaults to 2 to control the depth.  With this patch, I can guarantee that each split point will create only three trie nodes in the trie: an empty one, a leaf that contains only one split point, and the internal nodes as needed, for a total space of perhaps 50-200 bytes per split point, even if the entire trie is elaborated.  There are pathological cases that can cause a recursion overflow during creation, so i have set the default tree max depth to 200, but I expect almost all tries to be fully elaborated, which means in turn that each byte of the sought key gets touched at most twice.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5705) Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-5705:
----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.21.0
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

+1

I committed this. Thanks, Dick

> Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-5705
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5705
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.19.1
>            Reporter: Dick King
>            Assignee: Dick King
>             Fix For: 0.21.0
>
>         Attachments: hadoop-5705-no-names.patch, hadoop-5705.patch
>
>
> With the old technology, if a particular node in the trie has many children that contain no split points, TotalOrderPartitioner creates a separate empty leaf node for each one.  This takes a lot of space, which in turn limits the depth to which we can grow these trees to avoid making too many sparse nodes, so there is a parameter that defaults to 2 to control the depth.  With this patch, I can guarantee that each split point will create only three trie nodes in the trie: an empty one, a leaf that contains only one split point, and the internal nodes as needed, for a total space of perhaps 50-200 bytes per split point, even if the entire trie is elaborated.  There are pathological cases that can cause a recursion overflow during creation, so i have set the default tree max depth to 200, but I expect almost all tries to be fully elaborated, which means in turn that each byte of the sought key gets touched at most twice.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5705) Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.

Posted by "Dick King (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dick King updated HADOOP-5705:
------------------------------

    Attachment: hadoop-5705-no-names.patch

Fixed the patch to remove a comment that contained my name

> Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-5705
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5705
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.19.1
>            Reporter: Dick King
>            Assignee: Dick King
>         Attachments: hadoop-5705-no-names.patch, hadoop-5705.patch
>
>
> With the old technology, if a particular node in the trie has many children that contain no split points, TotalOrderPartitioner creates a separate empty leaf node for each one.  This takes a lot of space, which in turn limits the depth to which we can grow these trees to avoid making too many sparse nodes, so there is a parameter that defaults to 2 to control the depth.  With this patch, I can guarantee that each split point will create only three trie nodes in the trie: an empty one, a leaf that contains only one split point, and the internal nodes as needed, for a total space of perhaps 50-200 bytes per split point, even if the entire trie is elaborated.  There are pathological cases that can cause a recursion overflow during creation, so i have set the default tree max depth to 200, but I expect almost all tries to be fully elaborated, which means in turn that each byte of the sought key gets touched at most twice.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5705) Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701767#action_12701767 ] 

Hadoop QA commented on HADOOP-5705:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12405833/hadoop-5705.patch
  against trunk revision 767643.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/228/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/228/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/228/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/228/console

This message is automatically generated.

> Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-5705
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5705
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.19.1
>            Reporter: Dick King
>            Assignee: Dick King
>         Attachments: hadoop-5705.patch
>
>
> With the old technology, if a particular node in the trie has many children that contain no split points, TotalOrderPartitioner creates a separate empty leaf node for each one.  This takes a lot of space, which in turn limits the depth to which we can grow these trees to avoid making too many sparse nodes, so there is a parameter that defaults to 2 to control the depth.  With this patch, I can guarantee that each split point will create only three trie nodes in the trie: an empty one, a leaf that contains only one split point, and the internal nodes as needed, for a total space of perhaps 50-200 bytes per split point, even if the entire trie is elaborated.  There are pathological cases that can cause a recursion overflow during creation, so i have set the default tree max depth to 200, but I expect almost all tries to be fully elaborated, which means in turn that each byte of the sought key gets touched at most twice.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5705) Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.

Posted by "Dick King (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dick King updated HADOOP-5705:
------------------------------


The patch is ready.

I don't know how to mark the status of the issue as "patch available".  Could someone tell me?

> Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-5705
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5705
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.19.1
>            Reporter: Dick King
>         Attachments: hadoop-5705.patch
>
>
> With the old technology, if a particular node in the trie has many children that contain no split points, TotalOrderPartitioner creates a separate empty leaf node for each one.  This takes a lot of space, which in turn limits the depth to which we can grow these trees to avoid making too many sparse nodes, so there is a parameter that defaults to 2 to control the depth.  With this patch, I can guarantee that each split point will create only three trie nodes in the trie: an empty one, a leaf that contains only one split point, and the internal nodes as needed, for a total space of perhaps 50-200 bytes per split point, even if the entire trie is elaborated.  There are pathological cases that can cause a recursion overflow during creation, so i have set the default tree max depth to 200, but I expect almost all tries to be fully elaborated, which means in turn that each byte of the sought key gets touched at most twice.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5705) Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700431#action_12700431 ] 

Hong Tang commented on HADOOP-5705:
-----------------------------------

Dick, just click "submit patch" will change the status to "patch available".

Could you be sure to the patch applies to trunk (instead of 0.19)?

> Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-5705
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5705
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.19.1
>            Reporter: Dick King
>         Attachments: hadoop-5705.patch
>
>
> With the old technology, if a particular node in the trie has many children that contain no split points, TotalOrderPartitioner creates a separate empty leaf node for each one.  This takes a lot of space, which in turn limits the depth to which we can grow these trees to avoid making too many sparse nodes, so there is a parameter that defaults to 2 to control the depth.  With this patch, I can guarantee that each split point will create only three trie nodes in the trie: an empty one, a leaf that contains only one split point, and the internal nodes as needed, for a total space of perhaps 50-200 bytes per split point, even if the entire trie is elaborated.  There are pathological cases that can cause a recursion overflow during creation, so i have set the default tree max depth to 200, but I expect almost all tries to be fully elaborated, which means in turn that each byte of the sought key gets touched at most twice.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5705) Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.

Posted by "Dick King (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dick King updated HADOOP-5705:
------------------------------

    Attachment: hadoop-5705.patch

> Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-5705
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5705
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.19.1
>            Reporter: Dick King
>         Attachments: hadoop-5705.patch
>
>
> With the old technology, if a particular node in the trie has many children that contain no split points, TotalOrderPartitioner creates a separate empty leaf node for each one.  This takes a lot of space, which in turn limits the depth to which we can grow these trees to avoid making too many sparse nodes, so there is a parameter that defaults to 2 to control the depth.  With this patch, I can guarantee that each split point will create only three trie nodes in the trie: an empty one, a leaf that contains only one split point, and the internal nodes as needed, for a total space of perhaps 50-200 bytes per split point, even if the entire trie is elaborated.  There are pathological cases that can cause a recursion overflow during creation, so i have set the default tree max depth to 200, but I expect almost all tries to be fully elaborated, which means in turn that each byte of the sought key gets touched at most twice.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5705) Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.

Posted by "Jakob Homan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700429#action_12700429 ] 

Jakob Homan commented on HADOOP-5705:
-------------------------------------

Under Available Workflow Actions, hit Submit Patch.

> Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-5705
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5705
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.19.1
>            Reporter: Dick King
>         Attachments: hadoop-5705.patch
>
>
> With the old technology, if a particular node in the trie has many children that contain no split points, TotalOrderPartitioner creates a separate empty leaf node for each one.  This takes a lot of space, which in turn limits the depth to which we can grow these trees to avoid making too many sparse nodes, so there is a parameter that defaults to 2 to control the depth.  With this patch, I can guarantee that each split point will create only three trie nodes in the trie: an empty one, a leaf that contains only one split point, and the internal nodes as needed, for a total space of perhaps 50-200 bytes per split point, even if the entire trie is elaborated.  There are pathological cases that can cause a recursion overflow during creation, so i have set the default tree max depth to 200, but I expect almost all tries to be fully elaborated, which means in turn that each byte of the sought key gets touched at most twice.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5705) Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.

Posted by "Dick King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702097#action_12702097 ] 

Dick King commented on HADOOP-5705:
-----------------------------------

No new tests are needed by this patch because it improves performance without causing any changes to the behavior of the total ordering partitioner.  

If you believe that tries are worthwhile at all, then you must believe that it's worthwhile to run the tries as deeply as possible.  Java's recursion stack depth is 500 IIRC, so I set the depth limit to 200.  I surmise that almost all input split set adjacent element pairs differ before the 200th byte but that cosuming 201 stack frames won't break very many contexts where this is called.

> Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-5705
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5705
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.19.1
>            Reporter: Dick King
>            Assignee: Dick King
>         Attachments: hadoop-5705-no-names.patch, hadoop-5705.patch
>
>
> With the old technology, if a particular node in the trie has many children that contain no split points, TotalOrderPartitioner creates a separate empty leaf node for each one.  This takes a lot of space, which in turn limits the depth to which we can grow these trees to avoid making too many sparse nodes, so there is a parameter that defaults to 2 to control the depth.  With this patch, I can guarantee that each split point will create only three trie nodes in the trie: an empty one, a leaf that contains only one split point, and the internal nodes as needed, for a total space of perhaps 50-200 bytes per split point, even if the entire trie is elaborated.  There are pathological cases that can cause a recursion overflow during creation, so i have set the default tree max depth to 200, but I expect almost all tries to be fully elaborated, which means in turn that each byte of the sought key gets touched at most twice.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5705) Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.

Posted by "Dick King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700882#action_12700882 ] 

Dick King commented on HADOOP-5705:
-----------------------------------

The patch applies to trunk -- and 0.20 -- and 0.19 .  In fact, it's only really shaken out in 0.21 .

I didn't mean to give the impression that it was version specific.  I probably made a mistake when I created JIRA.

> Improved tries in TotalOrderPartitioner to eliminate large leaf nodes.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-5705
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5705
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.19.1
>            Reporter: Dick King
>            Assignee: Dick King
>         Attachments: hadoop-5705.patch
>
>
> With the old technology, if a particular node in the trie has many children that contain no split points, TotalOrderPartitioner creates a separate empty leaf node for each one.  This takes a lot of space, which in turn limits the depth to which we can grow these trees to avoid making too many sparse nodes, so there is a parameter that defaults to 2 to control the depth.  With this patch, I can guarantee that each split point will create only three trie nodes in the trie: an empty one, a leaf that contains only one split point, and the internal nodes as needed, for a total space of perhaps 50-200 bytes per split point, even if the entire trie is elaborated.  There are pathological cases that can cause a recursion overflow during creation, so i have set the default tree max depth to 200, but I expect almost all tries to be fully elaborated, which means in turn that each byte of the sought key gets touched at most twice.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.