Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2008/09/23 21:31:44 UTC

[jira] Created: (PIG-450) PERFORMANCE: Distinct should make use of combiner to remove duplicate values from keys.

PERFORMANCE:  Distinct should make use of combiner to remove duplicate values from keys.
----------------------------------------------------------------------------------------

                 Key: PIG-450
                 URL: https://issues.apache.org/jira/browse/PIG-450
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: types_branch
            Reporter: Alan Gates
            Assignee: Alan Gates
             Fix For: types_branch


In 2.0, distinct was improved by dropping the values in the map phase and passing only an empty tuple along with each key.  This can be further improved by adding a combiner step that passes along only the first empty tuple for each key instead of all of them.
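
The idea can be illustrated with a small, self-contained simulation. This is only a sketch and not Pig's actual implementation: the function names (map_phase, combine, reduce_phase, distinct) and the sample data are hypothetical, and the map/combine/reduce steps are modeled as plain Python functions rather than Hadoop classes.

```python
from collections import defaultdict


def map_phase(records):
    # Emit each record as the key, paired with an empty tuple as the value.
    return [(record, ()) for record in records]


def combine(map_output):
    # Hypothetical combiner: keep only the first empty tuple seen for each
    # key within a single map task's output, and drop the rest.
    seen = set()
    combined = []
    for key, value in map_output:
        if key not in seen:
            seen.add(key)
            combined.append((key, value))
    return combined


def reduce_phase(shuffled):
    # Group the shuffled pairs by key and emit each key exactly once.
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    return sorted(groups)


def distinct(splits, use_combiner=True):
    # Run one simulated map task per input split, optionally combine its
    # output, and count how many pairs cross from map to reduce.
    shuffled = []
    for split in splits:
        out = map_phase(split)
        if use_combiner:
            out = combine(out)
        shuffled.extend(out)
    return reduce_phase(shuffled), len(shuffled)


# Made-up input: two splits with heavy duplication of keys.
splits = [["a", "b", "a", "a"], ["b", "c", "b"]]
result_with, shuffled_with = distinct(splits, use_combiner=True)
result_without, shuffled_without = distinct(splits, use_combiner=False)
print(result_with, shuffled_with, shuffled_without)
```

Both runs produce the same distinct keys; the only difference is how many key/empty-tuple pairs are carried from map to reduce, which is where the combiner is expected to help when keys repeat often.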

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-450) PERFORMANCE: Distinct should make use of combiner to remove duplicate values from keys.

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-450:
---------------------------

    Status: Patch Available  (was: Open)




[jira] Updated: (PIG-450) PERFORMANCE: Distinct should make use of combiner to remove duplicate values from keys.

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-450:
---------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Patch checked in.



[jira] Updated: (PIG-450) PERFORMANCE: Distinct should make use of combiner to remove duplicate values from keys.

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-450:
---------------------------

    Attachment: PIG-450.patch

This patch adds a combiner step to distinct that removes the duplicate values, so that less data is carried across from map to reduce.  Here are the resulting run times (all times in seconds):

||Num records||Num keys||Num reducers||1.4 (s)||2.0 (s)||2.0 with this patch (s)||
| 200M | 60 | 1 | 2547 | 1388 | 142 |
| 200M | 16M | 50 | 384 | 227 | 231 |

The main benefit comes with a small number of keys (1388 seconds down to 142 with 60 keys), but there does not appear to be a penalty with a larger number of keys (227 seconds versus 231 with 16M keys).
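
For reference, the relative change implied by the table can be worked out with a few lines of arithmetic. This is just a sketch over the numbers reported above; the dictionary field names are made up for illustration.

```python
# Run times in seconds, copied from the table in this comment.
timings = [
    {"keys": "60", "reducers": 1, "v1_4": 2547, "v2_0": 1388, "patched": 142},
    {"keys": "16M", "reducers": 50, "v1_4": 384, "v2_0": 227, "patched": 231},
]

# Ratio of the 2.0 time to the patched 2.0 time for each row.
# Values above 1 mean the patched run was faster.
ratios = [row["v2_0"] / row["patched"] for row in timings]

for row, ratio in zip(timings, ratios):
    print(f"{row['keys']} keys, {row['reducers']} reducer(s): "
          f"2.0 / patched = {ratio:.2f}")
```

With 60 keys the patched run is a little under ten times faster than 2.0, while with 16M keys the ratio is just under 1, which is a difference of a few seconds on a run of almost four minutes.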





[jira] Commented: (PIG-450) PERFORMANCE: Distinct should make use of combiner to remove duplicate values from keys.

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633986#action_12633986 ] 

Olga Natkovich commented on PIG-450:
------------------------------------

+1
