Posted to dev@pig.apache.org by "Yan Zhou (JIRA)" <ji...@apache.org> on 2010/12/08 00:05:17 UTC

[jira] Created: (PIG-1757) After split combination, the number of maps may vary slightly

After split combination, the number of maps may vary slightly
-------------------------------------------------------------

                 Key: PIG-1757
                 URL: https://issues.apache.org/jira/browse/PIG-1757
             Project: Pig
          Issue Type: Improvement
    Affects Versions: 0.8.0
            Reporter: Yan Zhou
            Priority: Minor
             Fix For: 0.9.0


The split combination, introduced in 0.8 by PIG-1518, may see small variations in the number of maps. For instance, PigMix2's L4 query runs with either 901 or 902 maps in a test cluster. The reason is that BlockLocation's getHosts method, used in FileInputFormat's split generation, returns a list of the hosts that hold the block, but the ordering of that list is not deterministic. Pig's split combination is not immune to such random ordering, since the combination decision is based upon the hosts that hold as much data local to a map as possible, and there is no specific tie-breaking rule to force a particular ordering. In some benchmarking or performance baselining tests, these variations, however small, might not be desirable.

One solution is to sort the host lists from the component splits so as to get a consistent number of maps.
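
The sorting idea can be illustrated with a minimal Java sketch. This is not the PIG-1757 patch and does not use Pig or Hadoop classes: the class name SortedHostsSketch, the methods pickHost and sortedCopies, and the host names are all hypothetical, and the "first host with the highest count wins" rule is only an assumed stand-in for a combination decision that has no tie-breaking rule. It shows that the same replica sets, reported in two different orders, can lead to different choices unless each host list is sorted first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Pig's actual split combination code.
class SortedHostsSketch {

    // Counts how many component splits each host holds and returns the host
    // with the highest count. Ties go to the host that was seen first, so the
    // result depends on the order of the incoming host lists.
    static String pickHost(List<String[]> hostLists) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] hosts : hostLists) {
            for (String host : hosts) {
                counts.merge(host, 1, Integer::sum);
            }
        }
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // Returns sorted copies of the host lists, leaving the inputs untouched,
    // so that the order in which hosts were reported no longer matters.
    static List<String[]> sortedCopies(List<String[]> hostLists) {
        List<String[]> out = new ArrayList<>();
        for (String[] hosts : hostLists) {
            String[] copy = hosts.clone();
            Arrays.sort(copy);
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        // The same two blocks, with replicas listed in a different order per run.
        List<String[]> run1 = Arrays.asList(
                new String[] {"nodeB", "nodeA", "nodeC"},
                new String[] {"nodeA", "nodeB", "nodeD"});
        List<String[]> run2 = Arrays.asList(
                new String[] {"nodeA", "nodeC", "nodeB"},
                new String[] {"nodeB", "nodeD", "nodeA"});

        System.out.println("unsorted: " + pickHost(run1) + " vs " + pickHost(run2));
        System.out.println("sorted:   " + pickHost(sortedCopies(run1))
                + " vs " + pickHost(sortedCopies(run2)));
    }
}
```

In this sketch nodeA and nodeB tie with two blocks each, so the unsorted choice follows the reporting order, while the sorted choice is the same in both runs. Sorting only removes the dependence on host order within each list; it assumes the component splits themselves arrive in a stable order.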

I suspect that other split combination techniques that use the data host info to maximize data locality in each map, like CombineFileInputFormat, might show similar variations in the number of maps.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (PIG-1757) After split combination, the number of maps may vary slightly

Posted by "Yan Zhou (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou reassigned PIG-1757:
-----------------------------

    Assignee: Yan Zhou



[jira] Updated: (PIG-1757) After split combination, the number of maps may vary slightly

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1757:
--------------------------------

    Fix Version/s: 0.8.0



[jira] Resolved: (PIG-1757) After split combination, the number of maps may vary slightly

Posted by "Yan Zhou (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou resolved PIG-1757.
---------------------------

    Resolution: Fixed

Committed to the trunk.



[jira] Commented: (PIG-1757) After split combination, the number of maps may vary slightly

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969832#action_12969832 ] 

Richard Ding commented on PIG-1757:
-----------------------------------

+1



[jira] Updated: (PIG-1757) After split combination, the number of maps may vary slightly

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1757:
--------------------------------

    Fix Version/s:     (was: 0.8.0)



[jira] Updated: (PIG-1757) After split combination, the number of maps may vary slightly

Posted by "Yan Zhou (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1757:
--------------------------

    Attachment: PIG-1757.patch

test-core runs OK; test-patch is clean except for the lack of a test case, which is acceptable because the change is trivial and difficult to exercise on a local cluster.
