
Posted to dev@pig.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2016/05/24 22:37:13 UTC

[jira] [Updated] (PIG-4652) [Pig on Tez] Key Comparison is slower than mapreduce

     [ https://issues.apache.org/jira/browse/PIG-4652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-4652:
------------------------------------
    Fix Version/s:     (was: 0.16.0)
                   0.17.0

> [Pig on Tez] Key Comparison is slower than mapreduce
> ----------------------------------------------------
>
>                 Key: PIG-4652
>                 URL: https://issues.apache.org/jira/browse/PIG-4652
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>             Fix For: 0.17.0
>
>
> Tez uses PigTupleSortComparator on both the map and reduce side and in POShuffleTezLoad.  Mapreduce uses PigTupleWritableComparator on the map and reduce side to compare tuples, which is a byte-only comparison and very fast.  It then uses PigGrouping<DataType>WritableComparator as the grouping comparator to group those keys correctly.
>   It is not possible to use a similar method in Tez (PigTupleWritableComparator for output and input, and PigTupleSortComparator in POShuffleTezLoad) without adding APIs in Tez to get the raw bytes of the keys. When we compare multiple inputs for the min key in POShuffleTezLoad, the raw bytes need to be compared to maintain the same order as the map side. In mapreduce, there was only a single input and the mapreduce framework sorted everything together. But in Tez, the join inputs are sorted separately and the application only gets the deserialized key. Tez KeyValuesReader needs an API that also returns the bytes of the current key, which POShuffleTezLoad can then use for the min key comparison.
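
To illustrate the min key comparison described above, here is a minimal sketch in Java. It assumes the raw bytes of each input's current key are available, which is exactly what the issue says Tez does not expose today. The class RawKeyOrder and its methods are hypothetical names for illustration, not Pig or Tez APIs, and unsigned lexicographic order is assumed as the byte-only ordering.

```java
import java.util.List;

// Hypothetical helper, not part of Pig or Tez.
final class RawKeyOrder {

    // Unsigned lexicographic comparison of two serialized keys.
    // Assumption: this matches the byte-only order used on the map side.
    static int compareRaw(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff; // treat each byte as unsigned
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // All shared bytes are equal: the shorter key sorts first.
        return a.length - b.length;
    }

    // Returns the index of the input whose current key is smallest in raw
    // byte order, or -1 if every input is exhausted. A null entry stands
    // for an exhausted input.
    static int minKeyIndex(List<byte[]> currentKeys) {
        int min = -1;
        for (int i = 0; i < currentKeys.size(); i++) {
            byte[] key = currentKeys.get(i);
            if (key == null) {
                continue;
            }
            if (min < 0 || compareRaw(key, currentKeys.get(min)) < 0) {
                min = i;
            }
        }
        return min;
    }
}
```

In this sketch a key byte of 0x80 sorts after 0x02 even though it is negative as a signed Java byte. A typed comparator can order the same two keys the other way, which is why the merge across separately sorted inputs has to use the same byte order that produced the sort.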


