You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org> on 2010/06/06 11:52:58 UTC

[jira] Updated: (PIG-1295) Binary comparator for secondary sort

     [ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1295:
------------------------------------------------

    Attachment: PIG-1295_0.1.patch

Here is my frist dratf for the binary comparator, together with some simple tests.

This comparator should replace PigTupleRawComparator.

I assume that the tuple we are comparing is a NullableTuple that wraps a DefaultTuple.
There is a comment in the old comparator that says:
{code}
// Users are allowed to implement their own versions of tuples, which means we have no
// idea what the underlying representation is.
{code}
This will need to be addressed. If I can know that the tuple is not a DefaultTuple looking at its data type byte, I can switch back to object deserialization. This means that the user must not use the same DataType.TUPLE as the default tuple.

The comparator iterates over the serialized tuples and compares them following the same logic as the old comparator (first check if Null, then their sizes, then compare field by field, if the fields are of different types compare the datatypes, else compare the values).
For now I implemented the logic only for simple types, I plan to add bytearrays and chararrays. I do not plan to add support for complex datatypes. In this case I will fall back to object deserialization.

The implementation uses a ByteBuffer to easily iterate over the values. I don't know if it is acceptable or its performance impact but it is very handy. The alternative is manual bookkeeping of a cursor inside the byte arrays.
To do this I would probably use a map that tells me how many bytes each data type uses. This map should probably go somewhere else though (DataType? DataReaderWriter?). I commented out an initial attempt for this at the end of the class.
An alternative implementation could be to follow the strategy of the others PigRawComparators, that wrap a comparator from Hadoop. This would require keeping a number of comparators in memory, though. I commented out this approach at the beginning of the class.

TODO:
Implement fallback behaviour.
Implement support for bytearrays and chararrays.
Graceful handling of tuples different from DefaultTuple.
Add performance tests.

Any comment is more than welcome.

> Binary comparator for secondary sort
> ------------------------------------
>
>                 Key: PIG-1295
>                 URL: https://issues.apache.org/jira/browse/PIG-1295
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>         Attachments: PIG-1295_0.1.patch
>
>
> When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case:
> 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator
> 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string
> However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down.
> Binary comparator for tuple should be doable once we understand the binary structure of the serialized tuple. We can focus on most common use cases first, which is "group by" followed by a nested sort. In this case, we will use secondary sort. Semantics of the first key does not matter but semantics of secondary key matters. We need to identify the boundary of main key and secondary key in the binary tuple buffer without instantiate tuple itself. Then if the first key equals, we use a binary comparator to compare secondary key. Secondary key can also be a complex data type, but for the first step, we focus on simple secondary key, which is the most common use case.
> We mark this issue to be a candidate project for "Google summer of code 2010" program. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.