You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@avro.apache.org by "Scott Carey (JIRA)" <ji...@apache.org> on 2010/12/23 19:48:45 UTC

[jira] Issue Comment Edited: (AVRO-712) define memcmp'able encoding

    [ https://issues.apache.org/jira/browse/AVRO-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974691#action_12974691 ] 

Scott Carey edited comment on AVRO-712 at 12/23/10 1:47 PM:
------------------------------------------------------------

Ints/Longs can be variable length, with the same space cost as before: The first byte in the sequence has the sign bit first, inverted (0 for negative numbers, 1 for positive).  The second bit is the 'is last byte in varint' bit -- 1 if there are more, 0 if it is the last byte.  The remaining encoding is like before -- zig-zag packed absolute value of the number.  However, if the number is negative we need to flip those bits as well as the 'is last byte in varint' bits.  Both encoding and decoding are more expensive this way, but memcmp would work and the space used would be the same.

Bytes/Strings/Arrays:
All are prefixed with a length in the current format.  you can't prefix with a length and get "abc" < "abcd" as well as "abc" < "b".  Lexicographic ordering of variable length fields is hard.  We could use a similar tactic to the varints above for Bytes and store 7 bits per byte, with a trailing marker bit indicating the end of the sequence.  This takes more space, and is definitely more expensive to serialize/deserialize.

I think UTF-8's binary properties might be beneficial here for Strings.  At least for each sequence of bytes that represents a character, it may already sort properly.  We just need a way to identify the end of the string.  I suppose trailing marker bytes or bits will work here too, with differing space/time tradeoffs.

Arrays can't use the trailing marker bit, a trailing marker byte per element would work.

Records:  I think if "ignored" fields were placed at the end, it would just work for sorting (equal fields would be adjacent) but an equivalence test would fail (one processes as > when they are equal).  



Overall, I am not convinced such an encoding would have that much value.  The extra time/space cost of encoding and decoding in many use cases would outweigh the savings of sorting and comparison.   For some languages, the cost of doing the comparison on the current binary format is either not too great or could be optimized quite a bit further than they are now (C/C+\+/Java).  More dynamic languages may benefit more from being able to call into an optimized memcmp routine since their conditional logic and method call overhead is greater.  For those languages, perhaps we should call out to an optimized Avro C or C+\+ stub for comparisons?  I wonder if we put our efforts into more optimized comparison, serialization, and deserialization if we would end up in a better place than adding a binary format.



      was (Author: scott_carey):
    Ints/Longs can be variable length, with the same space cost as before: The first byte in the sequence has the sign bit first, inverted (0 for negative numbers, 1 for positive).  The second bit is the 'is last byte in varint' bit -- 1 if there are more, 0 if it is the last byte.  The remaining encoding is like before -- zig-zag packed absolute value of the number.  However, if the number is negative we need to flip those bits as well as the 'is last byte in varint' bits.  Both encoding and decoding are more expensive this way, but memcmp would work and the space used would be the same.

Bytes/Strings/Arrays:
All are prefixed with a length in the current format.  you can't prefix with a length and get "abc" < "abcd" as well as "abc" < "b".  Lexicographic ordering of variable length fields is hard.  We could use a similar tactic to the varints above for Bytes and store 7 bits per byte, with a trailing marker bit indicating the end of the sequence.  This takes more space, and is definitely more expensive to serialize/deserialize.

I think UTF-8's binary properties might be beneficial here for Strings.  At least for each sequence of bytes that represents a character, it may already sort properly.  We just need a way to identify the end of the string.  I suppose trailing marker bytes or bits will work here too, with differing space/time tradeoffs.

Arrays can't use the trailing marker bit, a trailing marker byte per element would work.

Records:  I think if "ignored" fields were placed at the end, it would just work for sorting (equal fields would be adjacent) but an equivalence test would fail (one processes as > when they are equal).  



Overall, I am not convinced such an encoding would have that much value.  The extra time/space cost of encoding and decoding in many use cases would outweigh the savings of sorting and comparison.   For some languages, the cost of doing the comparison on the current binary format is either not too great or could be optimized quite a bit further than they are now (C/C++/Java).  More dynamic languages may benefit more from being able to call into an optimized memcmp routine since their conditional logic and method call overhead is greater.  For those languages, perhaps we should call out to an optimized Avro C or C++ stub for comparisons?  I wonder if we put our efforts into more optimized comparison, serialization, and deserialization if we would end up in a better place than adding a binary format.


  
> define memcmp'able encoding
> ---------------------------
>
>                 Key: AVRO-712
>                 URL: https://issues.apache.org/jira/browse/AVRO-712
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Doug Cutting
>         Attachments: memcmp_encoding_prototype.py
>
>
> It would be useful to have an encoding for Avro data that ordered data according to the Avro specification under memcmp.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.