You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@avro.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2010/12/15 21:44:00 UTC

[jira] Created: (AVRO-712) define memcmp'able encoding

define memcmp'able encoding
---------------------------

                 Key: AVRO-712
                 URL: https://issues.apache.org/jira/browse/AVRO-712
             Project: Avro
          Issue Type: New Feature
          Components: spec
            Reporter: Doug Cutting


It would be useful to have an encoding for Avro data that ordered data according to the Avro specification under memcmp.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (AVRO-712) define memcmp'able encoding

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974691#action_12974691 ] 

Scott Carey commented on AVRO-712:
----------------------------------

Ints/Longs can be variable length, with the same space cost as before: The first byte in the sequence has the sign bit first, inverted (0 for negative numbers, 1 for positive).  The second bit is the 'is last byte in varint' bit -- 1 if there are more, 0 if it is the last byte.  The remaining encoding is like before -- zig-zag packed absolute value of the number.  However, if the number is negative we need to flip those bits as well as the 'is last byte in varint' bits.  Both encoding and decoding are more expensive this way, but memcmp would work and the space used would be the same.

Bytes/Strings/Arrays:
All are prefixed with a length in the current format.  you can't prefix with a length and get "abc" < "abcd" as well as "abc" < "b".  Lexicographic ordering of variable length fields is hard.  We could use a similar tactic to the varints above for Bytes and store 7 bits per byte, with a trailing marker bit indicating the end of the sequence.  This takes more space, and is definitely more expensive to serialize/deserialize.

I think UTF-8's binary properties might be beneficial here for Strings.  At least for each sequence of bytes that represents a character, it may already sort properly.  We just need a way to identify the end of the string.  I suppose trailing marker bytes or bits will work here too, with differing space/time tradeoffs.

Arrays can't use the trailing marker bit, a trailing marker byte per element would work.

Records:  I think if "ignored" fields were placed at the end, it would just work for sorting (equal fields would be adjacent) but an equivalence test would fail (one processes as > when they are equal).  



Overall, I am not convinced such an encoding would have that much value.  The extra time/space cost of encoding and decoding in many use cases would outweigh the savings of sorting and comparison.   For some languages, the cost of doing the comparison on the current binary format is either not too great or could be optimized quite a bit further than they are now (C/C++/Java).  More dynamic languages may benefit more from being able to call into an optimized memcmp routine since their conditional logic and method call overhead is greater.  For those languages, perhaps we should call out to an optimized Avro C or C++ stub for comparisons?  I wonder if we put our efforts into more optimized comparison, serialization, and deserialization if we would end up in a better place than adding a binary format.



> define memcmp'able encoding
> ---------------------------
>
>                 Key: AVRO-712
>                 URL: https://issues.apache.org/jira/browse/AVRO-712
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Doug Cutting
>         Attachments: memcmp_encoding_prototype.py
>
>
> It would be useful to have an encoding for Avro data that ordered data according to the Avro specification under memcmp.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (AVRO-712) define memcmp'able encoding

Posted by "Adam Warrington (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974410#action_12974410 ] 

Adam Warrington commented on AVRO-712:
--------------------------------------

Here is a first stab at a definition. I've written a prototype using these rules in python, and will attach it to this ticket. The spec I've defined here is not nearly as efficient in both space and time to encode/decode as the binary encoding currently developed in Avro. It's support of features like the "ignored" and "decreasing" attributes on record fields is also lacking.

Spec:

int and long data is encoded by flipping the sign bit, and encoding as big endian.

============================================

float and double data is encoded by flipping the sign bit, and if it is a negative number, taking the compliment of all bits lower than the sign bit. Storing the 4 or 8 bytes as a big endian int or long.

=============================================

bytes: bytes should have a delimiter marking the end of the byte array. This value should be the lowest possible value, since the ordering of two byte arrays equal up to one's delimiter should always order the shorter one less than the longer one. If one were to encode a byte of value 0x00 as two bytes 0x00 0x01, and one were to encode end of byte array as 0x00 0x00, this property would hold true. Some examples:

Alphabet 0-7
String1: 012020 -> 0112012100
String2: 01202 -> 011201200
String2 is less than String1

String1: 012021 -> 0112012100
String2: 012020 -> 01120120100
String2 is less than String1

============================

strings: encoded as bytes defined above, ordering should hold

============================

boolean, null: encoded same as binary encoder

============================

maps: Not supported. Attempting to encode a map with a memcmp DatumWriter should throw an exception in my opinion.

=============================

enum: Encoded the same as DatumWriter using the memcmp encoder, since enums with lesser ordinals come before enums with greater ordinals.

=============================

arrays: I don't have a great answer here. One thing that would work, but isn't space efficient, is to write out a byte after every element in the array indicating whether it is the last element or not, where 0x00 indicates last element, and 0x01 indicates not. I've done this in a prototype I'll attach to this ticket.

=============================

record: Before writing out the fields in a record, the fields should be ordered lexicographically by name. Ignored fields either shouldn't be written (makes the encoder lossy), or the encoder should throw an exception (something I've done in a prototype). In the prototype I created, I also don't handle the "decreasing" field attribute, but this could be done by allowing the encoder to take an optional parameter specifying whether to encode the value in decreasing, which would invert all the bits before returning the encoded bytes.

=============================

union: Encoded the same way as a DatumWriter using a memcmp encoder would encode it currently, first writing out the ordinal of the union using the memcmp encoder, then using the encoder to encode the value.

> define memcmp'able encoding
> ---------------------------
>
>                 Key: AVRO-712
>                 URL: https://issues.apache.org/jira/browse/AVRO-712
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Doug Cutting
>
> It would be useful to have an encoding for Avro data that ordered data according to the Avro specification under memcmp.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (AVRO-712) define memcmp'able encoding

Posted by "Adam Warrington (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/AVRO-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Warrington updated AVRO-712:
---------------------------------

    Attachment: memcmp_encoding_prototype.py

Prototype demonstrating the spec I defined above

> define memcmp'able encoding
> ---------------------------
>
>                 Key: AVRO-712
>                 URL: https://issues.apache.org/jira/browse/AVRO-712
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Doug Cutting
>         Attachments: memcmp_encoding_prototype.py
>
>
> It would be useful to have an encoding for Avro data that ordered data according to the Avro specification under memcmp.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (AVRO-712) define memcmp'able encoding

Posted by "Adam Warrington (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974700#action_12974700 ] 

Adam Warrington commented on AVRO-712:
--------------------------------------

Scott Carey:

I think some usefulness can come from the ability to use Avro entities with systems that use memcmp to sort binary data. For example, keys in an HBase table. One could create multi-component keys for an HBase table using Avro, and have guarantees about how their data is to be sorted. Say I'm storing blog posts in HBase and want to group blog posts over time by domain within a table. I could create an avro schema:

{ "type": "record",
  "name": "author_blog_key",
  "fields": [
    { "name": "domain", "type": "string" },
    { "name": "timestamp": "type": "long" }
  ]}

If the memcmp sort ordering could be guaranteed, I can used serialized instances of this within systems that deal with sorted data using memcmp.

It's clear that the time/space costs are going to be negatively impacted with this type of encoding, especially dealing with bytes/strings/arrays. Your proposed encoding of Ints and Longs is clever, and I like the idea of putting ignored fields at the end of a record if equivalency isn't required (which in many cases it isn't).


> define memcmp'able encoding
> ---------------------------
>
>                 Key: AVRO-712
>                 URL: https://issues.apache.org/jira/browse/AVRO-712
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Doug Cutting
>         Attachments: memcmp_encoding_prototype.py
>
>
> It would be useful to have an encoding for Avro data that ordered data according to the Avro specification under memcmp.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (AVRO-712) define memcmp'able encoding

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975662#action_12975662 ] 

Doug Cutting commented on AVRO-712:
-----------------------------------

Scott> Overall, I am not convinced such an encoding would have that much value.

A use case is systems that don't permit extensible, user-specified comparators.  Does HBase?



> define memcmp'able encoding
> ---------------------------
>
>                 Key: AVRO-712
>                 URL: https://issues.apache.org/jira/browse/AVRO-712
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Doug Cutting
>         Attachments: memcmp_encoding_prototype.py
>
>
> It would be useful to have an encoding for Avro data that ordered data according to the Avro specification under memcmp.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (AVRO-712) define memcmp'able encoding

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974691#action_12974691 ] 

Scott Carey edited comment on AVRO-712 at 12/23/10 1:47 PM:
------------------------------------------------------------

Ints/Longs can be variable length, with the same space cost as before: The first byte in the sequence has the sign bit first, inverted (0 for negative numbers, 1 for positive).  The second bit is the 'is last byte in varint' bit -- 1 if there are more, 0 if it is the last byte.  The remaining encoding is like before -- zig-zag packed absolute value of the number.  However, if the number is negative we need to flip those bits as well as the 'is last byte in varint' bits.  Both encoding and decoding are more expensive this way, but memcmp would work and the space used would be the same.

Bytes/Strings/Arrays:
All are prefixed with a length in the current format.  you can't prefix with a length and get "abc" < "abcd" as well as "abc" < "b".  Lexicographic ordering of variable length fields is hard.  We could use a similar tactic to the varints above for Bytes and store 7 bits per byte, with a trailing marker bit indicating the end of the sequence.  This takes more space, and is definitely more expensive to serialize/deserialize.

I think UTF-8's binary properties might be beneficial here for Strings.  At least for each sequence of bytes that represents a character, it may already sort properly.  We just need a way to identify the end of the string.  I suppose trailing marker bytes or bits will work here too, with differing space/time tradeoffs.

Arrays can't use the trailing marker bit, a trailing marker byte per element would work.

Records:  I think if "ignored" fields were placed at the end, it would just work for sorting (equal fields would be adjacent) but an equivalence test would fail (one processes as > when they are equal).  



Overall, I am not convinced such an encoding would have that much value.  The extra time/space cost of encoding and decoding in many use cases would outweigh the savings of sorting and comparison.   For some languages, the cost of doing the comparison on the current binary format is either not too great or could be optimized quite a bit further than they are now (C/C+\+/Java).  More dynamic languages may benefit more from being able to call into an optimized memcmp routine since their conditional logic and method call overhead is greater.  For those languages, perhaps we should call out to an optimized Avro C or C+\+ stub for comparisons?  I wonder if we put our efforts into more optimized comparison, serialization, and deserialization if we would end up in a better place than adding a binary format.



      was (Author: scott_carey):
    Ints/Longs can be variable length, with the same space cost as before: The first byte in the sequence has the sign bit first, inverted (0 for negative numbers, 1 for positive).  The second bit is the 'is last byte in varint' bit -- 1 if there are more, 0 if it is the last byte.  The remaining encoding is like before -- zig-zag packed absolute value of the number.  However, if the number is negative we need to flip those bits as well as the 'is last byte in varint' bits.  Both encoding and decoding are more expensive this way, but memcmp would work and the space used would be the same.

Bytes/Strings/Arrays:
All are prefixed with a length in the current format.  you can't prefix with a length and get "abc" < "abcd" as well as "abc" < "b".  Lexicographic ordering of variable length fields is hard.  We could use a similar tactic to the varints above for Bytes and store 7 bits per byte, with a trailing marker bit indicating the end of the sequence.  This takes more space, and is definitely more expensive to serialize/deserialize.

I think UTF-8's binary properties might be beneficial here for Strings.  At least for each sequence of bytes that represents a character, it may already sort properly.  We just need a way to identify the end of the string.  I suppose trailing marker bytes or bits will work here too, with differing space/time tradeoffs.

Arrays can't use the trailing marker bit, a trailing marker byte per element would work.

Records:  I think if "ignored" fields were placed at the end, it would just work for sorting (equal fields would be adjacent) but an equivalence test would fail (one processes as > when they are equal).  



Overall, I am not convinced such an encoding would have that much value.  The extra time/space cost of encoding and decoding in many use cases would outweigh the savings of sorting and comparison.   For some languages, the cost of doing the comparison on the current binary format is either not too great or could be optimized quite a bit further than they are now (C/C++/Java).  More dynamic languages may benefit more from being able to call into an optimized memcmp routine since their conditional logic and method call overhead is greater.  For those languages, perhaps we should call out to an optimized Avro C or C++ stub for comparisons?  I wonder if we put our efforts into more optimized comparison, serialization, and deserialization if we would end up in a better place than adding a binary format.


  
> define memcmp'able encoding
> ---------------------------
>
>                 Key: AVRO-712
>                 URL: https://issues.apache.org/jira/browse/AVRO-712
>             Project: Avro
>          Issue Type: New Feature
>          Components: spec
>            Reporter: Doug Cutting
>         Attachments: memcmp_encoding_prototype.py
>
>
> It would be useful to have an encoding for Avro data that ordered data according to the Avro specification under memcmp.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.