Posted to user@hbase.apache.org by Michael Dalton <mw...@gmail.com> on 2011/04/14 02:55:02 UTC

just open sourced Orderly -- a row key schema system (composite keys, etc) for use with HBase

Hi all,

I'm with a startup, GotoMetrics, doing things with Hadoop and I've gotten
permission to open source Orderly -- our row key schema system for use in
projects like HBase. Orderly allows you to serialize common data types
(long, double, bigdecimal, etc) or structs/records of these types to byte
arrays, and ensures that the byte arrays sort in the same natural order as
the data type. You may then use the byte arrays as keys in HBase (or any
sorted, byte-typed key-value store).

I'd really appreciate feedback about what parts are useful (or not useful),
and whether this would be appropriate to submit as a contrib to HBase itself
(or if people would prefer me to submit derivative work to add composite row
keys to Hive/Pig/etc).

Here are the interesting features:

   - All types are serialized to a byte array that sorts in the natural order
   of the underlying key for all key values (e.g., an Integer row key will sort
   correctly for negative/positive values, a double will sort correctly for
   negative/positive/zero/infinity/negative infinity/subnormals/etc - any valid
   value)
   - Both ascending and descending sort order are supported for all types
   - Designed for space efficiency - tricks like using the end of a byte
   array instead of a terminator byte, variable-length types whenever possible,
   etc are all employed to minimize serialization length
   - Support for row key prefixes/suffixes to combine with your own custom
   encodings
   - Variable-length integers (similar in theory to Zig-Zag encoding) are
   supported, and their byte serialization preserves sort ordering
   - BigDecimal support (like all other types, with sort-order-preserving
   byte serialization). To the best of my knowledge, this is the first
   byte-sortable BigDecimal serialization.
   - Float/Double
   - UTF-8 strings (with support for empty string, NULL, etc)
   - Almost all types encode NULL, and do so without using additional space
   (e.g., by using transformation on invalid UTF-8 encodings for Strings, NaNs
   removed during NaN canonicalization for doubles, etc). Null comparess less
   than any non-null value
   - Support for struct (composite) row keys with an arbitrary number of
   fields. Each field may have its own sort order. Structs are sorted by field
   value.

I have the code up on github at http://github.com/mwdalton/orderly. There
are javadocs for all the row key types explaining their serialization format
and performance characteristics (start with the RowKey and StructRowKey
docs), as well as example code in src/example.
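
To make the sort-order guarantee concrete, here's a quick sketch in plain Java
(just an illustration of the general sign-flip idea, not Orderly's actual
encoding or API -- the class and method names below are made up for the
example): encoding a signed long as big-endian bytes with the sign bit flipped
makes a plain unsigned byte comparison, like HBase's Bytes.compareTo, agree
with numeric order.

    import java.nio.ByteBuffer;
    import java.util.Arrays;

    public class OrderPreservingLongSketch {
        // Flip the sign bit so negative values encode to byte strings that are
        // smaller (under unsigned comparison) than those of positive values;
        // the remaining bits of a big-endian two's-complement long already
        // compare correctly byte by byte.
        static byte[] encode(long value) {
            return ByteBuffer.allocate(8).putLong(value ^ Long.MIN_VALUE).array();
        }

        // Unsigned lexicographic comparison -- the same ordering HBase applies
        // to raw row keys.
        static int compareUnsigned(byte[] a, byte[] b) {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                int cmp = (a[i] & 0xff) - (b[i] & 0xff);
                if (cmp != 0) return cmp;
            }
            return a.length - b.length;
        }

        public static void main(String[] args) {
            long[] values = { 7, -3, 0, Long.MAX_VALUE, Long.MIN_VALUE };
            byte[][] keys = new byte[values.length][];
            for (int i = 0; i < values.length; i++) keys[i] = encode(values[i]);

            // Sorting the raw byte arrays recovers the values in numeric order:
            // -9223372036854775808, -3, 0, 7, 9223372036854775807
            Arrays.sort(keys, OrderPreservingLongSketch::compareUnsigned);
            for (byte[] k : keys) {
                System.out.println(ByteBuffer.wrap(k).getLong() ^ Long.MIN_VALUE);
            }
        }
    }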

Please let me know if you have any questions or if there's anything that
would be useful to add/change. Thanks!

Best regards,

Mike

Re: just open sourced Orderly -- a row key schema system (composite keys, etc) for use with HBase

Posted by Michael Dalton <mw...@gmail.com>.
Thanks, will do!

Best,

Mike

On Wed, Nov 16, 2011 at 10:14 AM, Stack <st...@duboce.net> wrote:

> On Wed, Nov 16, 2011 at 12:32 AM, Michael Dalton <mw...@gmail.com>
> wrote:
> > Hi Denis,
> >
> > Yeah we got a corporate github account, it's now at
> > https://github.com/zettaset/orderly . Sorry for the confusion
> >
>
> Consider sticking a link to Orderly up here, Michael:
> http://wiki.apache.org/hadoop/SupportingProjects
> Yours,
> St.Ack
>

Re: just open sourced Orderly -- a row key schema system (composite keys, etc) for use with HBase

Posted by Stack <st...@duboce.net>.
On Wed, Nov 16, 2011 at 12:32 AM, Michael Dalton <mw...@gmail.com> wrote:
> Hi Denis,
>
> Yeah we got a corporate github account, it's now at
> https://github.com/zettaset/orderly . Sorry for the confusion
>

Consider sticking a link to Orderly up here, Michael:
http://wiki.apache.org/hadoop/SupportingProjects
Yours,
St.Ack

Re: just open sourced Orderly -- a row key schema system (composite keys, etc) for use with HBase

Posted by Michael Dalton <mw...@gmail.com>.
Hi Denis,

Yeah we got a corporate github account, it's now at
https://github.com/zettaset/orderly . Sorry for the confusion

Best,

Mike

On Wed, Nov 16, 2011 at 12:24 AM, Denis Kreis <de...@gmail.com> wrote:

> Hi Mike,
>
> What's the status of the project? Has it been moved to another location?
>
> Best regards,
> Denis
>

Re: just open sourced Orderly -- a row key schema system (composite keys, etc) for use with HBase

Posted by Denis Kreis <de...@gmail.com>.
Hi Mike,

What's the status of the project? Has it been moved to another location?

Best regards,
Denis





Re: just open sourced Orderly -- a row key schema system (composite keys, etc) for use with HBase

Posted by Ted Dunning <td...@maprtech.com>.
This is a subtle and clever point.

On Wed, Apr 13, 2011 at 11:25 PM, Michael Dalton <mw...@gmail.com> wrote:

> Avro avoids deserialization when sorting their data, but they use custom
> byte array comparators for different types. All of our encodings, including
> struct/record types, actually sort if you just compare the raw bytes using
> Bytes.compareTo.

Re: just open sourced Orderly -- a row key schema system (composite keys, etc) for use with HBase

Posted by Michael Dalton <mw...@gmail.com>.
Hi Ted,

Thanks for pointing that out, I hadn't read the Avro sort-order spec recently.
It looks like there's some overlap at a high level (providing byte array
representations that can be sorted without deserialization). After glancing at
BinaryEncoder/Decoder in Avro, it looks to me like the differences are:

   - Avro avoids deserialization when sorting their data, but they use
   custom byte array comparators for different types. All of our encodings,
   including struct/record types, actually sort if you just compare the raw
   bytes using Bytes.compareTo. You can directly use the serialized byte values
   from this project in HBase without requiring HBase to implement its own
   custom comparator functions (it appears that for Avro key support, you'd
   need to parse Avro's schemas and have a custom comparator defined for each
   data type, and this would be used in HBase's sorting functions). You can
   drop Orderly's row keys into HBase without modifying the code base at all
   (a small sketch of this idea follows the list).
   - Per the above point, the actual serialization algorithms we use are
   quite different as we can't rely on custom comparator functions -- just
   Bytes.compareTo comparing raw bytes. The serializations end up with similar
   goals (i.e., variable-length zig-zag integers) but the implementation and
   algorithms are very different.
   - Very slightly more compact encodings in certain situations for some
   types -- our Strings don't require an integer length, they use a terminator
   byte, and in ascending sort don't even require the terminator byte. Our
   variable-length integers have some very minor length differences (by a bit
   or two) in some larger variable length long serializations. Probably not
   enough to really matter in all honesty.
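
To make the first point above concrete, here's a similar toy sketch (again,
this is not Orderly's actual wire format or API; the names are invented for
the example) of a two-field struct key -- a string followed by a long -- whose
raw bytes compare correctly under a plain unsigned comparison such as
Bytes.compareTo. It assumes the string contains no NUL characters, so the 0x00
byte is free to act as the field terminator:

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class StructKeySketch {
        // Encode a (String, long) pair as one byte array whose raw, unsigned
        // lexicographic order is: first by string, then by numeric long value.
        static byte[] encode(String s, long v) {
            byte[] str = s.getBytes(StandardCharsets.UTF_8);
            ByteBuffer buf = ByteBuffer.allocate(str.length + 1 + 8);
            buf.put(str).put((byte) 0x00);   // string field plus 0x00 terminator
            buf.putLong(v ^ Long.MIN_VALUE); // sign-flipped big-endian long field
            return buf.array();
        }

        // Plain unsigned byte comparison, standing in for Bytes.compareTo.
        static int compareUnsigned(byte[] a, byte[] b) {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                int cmp = (a[i] & 0xff) - (b[i] & 0xff);
                if (cmp != 0) return cmp;
            }
            return a.length - b.length;
        }

        public static void main(String[] args) {
            byte[] k1 = encode("app", -5);
            byte[] k2 = encode("app", 7);
            byte[] k3 = encode("apple", 3);
            // Within the same string, the long field decides the order.
            System.out.println(compareUnsigned(k1, k2) < 0); // true
            // "app" sorts before "apple": the 0x00 terminator is smaller than
            // any byte that can continue a longer string.
            System.out.println(compareUnsigned(k2, k3) < 0); // true
        }
    }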

Avro is very cool, and as a general serialization and RPC platform it's
definitely fantastic. Orderly is a more focused solution for producing byte
arrays for use in projects like HBase, without requiring those projects to
integrate a full serialization system. If you have any more questions, or if
there are features I missed that I should be contrasting, let me know.

Best regards,

Mike

On Wed, Apr 13, 2011 at 8:07 PM, Ted Dunning <td...@maprtech.com> wrote:

> Michael,
>
> Interesting contribution to the open source community.  Sounds like nice
> work.
>
> Can you say how this relates to Avro with regard to collating of binary
> data?
>
> See, for instance, here:
> http://avro.apache.org/docs/current/spec.html#order
>
>
> On Wed, Apr 13, 2011 at 5:55 PM, Michael Dalton <mw...@gmail.com> wrote:
>
>> [...]
>

Re: just open sourced Orderly -- a row key schema system (composite keys, etc) for use with HBase

Posted by Ted Dunning <td...@maprtech.com>.
Michael,

Interesting contribution to the open source community.  Sounds like nice
work.

Can you say how this relates to Avro with regard to collating of binary
data?

See, for instance, here: http://avro.apache.org/docs/current/spec.html#order

On Wed, Apr 13, 2011 at 5:55 PM, Michael Dalton <mw...@gmail.com> wrote:

> [...]

Re: just open sourced Orderly -- a row key schema system (composite keys, etc) for use with HBase

Posted by Andrew Purtell <ap...@apache.org>.
Michael (and GotoMetrics),

Thank you for opening this up!

Best regards,

    - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)


--- On Wed, 4/13/11, Michael Dalton <mw...@gmail.com> wrote:

> Hi all,
> 
> I'm with a startup, GotoMetrics, doing things with Hadoop and I've
> gotten permission to open source Orderly -- our row key schema system
> for use in projects like HBase. Orderly allows you to serialize common
> data types (long, double, bigdecimal, etc) or structs/records of these
> types to byte arrays, and ensures that the byte arrays sort in the
> same natural order as the data type. You may then use the byte arrays
> as keys in HBase (or any sorted, byte-typed key-value store).
[...]