Posted to user@hbase.apache.org by Nick Dimiduk <nd...@gmail.com> on 2013/04/01 20:00:46 UTC

HBase Types: Explicit Null Support

Heya,

Thinking about data types and serialization. I think null support is an
important characteristic for the serialized representations, especially
when considering the compound type. However, doing so is directly
incompatible with fixed-width representations for numerics. For instance,
if we want to have a fixed-width signed long stored in 8 bytes, where do
you put null? float and double types can cheat a little by folding negative
and positive NaNs into a single representation (this isn't strictly
correct!), leaving a place to represent null. In the long example case, the
obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This
frees up one encoding which can be used for null. My
experience working with scientific data, however, makes me wince at the
idea.
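
To make the fixed-width trade-off concrete, here is a minimal sketch that keeps
the 8-byte width by giving up Long.MIN_VALUE and using its bit pattern for null.
The class name FixedWidthNullableLong and the choice of sentinel are assumptions
made for illustration; nothing here is an existing HBase API.

```java
/** Hypothetical sketch: a fixed-width 8-byte long that gives up Long.MIN_VALUE to represent null. */
final class FixedWidthNullableLong {
    /** Bit pattern standing in for null (an assumption of this sketch). */
    static final long NULL_SENTINEL = Long.MIN_VALUE;

    /** Encodes a nullable long as exactly 8 big-endian bytes. */
    static byte[] encode(Long value) {
        if (value != null && value == NULL_SENTINEL) {
            // The price of null support: one value is no longer representable.
            throw new IllegalArgumentException("Long.MIN_VALUE is reserved for null");
        }
        long bits = (value == null) ? NULL_SENTINEL : value;
        byte[] out = new byte[8];
        for (int i = 7; i >= 0; i--) {
            out[i] = (byte) bits;   // least significant byte goes last
            bits >>>= 8;
        }
        return out;
    }

    /** Decodes 8 big-endian bytes, mapping the sentinel back to null. */
    static Long decode(byte[] bytes) {
        if (bytes.length != 8) {
            throw new IllegalArgumentException("expected 8 bytes");
        }
        long bits = 0;
        for (byte b : bytes) {
            bits = (bits << 8) | (b & 0xFF);
        }
        return (bits == NULL_SENTINEL) ? null : Long.valueOf(bits);
    }
}
```

Every caller pays for the reserved value whether or not it ever stores a null,
which is the part that hurts for scientific data.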

The variable-width encodings have it a little easier. There's already
enough going on that it's simpler to make room.

Remember, the final goal is to support order-preserving serialization. This
imposes some limitations on our encoding strategies. For instance, it's not
enough to simply encode null; it really needs to be encoded as 0x00 so that
it sorts lexicographically earlier than any other value.
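
As a rough sketch of the ordering requirement, assume a one-byte header in front
of the value: 0x00 for null, 0x01 followed by the sign-flipped big-endian long
otherwise. This keeps the full long range at the cost of an extra byte. The
class name, the header values and the compareBytes helper are illustrative
assumptions, not an existing API; compareBytes only mimics the unsigned
byte-wise comparison HBase applies to row keys.

```java
/** Hypothetical sketch: order-preserving, nullable long encoding with a one-byte header. */
final class OrderedNullableLong {
    static final byte NULL_HEADER = 0x00;    // sorts before every non-null value
    static final byte VALUE_HEADER = 0x01;

    /** Null becomes the single byte 0x00; a value becomes 0x01 plus 8 payload bytes. */
    static byte[] encode(Long value) {
        if (value == null) {
            return new byte[] { NULL_HEADER };
        }
        // Flipping the sign bit makes two's-complement longs sort correctly
        // when their big-endian bytes are compared as unsigned values.
        long bits = value ^ Long.MIN_VALUE;
        byte[] out = new byte[9];
        out[0] = VALUE_HEADER;
        for (int i = 8; i >= 1; i--) {
            out[i] = (byte) bits;
            bits >>>= 8;
        }
        return out;
    }

    /** Reverses encode(); rejects anything that is not a well-formed encoding. */
    static Long decode(byte[] bytes) {
        if (bytes.length == 1 && bytes[0] == NULL_HEADER) {
            return null;
        }
        if (bytes.length != 9 || bytes[0] != VALUE_HEADER) {
            throw new IllegalArgumentException("malformed encoding");
        }
        long bits = 0;
        for (int i = 1; i <= 8; i++) {
            bits = (bits << 8) | (bytes[i] & 0xFF);
        }
        return Long.valueOf(bits ^ Long.MIN_VALUE);
    }

    /** Unsigned lexicographic comparison of two byte arrays. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }
}
```

With this layout, compareBytes(encode(null), encode(Long.MIN_VALUE)) is negative,
and the encodings of -1, 0 and 1 compare in numeric order.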

What do you think? Any ideas, experiences, etc?

Thanks,
Nick

Re: HBase Types: Explicit Null Support

Posted by Ted Yu <yu...@gmail.com>.

bq. with a base implementation that does not support nulls

+1


On Mon, Apr 1, 2013 at 1:32 PM, Nick Dimiduk <nd...@gmail.com> wrote:

> Thanks for the thoughtful response (and code!).
>
> I'm thinking I will press forward with a base implementation that does not
> support nulls. The idea is to provide an extensible set of interfaces, so I
> think this will not box us into a corner later. That is, a mirroring
> package could be implemented that supports null values and accepts
> the relevant trade-offs.
>
> Thanks,
> Nick
>
> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mc...@hotpads.com> wrote:
>
> > I spent some time this weekend extracting bits of our serialization code
> > to a public github repo at http://github.com/hotpads/data-tools.
> > Contributions are welcome - I'm sure we all have this stuff lying around.
> >
> > You can see I've bumped into the NULL problem in a few places:
> > * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
> > * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
> >
> > Looking back, I think my latest opinion on the topic is to reject
> > nullability as the rule since it can cause unexpected behavior and
> > confusion.  It's cleaner to provide a wrapper class (so both LongArrayList
> > plus NullableLongArrayList) that explicitly defines the behavior, and costs
> > a little more in performance.  If the user can't find a pre-made wrapper
> > class, it's not very difficult for each user to provide their own
> > interpretation of null and check for it themselves.
> >
> > If you reject nullability, the question becomes what to do in situations
> > where you're implementing existing interfaces that accept nullable params.
> > The LongArrayList above implements List<Long> which requires an add(Long)
> > method.  In the above implementation I chose to swap nulls with
> > Long.MIN_VALUE, however I'm now thinking it best to force the user to make
> > that swap and then throw IllegalArgumentException if they pass null.
> >
> >
> > On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <doug.meil@explorysmedical.com> wrote:
> >
> > >
> > > Hmmm... good question.
> > >
> > > I think that fixed-width support is important for a great many rowkey
> > > construction cases, so I'd rather see something like losing MIN_VALUE and
> > > keeping fixed width.
> > >
>
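
The "reject null" behavior Matt describes in the quoted message can be sketched
as follows. This is a simplified stand-in, not the LongArrayList from the
hotpads repo: the class name StrictLongList is hypothetical and it deliberately
does not implement the full List<Long> interface.

```java
/** Hypothetical sketch of a primitive-backed list that refuses nulls outright. */
final class StrictLongList {
    private long[] values = new long[8];
    private int size = 0;

    /** Mirrors List<Long>.add(Long) but rejects null instead of substituting a sentinel. */
    boolean add(Long value) {
        if (value == null) {
            throw new IllegalArgumentException("null is not supported; pick your own sentinel");
        }
        if (size == values.length) {
            // Grow the backing array when it is full.
            values = java.util.Arrays.copyOf(values, size * 2);
        }
        values[size++] = value;   // auto-unboxing is safe after the null check
        return true;
    }

    /** Returns the primitive at the given position. */
    long get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return values[index];
    }

    int size() {
        return size;
    }
}
```

A nullable variant would wrap this class and define its own mapping for null,
so callers who never store nulls do not pay for the extra check or lose a value.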

Re: HBase Types: Explicit Null Support

Posted by Michel Segel <mi...@hotmail.com>.

Silly question...
Null support. In a system where a column may or may not exist, how do you support null?

;-)

In terms of a key,  it's a primary key and can't be null.  


So what am I missing?


Sent from a remote device. Please excuse any typos...

Mike Segel

On Apr 1, 2013, at 10:26 PM, Nick Dimiduk <nd...@gmail.com> wrote:

> Furthermore, is it more important to support null values than to squeeze all
> representations into minimum size (4 bytes for int32, &c.)?
> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
>
>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jt...@salesforce.com> wrote:
>>
>>> From the SQL perspective, handling null is important.
>>
>>
>> From your perspective, is it critical to support NULLs, even at the
>> expense of fixed-width encodings altogether, or of representing the
>> full range of values? That is, you'd rather be able to represent NULL than
>> -2^31?
>>

Re: HBase Types: Explicit Null Support

Posted by Nick Dimiduk <nd...@gmail.com>.

On Wed, Apr 3, 2013 at 11:29 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Hiya Nick,
> Pig converts data for HBase storage using this class:
>
> https://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseBinaryConverter.java
> (which is mostly just calling into HBase's Bytes class). As long as Bytes
> handles the null stuff, we'll just inherit the behavior.
>

Dmitriy,

Precisely how this will be exposed via the hbase client is TBD. We won't be
deprecating the existing Bytes utility from the client view, so a new API
for supporting these types will be provided. I'll be able to provide
support and/or a patch for Pig (et al) once the implementation is a bit
further along.

My question for you as a Pig representative is more about how Pig users
expect Pig to handle NULLs. Are NULL values within a tuple a
common occurrence in Pig? In comparison, I'm thinking about the prevalence
of NULL in SQL.

Thanks,
Nick

> On Tue, Apr 2, 2013 at 9:40 AM, Nick Dimiduk <nd...@gmail.com> wrote:
>
> > I agree that a user-extensible interface is a required feature here.
> > Personally, I'd love to ship a set of standard GIS tools on HBase. Let's
> > keep in mind, though, that SQL and user applications are not the only
> > consumers of this interface. A big motivation is allowing interop with the
> > other higher MR languages. *cough* Where are my Pig and Hive peeps in this
> > thread?
> >
> > On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <jtaylor@salesforce.com> wrote:
> >
> > > Maybe if we can keep nullability separate from the
> > > serialization/deserialization, we can come up with a solution that works?
> > > We're able to essentially infer that a column is null based on its value
> > > being missing or empty. So if an iterator through the row key bytes could
> > > detect/indicate that, then an application could "infer" the value is null.
> > >
> > > We're definitely planning on keeping byte[] accessors for use cases that
> > > need it. I'm curious on the geographic data case, though, could you use a
> > > fixed length long with a couple of new SQL built-ins to encode/decode the
> > > latitude/longitude?
> > >
> > >
> > > On 04/01/2013 11:29 PM, Jesse Yates wrote:
> > >
> > >> Actually, that isn't all that far-fetched of a format Matt - pretty common
> > >> anytime anyone wants to do sortable lat/long (*cough* three letter agencies
> > >> *cough*).
> > >>
> > >> Wouldn't we get the same by providing a simple set of libraries (ala
> > >> orderly + other HBase useful things) and then still give access to the
> > >> underlying byte array? Perhaps a nullable key type in that lib makes sense
> > >> if lots of people need it and it would be nice to have standard libraries
> > >> so tools could interop much more easily.
> > >> -------------------
> > >> Jesse Yates
> > >> @jesse_yates
> > >> jyates.github.com
> > >>
> > >>
> > >> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <mc...@hotpads.com> wrote:
> > >>
> > >>> Ah, I didn't even realize sql allowed null key parts.  Maybe a goal of the
> > >>> interfaces should be to provide first-class support for custom user types
> > >>> in addition to the standard ones included.  Part of the power of hbase's
> > >>> plain byte[] keys is that users can concoct the perfect key for their data
> > >>> type.  For example, I have a lot of geographic data where I interleave
> > >>> latitude/longitude bits into a sortable 64 bit value that would probably
> > >>> never be included in a standard library.
> > >>>
> > >>>
> > >>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <en...@gmail.com> wrote:
> > >>>
> > >>>> I think having Int32, and NullableInt32 would support minimum overhead,
> > >>>> as well as allowing SQL semantics.
>
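
For readers unfamiliar with the interleaved latitude/longitude key Matt mentions
in the quoted thread, the general technique is a Z-order (Morton) code. The
sketch below is an assumption about how such a key could be built, not Matt's
actual code: the class name GeoKey, the 32-bit scaling of each coordinate and
the choice to put latitude in the higher bit of each pair are all illustrative.

```java
/** Hypothetical sketch: Z-order (Morton) key built from latitude and longitude. */
final class GeoKey {
    /** Maps a coordinate in [min, max] onto the unsigned 32-bit range [0, 2^32 - 1]. */
    static long scale(double value, double min, double max) {
        if (value < min || value > max) {
            throw new IllegalArgumentException("out of range: " + value);
        }
        double fraction = (value - min) / (max - min);
        return (long) (fraction * 0xFFFFFFFFL);
    }

    /** Spreads the low 32 bits of x so that bit i moves to bit 2*i. */
    static long spread(long x) {
        x &= 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8))  & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2))  & 0x3333333333333333L;
        x = (x | (x << 1))  & 0x5555555555555555L;
        return x;
    }

    /** Interleaves latitude bits (odd positions) with longitude bits (even positions). */
    static long encode(double latitude, double longitude) {
        long lat = scale(latitude, -90.0, 90.0);
        long lon = scale(longitude, -180.0, 180.0);
        return (spread(lat) << 1) | spread(lon);
    }
}
```

The result uses all 64 bits, so it has to be compared as an unsigned value (for
example with Long.compareUnsigned, or as big-endian bytes) to sort correctly.
That also means there is no spare bit pattern left over for null.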

Re: HBase Types: Explicit Null Support

Posted by Nick Dimiduk <nd...@gmail.com>.

On Wed, Apr 3, 2013 at 11:29 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Hiya Nick,
> Pig converts data for HBase storage using this class:
>
> https://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseBinaryConverter.java(which
> is mostly just calling into HBase's Bytes class). As long as Bytes
> handles the null stuff, we'll just inherit the behavior.
>

Dmitriy,

Precisely how this will be exposed via the hbase client is TBD. We won't be
deprecating the existing Bytes utility from the client view, so a new API
for supporting these types will be provided. I'll be able to provide
support and/or a patch for Pig (et al) once  the implementation is a bit
further along.

My question for you as a Pig representative is more about how Pig users
expect Pig to handle NULLs. Are NULL values within a tuple a
common occurrence in Pig? In comparison, I'm thinking about the prevalence
of NULL in SQL.

Thanks,
Nick

On Tue, Apr 2, 2013 at 9:40 AM, Nick Dimiduk <nd...@gmail.com> wrote:
>
> > I agree that a user-extensible interface is a required feature here.
> > Personally, I'd love to ship a set of standard GIS tools on HBase. Let's
> > keep in mind, though, that SQL and user applications are not the only
> > consumers of this interface. A big motivation is allowing interop with
> the
> > other higher MR languages. *cough* Where are my Pig and Hive peeps in
> this
> > thread?
> >
> > On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <jtaylor@salesforce.com
> > >wrote:
> >
> > > Maybe if we can keep nullability separate from the
> > > serialization/deserialization, we can come up with a solution that
> works?
> > > We're able to essentially infer that a column is null based on its
> value
> > > being missing or empty. So if an iterator through the row key bytes
> could
> > > detect/indicate that, then an application could "infer" the value is
> > null.
> > >
> > > We're definitely planning on keeping byte[] accessors for use cases
> that
> > > need it. I'm curious on the geographic data case, though, could you
> use a
> > > fixed length long with a couple of new SQL built-ins to encode/decode
> the
> > > latitude/longitude?
> > >
> > >
> > > On 04/01/2013 11:29 PM, Jesse Yates wrote:
> > >
> > >> Actually, that isn't all that far-fetched of a format Matt - pretty
> > common
> > >> anytime anyone wants to do sortable lat/long (*cough* three letter
> > >> agencies
> > >> cough*).
> > >>
> > >> Wouldn't we get the same by providing a simple set of libraries (ala
> > >> orderly + other HBase useful things) and then still give access to the
> > >> underlying byte array? Perhaps a nullable key type in that lib makes
> > sense
> > >> if lots of people need it and it would be nice to have standard
> > libraries
> > >> so tools could interop much more easily.
> > >> -------------------
> > >> Jesse Yates
> > >> @jesse_yates
> > >> jyates.github.com
> > >>
> > >>
> > >> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <mc...@hotpads.com>
> > wrote:
> > >>
> > >>  Ah, I didn't even realize sql allowed null key parts.  Maybe a goal
> of
> > >>> the
> > >>> interfaces should be to provide first-class support for custom user
> > types
> > >>> in addition to the standard ones included.  Part of the power of
> > hbase's
> > >>> plain byte[] keys is that users can concoct the perfect key for their
> > >>> data
> > >>> type.  For example, I have a lot of geographic data where I
> interleave
> > >>> latitude/longitude bits into a sortable 64 bit value that would
> > probably
> > >>> never be included in a standard library.
> > >>>
> > >>>
> > >>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <en...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>  I think having Int32, and NullableInt32 would support minimum
> > overhead,
> > >>>>
> > >>> as
> > >>>
> > >>>> well as allowing SQL semantics.
> > >>>>
> > >>>>
> > >>>> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <nd...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>>  Furthermore, is it more important to support null values than
> squeeze
> > >>>>>
> > >>>> all
> > >>>
> > >>>> representations into minimum size (4-bytes for int32, &c.)?
> > >>>>> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
> > >>>>>
> > >>>>>  On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <
> > jtaylor@salesforce.com
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>   From the SQL perspective, handling null is important.
> > >>>>>>>
> > >>>>>>
> > >>>>>>  From your perspective, it is critical to support NULLs, even at
> the
> > >>>>>> expense of fixed-width encodings at all or supporting
> representation
> > >>>>>>
> > >>>>> of a
> > >>>>
> > >>>>> full range of values. That is, you'd rather be able to represent
> NULL
> > >>>>>>
> > >>>>> than
> > >>>>>
> > >>>>>> -2^31?
> > >>>>>>
> > >>>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
> > >>>>>>
> > >>>>>>> Thanks for the thoughtful response (and code!).
> > >>>>>>>>
> > >>>>>>>> I'm thinking I will press forward with a base implementation
> that
> > >>>>>>>>
> > >>>>>>> does
> > >>>>
> > >>>>>  not
> > >>>>>>>> support nulls. The idea is to provide an extensible set of
> > >>>>>>>>
> > >>>>>>> interfaces,
> > >>>>
> > >>>>>  so I
> > >>>>>>>> think this will not box us into a corner later. That is, a
> > >>>>>>>>
> > >>>>>>> mirroring
> > >>>
> > >>>>  package could be implemented that supports null values and accepts
> > >>>>>>>> the relevant trade-offs.
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Nick
> > >>>>>>>>
> > >>>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <
> mcorgan@hotpads.com
> > >
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> I spent some time this weekend extracting bits of our serialization
> > >>>>>>>>> code to a public github repo at http://github.com/hotpads/data-tools .
> > >>>>>>>>> Contributions are welcome - i'm sure we all have this stuff laying
> > >>>>>>>>> around.
> > >>>>>>>>>
> > >>>>>>>>> You can see I've bumped into the NULL problem in a few places:
> > >>>>>>>>> *
> > >>>>>>>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
> > >>>>>>>>> *
> > >>>>>>>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
> > >>>>>>>>>
> > >>>>>>>>> Looking back, I think my latest opinion on the topic is to reject
> > >>>>>>>>> nullability as the rule since it can cause unexpected behavior and
> > >>>>>>>>> confusion.  It's cleaner to provide a wrapper class (so both
> > >>>>>>>>> LongArrayList plus NullableLongArrayList) that explicitly defines the
> > >>>>>>>>> behavior, and costs a little more in performance.  If the user can't
> > >>>>>>>>> find a pre-made wrapper class, it's not very difficult for each user
> > >>>>>>>>> to provide their own interpretation of null and check for it
> > >>>>>>>>> themselves.
> > >>>>>>>>>
> > >>>>>>>>> If you reject nullability, the question becomes what to do in
> > >>>>>>>>> situations where you're implementing existing interfaces that accept
> > >>>>>>>>> nullable params.  The LongArrayList above implements List<Long> which
> > >>>>>>>>> requires an add(Long) method.  In the above implementation I chose to
> > >>>>>>>>> swap nulls with Long.MIN_VALUE, however I'm now thinking it best to
> > >>>>>>>>> force the user to make that swap and then throw
> > >>>>>>>>> IllegalArgumentException if they pass null.
> > >>>>>>>>>
> > >>>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <
> > >>>>>>>>> doug.meil@explorysmedical.com
> > >>>>>>>>>
> > >>>>>>>>>  wrote:
> > >>>>>>>>>> Hmmm... good question.
> > >>>>>>>>>>
> > >>>>>>>>>> I think that fixed width support is important for a great many
> > >>>>>>>>>>
> > >>>>>>>>> rowkey
> > >>>>
> > >>>>>  constructs cases, so I'd rather see something like losing
> > >>>>>>>>>>
> > >>>>>>>>> MIN_VALUE
> > >>>
> > >>>> and
> > >>>>>
> > >>>>>>  keeping fixed width.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>   Heya,
> > >>>>>>>>>>
> > >>>>>>>>>>> Thinking about data types and serialization. I think null
> > >>>>>>>>>>>
> > >>>>>>>>>> support
> > >>>
> > >>>> is
> > >>>>
> > >>>>>  an
> > >>>>>>>>>>> important characteristic for the serialized representations,
> > >>>>>>>>>>> especially
> > >>>>>>>>>>> when considering the compound type. However, doing so is
> > >>>>>>>>>>>
> > >>>>>>>>>> directly
> > >>>
> > >>>>  incompatible with fixed-width representations for numerics. For
> > >>>>>>>>>>>
> > >>>>>>>>>>>  instance,
> > >>>>>>>>>> if we want to have a fixed-width signed long stored on
> 8-bytes,
> > >>>>>>>>>>
> > >>>>>>>>> where
> > >>>>
> > >>>>>  do
> > >>>>>>>>>>> you put null? float and double types can cheat a little by
> > >>>>>>>>>>>
> > >>>>>>>>>> folding
> > >>>
> > >>>>  negative
> > >>>>>>>>>>> and positive NaN's into a single representation (this isn't
> > >>>>>>>>>>>
> > >>>>>>>>>> strictly
> > >>>>
> > >>>>>  correct!), leaving a place to represent null. In the long
> > >>>>>>>>>>>
> > >>>>>>>>>> example
> > >>>
> > >>>>  case,
> > >>>>>>>>>>> the
> > >>>>>>>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE
> by
> > >>>>>>>>>>>
> > >>>>>>>>>> one.
> > >>>>
> > >>>>>  This
> > >>>>>>>>>>> will allocate an additional encoding which can be used for
> > null.
> > >>>>>>>>>>>
> > >>>>>>>>>> My
> > >>>>
> > >>>>>  experience working with scientific data, however, makes me wince
> > >>>>>>>>>>>
> > >>>>>>>>>> at
> > >>>>
> > >>>>>  the
> > >>>>>>>>>>> idea.
> > >>>>>>>>>>>
> > >>>>>>>>>>> The variable-width encodings have it a little easier. There's
> > >>>>>>>>>>>
> > >>>>>>>>>> already
> > >>>>>
> > >>>>>>  enough going on that it's simpler to make room.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Remember, the final goal is to support order-preserving
> > >>>>>>>>>>>
> > >>>>>>>>>> serialization.
> > >>>>>
> > >>>>>>  This
> > >>>>>>>>>>> imposes some limitations on our encoding strategies. For
> > >>>>>>>>>>>
> > >>>>>>>>>> instance,
> > >>>
> > >>>>  it's
> > >>>>>>>>>>> not
> > >>>>>>>>>>> enough to simply encode null, it really needs to be encoded
> as
> > >>>>>>>>>>>
> > >>>>>>>>>> 0x00
> > >>>>
> > >>>>> so
> > >>>>>
> > >>>>>>  as
> > >>>>>>>>>> to sort lexicographically earlier than any other value.
> > >>>>>>>>>>
> > >>>>>>>>>>> What do you think? Any ideas, experiences, etc?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks,
> > >>>>>>>>>>> Nick
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >
> >
>
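
The trade-off debated in the quoted thread (a fixed-width long with no room for null, versus a nullable type that gives up the fixed width) can be sketched as follows. This is only an illustration under stated assumptions, not the encoding HBase adopted: the class name OrderedLongSketch, the one-byte header, and the choice of 0x01 for "value present" are all hypothetical. The fixed-width form flips the sign bit so that unsigned byte-wise comparison matches signed numeric order; the nullable form spends one extra byte so that null is encoded as 0x00 and sorts before every value.

```java
import java.util.Arrays;

// Hypothetical sketch; names and header byte values are assumptions for illustration.
final class OrderedLongSketch {

    // Fixed width: 8 big-endian bytes with the sign bit flipped, so that
    // unsigned lexicographic byte order equals signed numeric order.
    static byte[] encode(long value) {
        long bits = value ^ Long.MIN_VALUE;
        byte[] out = new byte[8];
        for (int i = 7; i >= 0; i--) {
            out[i] = (byte) bits;
            bits >>>= 8;
        }
        return out;
    }

    static long decode(byte[] bytes) {
        long bits = 0;
        for (int i = 0; i < 8; i++) {
            bits = (bits << 8) | (bytes[i] & 0xFF);
        }
        return bits ^ Long.MIN_VALUE;
    }

    // Nullable: a single 0x00 byte for null, otherwise 0x01 followed by the
    // 8-byte fixed-width form. The full long range stays representable, at
    // the cost of one byte per value and a variable width.
    static byte[] encodeNullable(Long value) {
        if (value == null) {
            return new byte[] {0x00};
        }
        byte[] out = new byte[9];
        out[0] = 0x01;
        System.arraycopy(encode(value), 0, out, 1, 8);
        return out;
    }

    static Long decodeNullable(byte[] bytes) {
        if (bytes[0] == 0x00) {
            return null;
        }
        return decode(Arrays.copyOfRange(bytes, 1, 9));
    }

    // Unsigned lexicographic comparison, the order in which byte[] keys sort.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

Keeping the two encoders separate mirrors the Int32 / NullableInt32 split suggested in the thread: callers who never store null pay nothing for it.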

Re: HBase Types: Explicit Null Support

Posted by Nick Dimiduk <nd...@gmail.com>.

On Wed, Apr 3, 2013 at 11:29 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Hiya Nick,
> Pig converts data for HBase storage using this class:
>
> https://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseBinaryConverter.java
> (which is mostly just calling into HBase's Bytes class). As long as Bytes
> handles the null stuff, we'll just inherit the behavior.
>

Dmitriy,

Precisely how this will be exposed via the HBase client is TBD. We won't be
deprecating the existing Bytes utility from the client view, so a new API
for supporting these types will be provided. I'll be able to provide
support and/or a patch for Pig (et al.) once the implementation is a bit
further along.

My question for you as a Pig representative is more about how Pig users
expect Pig to handle NULLs. Are NULL values within a tuple a
common occurrence in Pig? In comparison, I'm thinking about the prevalence
of NULL in SQL.

Thanks,
Nick

> On Tue, Apr 2, 2013 at 9:40 AM, Nick Dimiduk <nd...@gmail.com> wrote:
>
> > I agree that a user-extensible interface is a required feature here.
> > Personally, I'd love to ship a set of standard GIS tools on HBase. Let's
> > keep in mind, though, that SQL and user applications are not the only
> > consumers of this interface. A big motivation is allowing interop with
> the
> > other higher MR languages. *cough* Where are my Pig and Hive peeps in
> this
> > thread?
> >
> > On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <jtaylor@salesforce.com
> > >wrote:
> >
> > > Maybe if we can keep nullability separate from the
> > > serialization/deserialization, we can come up with a solution that
> works?
> > > We're able to essentially infer that a column is null based on its
> value
> > > being missing or empty. So if an iterator through the row key bytes
> could
> > > detect/indicate that, then an application could "infer" the value is
> > null.
> > >
> > > We're definitely planning on keeping byte[] accessors for use cases
> that
> > > need it. I'm curious on the geographic data case, though, could you
> use a
> > > fixed length long with a couple of new SQL built-ins to encode/decode
> the
> > > latitude/longitude?
> > >
> > >
> > > On 04/01/2013 11:29 PM, Jesse Yates wrote:
> > >
> > >> Actually, that isn't all that far-fetched of a format Matt - pretty
> > common
> > >> anytime anyone wants to do sortable lat/long (*cough* three letter
> > >> agencies
> > >> cough*).
> > >>
> > >> Wouldn't we get the same by providing a simple set of libraries (ala
> > >> orderly + other HBase useful things) and then still give access to the
> > >> underlying byte array? Perhaps a nullable key type in that lib makes
> > sense
> > >> if lots of people need it and it would be nice to have standard
> > libraries
> > >> so tools could interop much more easily.
> > >> -------------------
> > >> Jesse Yates
> > >> @jesse_yates
> > >> jyates.github.com
> > >>
> > >>
> > >> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <mc...@hotpads.com>
> > wrote:
> > >>
> > >>  Ah, I didn't even realize sql allowed null key parts.  Maybe a goal
> of
> > >>> the
> > >>> interfaces should be to provide first-class support for custom user
> > types
> > >>> in addition to the standard ones included.  Part of the power of
> > hbase's
> > >>> plain byte[] keys is that users can concoct the perfect key for their
> > >>> data
> > >>> type.  For example, I have a lot of geographic data where I
> interleave
> > >>> latitude/longitude bits into a sortable 64 bit value that would
> > probably
> > >>> never be included in a standard library.
> > >>>
> > >>>
> > >>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <en...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>  I think having Int32, and NullableInt32 would support minimum
> > overhead,
> > >>>>
> > >>> as
> > >>>
> > >>>> well as allowing SQL semantics.
> > >>>>
> > >>>>
> > >>>> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <nd...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>>  Furthermore, is it more important to support null values than
> squeeze
> > >>>>>
> > >>>> all
> > >>>
> > >>>> representations into minimum size (4-bytes for int32, &c.)?
> > >>>>> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
> > >>>>>
> > >>>>>  On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <
> > jtaylor@salesforce.com
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>   From the SQL perspective, handling null is important.
> > >>>>>>>
> > >>>>>>
> > >>>>>>  From your perspective, it is critical to support NULLs, even at
> the
> > >>>>>> expense of fixed-width encodings at all or supporting
> representation
> > >>>>>>
> > >>>>> of a
> > >>>>
> > >>>>> full range of values. That is, you'd rather be able to represent
> NULL
> > >>>>>>
> > >>>>> than
> > >>>>>
> > >>>>>> -2^31?
> > >>>>>>
> > >>>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
> > >>>>>>
> > >>>>>>> Thanks for the thoughtful response (and code!).
> > >>>>>>>>
> > >>>>>>>> I'm thinking I will press forward with a base implementation
> that
> > >>>>>>>>
> > >>>>>>> does
> > >>>>
> > >>>>>  not
> > >>>>>>>> support nulls. The idea is to provide an extensible set of
> > >>>>>>>>
> > >>>>>>> interfaces,
> > >>>>
> > >>>>>  so I
> > >>>>>>>> think this will not box us into a corner later. That is, a
> > >>>>>>>>
> > >>>>>>> mirroring
> > >>>
> > >>>>  package could be implemented that supports null values and accepts
> > >>>>>>>> the relevant trade-offs.
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Nick
> > >>>>>>>>
> > >>>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <
> mcorgan@hotpads.com
> > >
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> I spent some time this weekend extracting bits of our serialization
> > >>>>>>>>> code to a public github repo at http://github.com/hotpads/data-tools .
> > >>>>>>>>> Contributions are welcome - i'm sure we all have this stuff laying
> > >>>>>>>>> around.
> > >>>>>>>>>
> > >>>>>>>>> You can see I've bumped into the NULL problem in a few places:
> > >>>>>>>>> *
> > >>>>>>>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
> > >>>>>>>>> *
> > >>>>>>>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
> > >>>>>>>>>
> > >>>>>>>>> Looking back, I think my latest opinion on the topic is to reject
> > >>>>>>>>> nullability as the rule since it can cause unexpected behavior and
> > >>>>>>>>> confusion.  It's cleaner to provide a wrapper class (so both
> > >>>>>>>>> LongArrayList plus NullableLongArrayList) that explicitly defines the
> > >>>>>>>>> behavior, and costs a little more in performance.  If the user can't
> > >>>>>>>>> find a pre-made wrapper class, it's not very difficult for each user
> > >>>>>>>>> to provide their own interpretation of null and check for it
> > >>>>>>>>> themselves.
> > >>>>>>>>>
> > >>>>>>>>> If you reject nullability, the question becomes what to do in
> > >>>>>>>>> situations where you're implementing existing interfaces that accept
> > >>>>>>>>> nullable params.  The LongArrayList above implements List<Long> which
> > >>>>>>>>> requires an add(Long) method.  In the above implementation I chose to
> > >>>>>>>>> swap nulls with Long.MIN_VALUE, however I'm now thinking it best to
> > >>>>>>>>> force the user to make that swap and then throw
> > >>>>>>>>> IllegalArgumentException if they pass null.
> > >>>>>>>>>
> > >>>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <
> > >>>>>>>>> doug.meil@explorysmedical.com
> > >>>>>>>>>
> > >>>>>>>>>  wrote:
> > >>>>>>>>>> Hmmm... good question.
> > >>>>>>>>>>
> > >>>>>>>>>> I think that fixed width support is important for a great many
> > >>>>>>>>>>
> > >>>>>>>>> rowkey
> > >>>>
> > >>>>>  constructs cases, so I'd rather see something like losing
> > >>>>>>>>>>
> > >>>>>>>>> MIN_VALUE
> > >>>
> > >>>> and
> > >>>>>
> > >>>>>>  keeping fixed width.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>   Heya,
> > >>>>>>>>>>
> > >>>>>>>>>>> Thinking about data types and serialization. I think null
> > >>>>>>>>>>>
> > >>>>>>>>>> support
> > >>>
> > >>>> is
> > >>>>
> > >>>>>  an
> > >>>>>>>>>>> important characteristic for the serialized representations,
> > >>>>>>>>>>> especially
> > >>>>>>>>>>> when considering the compound type. However, doing so is
> > >>>>>>>>>>>
> > >>>>>>>>>> directly
> > >>>
> > >>>>  incompatible with fixed-width representations for numerics. For
> > >>>>>>>>>>>
> > >>>>>>>>>>>  instance,
> > >>>>>>>>>> if we want to have a fixed-width signed long stored on
> 8-bytes,
> > >>>>>>>>>>
> > >>>>>>>>> where
> > >>>>
> > >>>>>  do
> > >>>>>>>>>>> you put null? float and double types can cheat a little by
> > >>>>>>>>>>>
> > >>>>>>>>>> folding
> > >>>
> > >>>>  negative
> > >>>>>>>>>>> and positive NaN's into a single representation (this isn't
> > >>>>>>>>>>>
> > >>>>>>>>>> strictly
> > >>>>
> > >>>>>  correct!), leaving a place to represent null. In the long
> > >>>>>>>>>>>
> > >>>>>>>>>> example
> > >>>
> > >>>>  case,
> > >>>>>>>>>>> the
> > >>>>>>>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE
> by
> > >>>>>>>>>>>
> > >>>>>>>>>> one.
> > >>>>
> > >>>>>  This
> > >>>>>>>>>>> will allocate an additional encoding which can be used for
> > null.
> > >>>>>>>>>>>
> > >>>>>>>>>> My
> > >>>>
> > >>>>>  experience working with scientific data, however, makes me wince
> > >>>>>>>>>>>
> > >>>>>>>>>> at
> > >>>>
> > >>>>>  the
> > >>>>>>>>>>> idea.
> > >>>>>>>>>>>
> > >>>>>>>>>>> The variable-width encodings have it a little easier. There's
> > >>>>>>>>>>>
> > >>>>>>>>>> already
> > >>>>>
> > >>>>>>  enough going on that it's simpler to make room.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Remember, the final goal is to support order-preserving
> > >>>>>>>>>>>
> > >>>>>>>>>> serialization.
> > >>>>>
> > >>>>>>  This
> > >>>>>>>>>>> imposes some limitations on our encoding strategies. For
> > >>>>>>>>>>>
> > >>>>>>>>>> instance,
> > >>>
> > >>>>  it's
> > >>>>>>>>>>> not
> > >>>>>>>>>>> enough to simply encode null, it really needs to be encoded
> as
> > >>>>>>>>>>>
> > >>>>>>>>>> 0x00
> > >>>>
> > >>>>> so
> > >>>>>
> > >>>>>>  as
> > >>>>>>>>>> to sort lexicographically earlier than any other value.
> > >>>>>>>>>>
> > >>>>>>>>>>> What do you think? Any ideas, experiences, etc?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks,
> > >>>>>>>>>>> Nick
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >
> >
>
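
The custom geographic key mentioned in the quoted thread (latitude/longitude bits interleaved into a sortable 64-bit value) might look roughly like the sketch below. The actual scheme used at hotpads is not shown in the thread, so everything here is an assumption: the class name GeoKeySketch, the linear scaling of each coordinate onto 32 bits, and the choice to put latitude in the odd bit positions. The result should be treated as an unsigned value, for example compared with Long.compareUnsigned or written big-endian into a row key.

```java
// Hypothetical sketch of a Z-order (Morton) key; not taken from the thread.
final class GeoKeySketch {

    // Map a coordinate in [min, max] linearly onto [0, 2^32 - 1].
    static long scale(double value, double min, double max) {
        double fraction = (value - min) / (max - min);
        return (long) (fraction * 0xFFFFFFFFL);
    }

    // Spread the low 32 bits of x so that input bit i lands on output bit 2*i.
    static long spread(long x) {
        x &= 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8)) & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2)) & 0x3333333333333333L;
        x = (x | (x << 1)) & 0x5555555555555555L;
        return x;
    }

    // Latitude occupies the odd bits, longitude the even bits.
    static long interleave(double latitude, double longitude) {
        long lat = spread(scale(latitude, -90.0, 90.0));
        long lon = spread(scale(longitude, -180.0, 180.0));
        return (lat << 1) | lon;
    }
}
```

Every one of the 2^64 bit patterns is a legitimate key in this layout, which is why a type like this has no spare encoding left over for null.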

Re: HBase Types: Explicit Null Support

Posted by Nick Dimiduk <nd...@gmail.com>.

On Wed, Apr 3, 2013 at 11:29 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Hiya Nick,
> Pig converts data for HBase storage using this class:
>
> https://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseBinaryConverter.java(which
> is mostly just calling into HBase's Bytes class). As long as Bytes
> handles the null stuff, we'll just inherit the behavior.
>

Dmitriy,

Precisely how this will be exposed via the hbase client is TBD. We won't be
deprecating the existing Bytes utility from the client view, so a new API
for supporting these types will be provided. I'll be able to provide
support and/or a patch for Pig (et al) once  the implementation is a bit
further along.

My question for you as a Pig representative is more about how Pig users
expect Pig to handle NULLs. Are NULL values within a tuple a
common occurrence in Pig? In comparison, I'm thinking about the prevalence
of NULL in SQL.

Thanks,
Nick

On Tue, Apr 2, 2013 at 9:40 AM, Nick Dimiduk <nd...@gmail.com> wrote:
>
> > I agree that a user-extensible interface is a required feature here.
> > Personally, I'd love to ship a set of standard GIS tools on HBase. Let's
> > keep in mind, though, that SQL and user applications are not the only
> > consumers of this interface. A big motivation is allowing interop with
> the
> > other higher MR languages. *cough* Where are my Pig and Hive peeps in
> this
> > thread?
> >
> > On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <jtaylor@salesforce.com
> > >wrote:
> >
> > > Maybe if we can keep nullability separate from the
> > > serialization/deserialization, we can come up with a solution that
> works?
> > > We're able to essentially infer that a column is null based on its
> value
> > > being missing or empty. So if an iterator through the row key bytes
> could
> > > detect/indicate that, then an application could "infer" the value is
> > null.
> > >
> > > We're definitely planning on keeping byte[] accessors for use cases
> that
> > > need it. I'm curious on the geographic data case, though, could you
> use a
> > > fixed length long with a couple of new SQL built-ins to encode/decode
> the
> > > latitude/longitude?
> > >
> > >
> > > On 04/01/2013 11:29 PM, Jesse Yates wrote:
> > >
> > >> Actually, that isn't all that far-fetched of a format Matt - pretty
> > common
> > >> anytime anyone wants to do sortable lat/long (*cough* three letter
> > >> agencies
> > >> cough*).
> > >>
> > >> Wouldn't we get the same by providing a simple set of libraries (ala
> > >> orderly + other HBase useful things) and then still give access to the
> > >> underlying byte array? Perhaps a nullable key type in that lib makes
> > sense
> > >> if lots of people need it and it would be nice to have standard
> > libraries
> > >> so tools could interop much more easily.
> > >> -------------------
> > >> Jesse Yates
> > >> @jesse_yates
> > >> jyates.github.com
> > >>
> > >>
> > >> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <mc...@hotpads.com>
> > wrote:
> > >>
> > >>  Ah, I didn't even realize sql allowed null key parts.  Maybe a goal
> of
> > >>> the
> > >>> interfaces should be to provide first-class support for custom user
> > types
> > >>> in addition to the standard ones included.  Part of the power of
> > hbase's
> > >>> plain byte[] keys is that users can concoct the perfect key for their
> > >>> data
> > >>> type.  For example, I have a lot of geographic data where I
> interleave
> > >>> latitude/longitude bits into a sortable 64 bit value that would
> > probably
> > >>> never be included in a standard library.
> > >>>
> > >>>
> > >>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <en...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>  I think having Int32, and NullableInt32 would support minimum
> > overhead,
> > >>>>
> > >>> as
> > >>>
> > >>>> well as allowing SQL semantics.
> > >>>>
> > >>>>
> > >>>> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <nd...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>  Furthermore, is is more important to support null values than
> squeeze
> > >>>>>
> > >>>> all
> > >>>
> > >>>> representations into minimum size (4-bytes for int32, &c.)?
> > >>>>> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
> > >>>>>
> > >>>>>  On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <
> > jtaylor@salesforce.com
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>   From the SQL perspective, handling null is important.
> > >>>>>>>
> > >>>>>>
> > >>>>>>  From your perspective, it is critical to support NULLs, even at
> the
> > >>>>>> expense of fixed-width encodings at all or supporting
> representation
> > >>>>>>
> > >>>>> of a
> > >>>>
> > >>>>> full range of values. That is, you'd rather be able to represent
> NULL
> > >>>>>>
> > >>>>> than
> > >>>>>
> > >>>>>> -2^31?
> > >>>>>>
> > >>>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
> > >>>>>>
> > >>>>>>> Thanks for the thoughtful response (and code!).
> > >>>>>>>>
> > >>>>>>>> I'm thinking I will press forward with a base implementation
> that
> > >>>>>>>>
> > >>>>>>> does
> > >>>>
> > >>>>>  not
> > >>>>>>>> support nulls. The idea is to provide an extensible set of
> > >>>>>>>>
> > >>>>>>> interfaces,
> > >>>>
> > >>>>>  so I
> > >>>>>>>> think this will not box us into a corner later. That is, a
> > >>>>>>>>
> > >>>>>>> mirroring
> > >>>
> > >>>>  package could be implemented that supports null values and accepts
> > >>>>>>>> the relevant trade-offs.
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Nick
> > >>>>>>>>
> > >>>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <
> mcorgan@hotpads.com
> > >
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>   I spent some time this weekend extracting bits of our
> > >>>>>>>>
> > >>>>>>> serialization
> > >>>
> > >>>>  code to
> > >>>>>>>>> a public github repo at
> http://github.com/hotpads/****data-tools
> > <http://github.com/hotpads/**data-tools>
> > >>>>>>>>> <
> > >>>>>>>>>
> > >>>>>>>> http://github.com/hotpads/**data-tools<
> > http://github.com/hotpads/data-tools>
> > >>>>> >
> > >>>>>
> > >>>>>>  .
> > >>>>>>>>>    Contributions are welcome - i'm sure we all have this stuff
> > >>>>>>>>>
> > >>>>>>>> laying
> > >>>
> > >>>>  around.
> > >>>>>>>>>
> > >>>>>>>>> You can see I've bumped into the NULL problem in a few places:
> > >>>>>>>>> *
> > >>>>>>>>>
> > >>>>>>>>> https://github.com/hotpads/****data-tools/blob/master/src/**<
> > https://github.com/hotpads/**data-tools/blob/master/src/**>
> > >>>>>>>>> main/java/com/hotpads/data/****primitive/lists/LongArrayList.**
> > >>>>>>>>> **java<
> > >>>>>>>>>
> > >>>>>>>> https://github.com/hotpads/**data-tools/blob/master/src/**
> > >>> main/java/com/hotpads/data/**primitive/lists/LongArrayList.**java<
> >
> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
> > >
> > >>>
> > >>>>  *
> > >>>>>>>>>
> > >>>>>>>>> https://github.com/hotpads/****data-tools/blob/master/src/**<
> > https://github.com/hotpads/**data-tools/blob/master/src/**>
> > >>>>>>>>> main/java/com/hotpads/data/****types/floats/DoubleByteTool.****
> > >>>>>>>>> java<
> > >>>>>>>>>
> > >>>>>>>> https://github.com/hotpads/**data-tools/blob/master/src/**
> > >>> main/java/com/hotpads/data/**types/floats/DoubleByteTool.**java<
> >
> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
> > >
> > >>>
> > >>>>  Looking back, I think my latest opinion on the topic is to reject
> > >>>>>>>>> nullability as the rule since it can cause unexpected behavior
> > and
> > >>>>>>>>> confusion.  It's cleaner to provide a wrapper class (so both
> > >>>>>>>>> LongArrayList
> > >>>>>>>>> plus NullableLongArrayList) that explicitly defines the
> behavior,
> > >>>>>>>>>
> > >>>>>>>> and
> > >>>>
> > >>>>>  costs
> > >>>>>>>>> a little more in performance.  If the user can't find a
> pre-made
> > >>>>>>>>>
> > >>>>>>>> wrapper
> > >>>>>
> > >>>>>>  class, it's not very difficult for each user to provide their own
> > >>>>>>>>> interpretation of null and check for it themselves.
> > >>>>>>>>>
> > >>>>>>>>> If you reject nullability, the question becomes what to do in
> > >>>>>>>>>
> > >>>>>>>> situations
> > >>>>>
> > >>>>>>  where you're implementing existing interfaces that accept
> nullable
> > >>>>>>>>> params.
> > >>>>>>>>>    The LongArrayList above implements List<Long> which requires
> > an
> > >>>>>>>>> add(Long)
> > >>>>>>>>> method.  In the above implementation I chose to swap nulls with
> > >>>>>>>>> Long.MIN_VALUE, however I'm now thinking it best to force the
> > user
> > >>>>>>>>>
> > >>>>>>>> to
> > >>>>
> > >>>>>  make
> > >>>>>>>>> that swap and then throw IllegalArgumentException if they pass
> > >>>>>>>>>
> > >>>>>>>> null.
> > >>>
> > >>>>
> > >>>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <
> > >>>>>>>>> doug.meil@explorysmedical.com
> > >>>>>>>>>
> > >>>>>>>>>  wrote:
> > >>>>>>>>>> HmmmŠ good question.
> > >>>>>>>>>>
> > >>>>>>>>>> I think that fixed width support is important for a great many
> > >>>>>>>>>>
> > >>>>>>>>> rowkey
> > >>>>
> > >>>>>  constructs cases, so I'd rather see something like losing
> > >>>>>>>>>>
> > >>>>>>>>> MIN_VALUE
> > >>>
> > >>>> and
> > >>>>>
> > >>>>>>  keeping fixed width.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Heya,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thinking about data types and serialization. I think null support is an
> > >>>>>>>>>>> important characteristic for the serialized representations, especially
> > >>>>>>>>>>> when considering the compound type. However, doing so is directly
> > >>>>>>>>>>> incompatible with fixed-width representations for numerics. For instance,
> > >>>>>>>>>>> if we want to have a fixed-width signed long stored on 8-bytes, where do
> > >>>>>>>>>>> you put null? float and double types can cheat a little by folding negative
> > >>>>>>>>>>> and positive NaN's into a single representation (this isn't strictly
> > >>>>>>>>>>> correct!), leaving a place to represent null. In the long example case, the
> > >>>>>>>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This
> > >>>>>>>>>>> will allocate an additional encoding which can be used for null. My
> > >>>>>>>>>>> experience working with scientific data, however, makes me wince at the
> > >>>>>>>>>>> idea.
> > >>>>>>>>>>>
> > >>>>>>>>>>> The variable-width encodings have it a little easier. There's already
> > >>>>>>>>>>> enough going on that it's simpler to make room.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Remember, the final goal is to support order-preserving serialization. This
> > >>>>>>>>>>> imposes some limitations on our encoding strategies. For instance, it's not
> > >>>>>>>>>>> enough to simply encode null, it really needs to be encoded as 0x00 so as
> > >>>>>>>>>>> to sort lexicographically earlier than any other value.
> > >>>>>>>>>>>
> > >>>>>>>>>>> What do you think? Any ideas, experiences, etc?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks,
> > >>>>>>>>>>> Nick
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >
> >
>
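
To make the trade-off in the quoted message above concrete, here is a minimal Java sketch of a fixed-width, order-preserving encoding for a signed long that gives up Long.MIN_VALUE so that eight 0x00 bytes can stand for null. The class and method names (OrderedLongSketch, encode, decode, compareUnsigned) are made up for illustration; they are not part of HBase or of any proposal in this thread, and the choice to sacrifice MIN_VALUE rather than MAX_VALUE is an assumption.

```java
// Hypothetical sketch: 8-byte, order-preserving encoding of a nullable long.
final class OrderedLongSketch {

    // Encodes value as 8 big-endian bytes with the sign bit flipped, so that
    // unsigned lexicographic byte order matches numeric order.
    // null is encoded as eight 0x00 bytes, which is the slot Long.MIN_VALUE
    // would otherwise occupy; MIN_VALUE itself is therefore rejected.
    static byte[] encode(Long value) {
        byte[] out = new byte[8];
        if (value == null) {
            return out; // all zeros sorts before every other encoding
        }
        if (value == Long.MIN_VALUE) {
            throw new IllegalArgumentException("Long.MIN_VALUE is reserved for null");
        }
        long bits = value ^ Long.MIN_VALUE; // flip the sign bit
        for (int i = 7; i >= 0; i--) {
            out[i] = (byte) bits;
            bits >>>= 8;
        }
        return out;
    }

    // Inverse of encode: eight zero bytes decode to null.
    static Long decode(byte[] in) {
        long bits = 0;
        for (int i = 0; i < 8; i++) {
            bits = (bits << 8) | (in[i] & 0xFF);
        }
        if (bits == 0) {
            return null;
        }
        return bits ^ Long.MIN_VALUE;
    }

    // Unsigned lexicographic comparison, the order a byte[]-keyed store uses.
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < a.length && i < b.length; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }
}
```

With this layout the width stays at 8 bytes and null sorts first, at the cost of one representable value, which is exactly the compromise the message is uneasy about.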

Re: HBase Types: Explicit Null Support

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Hiya Nick,
Pig converts data for HBase storage using this class:
https://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseBinaryConverter.java (which
is mostly just calling into HBase's Bytes class). As long as Bytes
handles the null stuff, we'll just inherit the behavior.


On Tue, Apr 2, 2013 at 9:40 AM, Nick Dimiduk <nd...@gmail.com> wrote:

> I agree that a user-extensible interface is a required feature here.
> Personally, I'd love to ship a set of standard GIS tools on HBase. Let's
> keep in mind, though, that SQL and user applications are not the only
> consumers of this interface. A big motivation is allowing interop with the
> other higher MR languages. *cough* Where are my Pig and Hive peeps in this
> thread?
>
> On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <jtaylor@salesforce.com> wrote:
>
> > Maybe if we can keep nullability separate from the
> > serialization/deserialization, we can come up with a solution that works?
> > We're able to essentially infer that a column is null based on its value
> > being missing or empty. So if an iterator through the row key bytes could
> > detect/indicate that, then an application could "infer" the value is null.
> >
> > We're definitely planning on keeping byte[] accessors for use cases that
> > need it. I'm curious on the geographic data case, though, could you use a
> > fixed length long with a couple of new SQL built-ins to encode/decode the
> > latitude/longitude?
> >
> >
> > On 04/01/2013 11:29 PM, Jesse Yates wrote:
> >
> >> Actually, that isn't all that far-fetched of a format Matt - pretty common
> >> anytime anyone wants to do sortable lat/long (*cough* three letter agencies
> >> *cough*).
> >>
> >> Wouldn't we get the same by providing a simple set of libraries (ala
> >> orderly + other HBase useful things) and then still give access to the
> >> underlying byte array? Perhaps a nullable key type in that lib makes sense
> >> if lots of people need it and it would be nice to have standard libraries
> >> so tools could interop much more easily.
> >> -------------------
> >> Jesse Yates
> >> @jesse_yates
> >> jyates.github.com
> >>
> >>
> >> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <mc...@hotpads.com> wrote:
> >>
> >>> Ah, I didn't even realize sql allowed null key parts.  Maybe a goal of the
> >>> interfaces should be to provide first-class support for custom user types
> >>> in addition to the standard ones included.  Part of the power of hbase's
> >>> plain byte[] keys is that users can concoct the perfect key for their data
> >>> type.  For example, I have a lot of geographic data where I interleave
> >>> latitude/longitude bits into a sortable 64 bit value that would probably
> >>> never be included in a standard library.
> >>>
> >>>
> >>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <en...@gmail.com> wrote:
> >>>
> >>>> I think having Int32, and NullableInt32 would support minimum overhead,
> >>>> as well as allowing SQL semantics.
> >>>>
> >>>>
> >>>> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <nd...@gmail.com> wrote:
> >>>>
> >>>>> Furthermore, is it more important to support null values than squeeze all
> >>>>> representations into minimum size (4-bytes for int32, &c.)?
> >>>>> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
> >>>>>
> >>>>>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jtaylor@salesforce.com> wrote:
> >>>>>>
> >>>>>>> From the SQL perspective, handling null is important.
> >>>>>>
> >>>>>> From your perspective, it is critical to support NULLs, even at the
> >>>>>> expense of fixed-width encodings at all or supporting representation of a
> >>>>>> full range of values. That is, you'd rather be able to represent NULL than
> >>>>>> -2^31?
> >>>>>>
> >>>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
> >>>>>>
> >>>>>>>> Thanks for the thoughtful response (and code!).
> >>>>>>>>
> >>>>>>>> I'm thinking I will press forward with a base implementation that does
> >>>>>>>> not support nulls. The idea is to provide an extensible set of
> >>>>>>>> interfaces, so I think this will not box us into a corner later. That
> >>>>>>>> is, a mirroring package could be implemented that supports null values
> >>>>>>>> and accepts the relevant trade-offs.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Nick
> >>>>>>>>
> >>>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mcorgan@hotpads.com> wrote:
> >>>>>>>>
> >>>>>>>>> I spent some time this weekend extracting bits of our serialization
> >>>>>>>>> code to a public github repo at http://github.com/hotpads/data-tools .
> >>>>>>>>> Contributions are welcome - i'm sure we all have this stuff laying
> >>>>>>>>> around.
> >>>>>>>>>
> >>>>>>>>> You can see I've bumped into the NULL problem in a few places:
> >>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
> >>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
> >>>>>>>>>
> >>>>>>>>> Looking back, I think my latest opinion on the topic is to reject
> >>>>>>>>> nullability as the rule since it can cause unexpected behavior and
> >>>>>>>>> confusion.  It's cleaner to provide a wrapper class (so both
> >>>>>>>>> LongArrayList plus NullableLongArrayList) that explicitly defines the
> >>>>>>>>> behavior, and costs a little more in performance.  If the user can't
> >>>>>>>>> find a pre-made wrapper class, it's not very difficult for each user to
> >>>>>>>>> provide their own interpretation of null and check for it themselves.
> >>>>>>>>>
> >>>>>>>>> If you reject nullability, the question becomes what to do in
> >>>>>>>>> situations where you're implementing existing interfaces that accept
> >>>>>>>>> nullable params.  The LongArrayList above implements List<Long> which
> >>>>>>>>> requires an add(Long) method.  In the above implementation I chose to
> >>>>>>>>> swap nulls with Long.MIN_VALUE, however I'm now thinking it best to
> >>>>>>>>> force the user to make that swap and then throw
> >>>>>>>>> IllegalArgumentException if they pass null.
> >>>>>>>>>
> >>>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <doug.meil@explorysmedical.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hmmm... good question.
> >>>>>>>>>>
> >>>>>>>>>> I think that fixed width support is important for a great many rowkey
> >>>>>>>>>> constructs cases, so I'd rather see something like losing MIN_VALUE and
> >>>>>>>>>> keeping fixed width.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Heya,
> >>>>>>>>>>>
> >>>>>>>>>>> Thinking about data types and serialization. I think null support is an
> >>>>>>>>>>> important characteristic for the serialized representations, especially
> >>>>>>>>>>> when considering the compound type. However, doing so is directly
> >>>>>>>>>>> incompatible with fixed-width representations for numerics. For instance,
> >>>>>>>>>>> if we want to have a fixed-width signed long stored on 8-bytes, where do
> >>>>>>>>>>> you put null? float and double types can cheat a little by folding negative
> >>>>>>>>>>> and positive NaN's into a single representation (this isn't strictly
> >>>>>>>>>>> correct!), leaving a place to represent null. In the long example case, the
> >>>>>>>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This
> >>>>>>>>>>> will allocate an additional encoding which can be used for null. My
> >>>>>>>>>>> experience working with scientific data, however, makes me wince at the
> >>>>>>>>>>> idea.
> >>>>>>>>>>>
> >>>>>>>>>>> The variable-width encodings have it a little easier. There's already
> >>>>>>>>>>> enough going on that it's simpler to make room.
> >>>>>>>>>>>
> >>>>>>>>>>> Remember, the final goal is to support order-preserving serialization. This
> >>>>>>>>>>> imposes some limitations on our encoding strategies. For instance, it's not
> >>>>>>>>>>> enough to simply encode null, it really needs to be encoded as 0x00 so as
> >>>>>>>>>>> to sort lexicographically earlier than any other value.
> >>>>>>>>>>>
> >>>>>>>>>>> What do you think? Any ideas, experiences, etc?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Nick
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >
>
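
The Int32 / NullableInt32 split suggested in the quoted thread above could look roughly like the following Java sketch. It is only an illustration under stated assumptions: the class name NullableInt32Sketch, the method names, and the one-byte header scheme (0x00 for null, 0x01 followed by the 4-byte value otherwise) are all hypothetical and are not taken from HBase, orderly, or the data-tools repo.

```java
// Hypothetical sketch: a plain fixed-width int32 encoding plus a nullable
// wrapper that pays one extra header byte instead of giving up a value.
final class NullableInt32Sketch {
    static final byte NULL_MARKER = 0x00;  // sorts before any non-null value
    static final byte VALUE_MARKER = 0x01;

    // Non-nullable form: exactly 4 bytes, sign bit flipped, big-endian, so
    // unsigned byte order matches numeric order over the full int range.
    static byte[] encodeInt32(int value) {
        int bits = value ^ Integer.MIN_VALUE;
        return new byte[] {
            (byte) (bits >>> 24), (byte) (bits >>> 16), (byte) (bits >>> 8), (byte) bits
        };
    }

    // Nullable form: 1 byte for null, 5 bytes otherwise.
    static byte[] encodeNullable(Integer value) {
        if (value == null) {
            return new byte[] { NULL_MARKER };
        }
        byte[] out = new byte[5];
        out[0] = VALUE_MARKER;
        System.arraycopy(encodeInt32(value), 0, out, 1, 4);
        return out;
    }

    // Inverse of encodeNullable.
    static Integer decodeNullable(byte[] in) {
        if (in[0] == NULL_MARKER) {
            return null;
        }
        int bits = 0;
        for (int i = 1; i <= 4; i++) {
            bits = (bits << 8) | (in[i] & 0xFF);
        }
        return bits ^ Integer.MIN_VALUE;
    }
}
```

The non-nullable form keeps the minimum 4-byte size and the full value range, including -2^31; the nullable form keeps the full range as well but is no longer fixed-width, which is the overhead the thread is weighing against SQL semantics.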

Re: HBase Types: Explicit Null Support

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Hiya Nick,
Pig converts data for HBase storage using this class:
https://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseBinaryConverter.java
(which is mostly just calling into HBase's Bytes class). As long as Bytes
handles the null stuff, we'll just inherit the behavior.
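
To make the trade-off in this thread concrete, here is a minimal sketch of
the two options being discussed: a fixed-width, order-preserving 8-byte long
with no room for null, and a nullable wrapper that spends one header byte so
that null encodes as 0x00 and sorts first. The class name NullableLongSketch,
its methods, and the 0x00/0x01 marker bytes are all hypothetical and chosen
for illustration; this is not the code of HBase's Bytes class or of Pig's
HBaseBinaryConverter.

```java
import java.nio.ByteBuffer;

// Illustrative only: names and marker values are assumptions, not an HBase API.
final class NullableLongSketch {
    static final byte NULL_MARKER = 0x00;   // sorts before any value marker
    static final byte VALUE_MARKER = 0x01;

    // Fixed-width form: big-endian with the sign bit flipped, so unsigned
    // lexicographic byte order matches signed numeric order. Every one of
    // the 2^64 bit patterns is a valid long, so nothing is left for null.
    static byte[] encodeLong(long v) {
        return ByteBuffer.allocate(8).putLong(v ^ Long.MIN_VALUE).array();
    }

    static long decodeLong(byte[] b) {
        return ByteBuffer.wrap(b).getLong() ^ Long.MIN_VALUE;
    }

    // Nullable form: null is the single byte 0x00; a value is 0x01 followed
    // by the 8-byte fixed-width form. Costs one extra byte per value.
    static byte[] encodeNullable(Long v) {
        if (v == null) {
            return new byte[] { NULL_MARKER };
        }
        return ByteBuffer.allocate(9).put(VALUE_MARKER).putLong(v ^ Long.MIN_VALUE).array();
    }

    static Long decodeNullable(byte[] b) {
        if (b.length == 1 && b[0] == NULL_MARKER) {
            return null;
        }
        return ByteBuffer.wrap(b, 1, 8).getLong() ^ Long.MIN_VALUE;
    }

    // Unsigned lexicographic comparison, the ordering row keys are sorted by.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        System.out.println("fixed width bytes: " + encodeLong(0L).length);
        System.out.println("null bytes: " + encodeNullable(null).length);
        System.out.println("null sorts first: "
            + (compareBytes(encodeNullable(null), encodeNullable(Long.MIN_VALUE)) < 0));
    }
}
```

The non-nullable encoder keeps the full range of long in 8 bytes, while the
nullable one gives up fixed minimum width instead of giving up MIN_VALUE,
which is roughly the Int32 / NullableInt32 split suggested elsewhere in the
thread.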


On Tue, Apr 2, 2013 at 9:40 AM, Nick Dimiduk <nd...@gmail.com> wrote:

> I agree that a user-extensible interface is a required feature here.
> Personally, I'd love to ship a set of standard GIS tools on HBase. Let's
> keep in mind, though, that SQL and user applications are not the only
> consumers of this interface. A big motivation is allowing interop with the
> other higher MR languages. *cough* Where are my Pig and Hive peeps in this
> thread?
>
> On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <jtaylor@salesforce.com> wrote:
>
> > Maybe if we can keep nullability separate from the
> > serialization/deserialization, we can come up with a solution that works?
> > We're able to essentially infer that a column is null based on its value
> > being missing or empty. So if an iterator through the row key bytes could
> > detect/indicate that, then an application could "infer" the value is null.
> >
> > We're definitely planning on keeping byte[] accessors for use cases that
> > need it. I'm curious on the geographic data case, though, could you use a
> > fixed length long with a couple of new SQL built-ins to encode/decode the
> > latitude/longitude?
> >
> >
> > On 04/01/2013 11:29 PM, Jesse Yates wrote:
> >
> >> Actually, that isn't all that far-fetched of a format Matt - pretty common
> >> anytime anyone wants to do sortable lat/long (*cough* three letter
> >> agencies
> >> cough*).
> >>
> >> Wouldn't we get the same by providing a simple set of libraries (ala
> >> orderly + other HBase useful things) and then still give access to the
> >> underlying byte array? Perhaps a nullable key type in that lib makes sense
> >> if lots of people need it and it would be nice to have standard libraries
> >> so tools could interop much more easily.
> >> -------------------
> >> Jesse Yates
> >> @jesse_yates
> >> jyates.github.com
> >>
> >>
> >> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <mc...@hotpads.com> wrote:
> >>
> >>> Ah, I didn't even realize sql allowed null key parts.  Maybe a goal of
> >>> the interfaces should be to provide first-class support for custom user
> >>> types in addition to the standard ones included.  Part of the power of
> >>> hbase's plain byte[] keys is that users can concoct the perfect key for
> >>> their data type.  For example, I have a lot of geographic data where I
> >>> interleave latitude/longitude bits into a sortable 64 bit value that
> >>> would probably never be included in a standard library.
> >>>
> >>>
> >>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <en...@gmail.com>
> >>> wrote:
> >>>
> >>>> I think having Int32, and NullableInt32 would support minimum overhead,
> >>>> as well as allowing SQL semantics.
> >>>>
> >>>>
> >>>> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <nd...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Furthermore, is it more important to support null values than squeeze
> >>>>> all representations into minimum size (4-bytes for int32, &c.)?
> >>>>> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
> >>>>>
> >>>>>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jtaylor@salesforce.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> From the SQL perspective, handling null is important.
> >>>>>>
> >>>>>> From your perspective, it is critical to support NULLs, even at the
> >>>>>> expense of fixed-width encodings at all or supporting representation
> >>>>>> of a full range of values. That is, you'd rather be able to represent
> >>>>>> NULL than -2^31?
> >>>>>>
> >>>>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
> >>>>>>>
> >>>>>>>> Thanks for the thoughtful response (and code!).
> >>>>>>>>
> >>>>>>>> I'm thinking I will press forward with a base implementation that
> >>>>>>>> does not support nulls. The idea is to provide an extensible set of
> >>>>>>>> interfaces, so I think this will not box us into a corner later.
> >>>>>>>> That is, a mirroring package could be implemented that supports null
> >>>>>>>> values and accepts the relevant trade-offs.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Nick
> >>>>>>>>
> >>>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mcorgan@hotpads.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> I spent some time this weekend extracting bits of our serialization
> >>>>>>>>> code to a public github repo at http://github.com/hotpads/data-tools .
> >>>>>>>>> Contributions are welcome - I'm sure we all have this stuff laying
> >>>>>>>>> around.
> >>>>>>>>>
> >>>>>>>>> You can see I've bumped into the NULL problem in a few places:
> >>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
> >>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
> >>>>>>>>>
> >>>>>>>>> Looking back, I think my latest opinion on the topic is to reject
> >>>>>>>>> nullability as the rule since it can cause unexpected behavior and
> >>>>>>>>> confusion.  It's cleaner to provide a wrapper class (so both
> >>>>>>>>> LongArrayList plus NullableLongArrayList) that explicitly defines the
> >>>>>>>>> behavior, and costs a little more in performance.  If the user can't
> >>>>>>>>> find a pre-made wrapper class, it's not very difficult for each user
> >>>>>>>>> to provide their own interpretation of null and check for it
> >>>>>>>>> themselves.
> >>>>>>>>>
> >>>>>>>>> If you reject nullability, the question becomes what to do in
> >>>>>>>>> situations where you're implementing existing interfaces that accept
> >>>>>>>>> nullable params.  The LongArrayList above implements List<Long> which
> >>>>>>>>> requires an add(Long) method.  In the above implementation I chose to
> >>>>>>>>> swap nulls with Long.MIN_VALUE, however I'm now thinking it best to
> >>>>>>>>> force the user to make that swap and then throw
> >>>>>>>>> IllegalArgumentException if they pass null.
> >>>>>>>>>
> >>>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <doug.meil@explorysmedical.com>
> >>>>>>>>> wrote:
> >>>>>>>>>> Hmmm... good question.
> >>>>>>>>>>
> >>>>>>>>>> I think that fixed width support is important for a great many rowkey
> >>>>>>>>>> constructs cases, so I'd rather see something like losing MIN_VALUE
> >>>>>>>>>> and keeping fixed width.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Heya,
> >>>>>>>>>>>
> >>>>>>>>>>> Thinking about data types and serialization. I think null support is an
> >>>>>>>>>>> important characteristic for the serialized representations, especially
> >>>>>>>>>>> when considering the compound type. However, doing so is directly
> >>>>>>>>>>> incompatible with fixed-width representations for numerics. For instance,
> >>>>>>>>>>> if we want to have a fixed-width signed long stored on 8-bytes, where do
> >>>>>>>>>>> you put null? float and double types can cheat a little by folding negative
> >>>>>>>>>>> and positive NaN's into a single representation (this isn't strictly
> >>>>>>>>>>> correct!), leaving a place to represent null. In the long example case, the
> >>>>>>>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This
> >>>>>>>>>>> will allocate an additional encoding which can be used for null. My
> >>>>>>>>>>> experience working with scientific data, however, makes me wince at the
> >>>>>>>>>>> idea.
> >>>>>>>>>>>
> >>>>>>>>>>> The variable-width encodings have it a little easier. There's already
> >>>>>>>>>>> enough going on that it's simpler to make room.
> >>>>>>>>>>>
> >>>>>>>>>>> Remember, the final goal is to support order-preserving serialization. This
> >>>>>>>>>>> imposes some limitations on our encoding strategies. For instance, it's not
> >>>>>>>>>>> enough to simply encode null, it really needs to be encoded as 0x00 so as
> >>>>>>>>>>> to sort lexicographically earlier than any other value.
> >>>>>>>>>>>
> >>>>>>>>>>> What do you think? Any ideas, experiences, etc?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Nick
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >
>

Re: HBase Types: Explicit Null Support

Posted by Nick Dimiduk <nd...@gmail.com>.

I agree that a user-extensible interface is a required feature here.
Personally, I'd love to ship a set of standard GIS tools on HBase. Let's
keep in mind, though, that SQL and user applications are not the only
consumers of this interface. A big motivation is allowing interop with the
other higher MR languages. *cough* Where are my Pig and Hive peeps in this
thread?

On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <jt...@salesforce.com> wrote:

> Maybe if we can keep nullability separate from the
> serialization/deserialization, we can come up with a solution that works?
> We're able to essentially infer that a column is null based on its value
> being missing or empty. So if an iterator through the row key bytes could
> detect/indicate that, then an application could "infer" the value is null.
>
> We're definitely planning on keeping byte[] accessors for use cases that
> need it. I'm curious on the geographic data case, though, could you use a
> fixed length long with a couple of new SQL built-ins to encode/decode the
> latitude/longitude?
>
>
> On 04/01/2013 11:29 PM, Jesse Yates wrote:
>
>> Actually, that isn't all that far-fetched of a format Matt - pretty common
>> anytime anyone wants to do sortable lat/long (*cough* three letter
>> agencies
>> cough*).
>>
>> Wouldn't we get the same by providing a simple set of libraries (ala
>> orderly + other HBase useful things) and then still give access to the
>> underlying byte array? Perhaps a nullable key type in that lib makes sense
>> if lots of people need it and it would be nice to have standard libraries
>> so tools could interop much more easily.
>> -------------------
>> Jesse Yates
>> @jesse_yates
>> jyates.github.com
>>
>>
>> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <mc...@hotpads.com> wrote:
>>
>>  Ah, I didn't even realize sql allowed null key parts.  Maybe a goal of
>>> the
>>> interfaces should be to provide first-class support for custom user types
>>> in addition to the standard ones included.  Part of the power of hbase's
>>> plain byte[] keys is that users can concoct the perfect key for their
>>> data
>>> type.  For example, I have a lot of geographic data where I interleave
>>> latitude/longitude bits into a sortable 64 bit value that would probably
>>> never be included in a standard library.
>>>
>>>
>>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <en...@gmail.com>
>>> wrote:
>>>
>>>> I think having Int32, and NullableInt32 would support minimum overhead,
>>>> as well as allowing SQL semantics.
>>>>
>>>>
>>>> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <nd...@gmail.com>
>>>> wrote:
>>>>
>>>>> Furthermore, is it more important to support null values than squeeze
>>>>> all representations into minimum size (4-bytes for int32, &c.)?
>>>>> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
>>>>>
>>>>>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jtaylor@salesforce.com>
>>>>>> wrote:
>>>>>>
>>>>>>> From the SQL perspective, handling null is important.
>>>>>>
>>>>>> From your perspective, it is critical to support NULLs, even at the
>>>>>> expense of fixed-width encodings at all or supporting representation
>>>>>> of a full range of values. That is, you'd rather be able to represent
>>>>>> NULL than -2^31?
>>>>>>
>>>>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
>>>>>>>
>>>>>>>> Thanks for the thoughtful response (and code!).
>>>>>>>>
>>>>>>>> I'm thinking I will press forward with a base implementation that
>>>>>>>> does not support nulls. The idea is to provide an extensible set of
>>>>>>>> interfaces, so I think this will not box us into a corner later.
>>>>>>>> That is, a mirroring package could be implemented that supports null
>>>>>>>> values and accepts the relevant trade-offs.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mc...@hotpads.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I spent some time this weekend extracting bits of our serialization
>>>>>>>>> code to a public github repo at http://github.com/hotpads/data-tools .
>>>>>>>>> Contributions are welcome - I'm sure we all have this stuff laying
>>>>>>>>> around.
>>>>>>>>>
>>>>>>>>> You can see I've bumped into the NULL problem in a few places:
>>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
>>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>>>>>>>>>
>>>>>>>>> Looking back, I think my latest opinion on the topic is to reject
>>>>>>>>> nullability as the rule since it can cause unexpected behavior and
>>>>>>>>> confusion.  It's cleaner to provide a wrapper class (so both
>>>>>>>>> LongArrayList plus NullableLongArrayList) that explicitly defines the
>>>>>>>>> behavior, and costs a little more in performance.  If the user can't
>>>>>>>>> find a pre-made wrapper class, it's not very difficult for each user
>>>>>>>>> to provide their own interpretation of null and check for it
>>>>>>>>> themselves.
>>>>>>>>>
>>>>>>>>> If you reject nullability, the question becomes what to do in
>>>>>>>>> situations where you're implementing existing interfaces that accept
>>>>>>>>> nullable params.  The LongArrayList above implements List<Long> which
>>>>>>>>> requires an add(Long) method.  In the above implementation I chose to
>>>>>>>>> swap nulls with Long.MIN_VALUE, however I'm now thinking it best to
>>>>>>>>> force the user to make that swap and then throw
>>>>>>>>> IllegalArgumentException if they pass null.
>>>>>>>>>
>>>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <doug.meil@explorysmedical.com>
>>>>>>>>> wrote:
>>>>>>>>>> Hmmm... good question.
>>>>>>>>>>
>>>>>>>>>> I think that fixed width support is important for a great many rowkey
>>>>>>>>>> constructs cases, so I'd rather see something like losing MIN_VALUE
>>>>>>>>>> and keeping fixed width.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Heya,
>>>>>>>>>>>
>>>>>>>>>>> Thinking about data types and serialization. I think null support is an
>>>>>>>>>>> important characteristic for the serialized representations, especially
>>>>>>>>>>> when considering the compound type. However, doing so is directly
>>>>>>>>>>> incompatible with fixed-width representations for numerics. For instance,
>>>>>>>>>>> if we want to have a fixed-width signed long stored on 8-bytes, where do
>>>>>>>>>>> you put null? float and double types can cheat a little by folding negative
>>>>>>>>>>> and positive NaN's into a single representation (this isn't strictly
>>>>>>>>>>> correct!), leaving a place to represent null. In the long example case, the
>>>>>>>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This
>>>>>>>>>>> will allocate an additional encoding which can be used for null. My
>>>>>>>>>>> experience working with scientific data, however, makes me wince at the
>>>>>>>>>>> idea.
>>>>>>>>>>>
>>>>>>>>>>> The variable-width encodings have it a little easier. There's already
>>>>>>>>>>> enough going on that it's simpler to make room.
>>>>>>>>>>>
>>>>>>>>>>> Remember, the final goal is to support order-preserving serialization. This
>>>>>>>>>>> imposes some limitations on our encoding strategies. For instance, it's not
>>>>>>>>>>> enough to simply encode null, it really needs to be encoded as 0x00 so as
>>>>>>>>>>> to sort lexicographically earlier than any other value.
>>>>>>>>>>>
>>>>>>>>>>> What do you think? Any ideas, experiences, etc?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Nick
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>

Re: HBase Types: Explicit Null Support

Posted by Nick Dimiduk <nd...@gmail.com>.

I agree that a user-extensible interface is a required feature here.
Personally, I'd love to ship a set of standard GIS tools on HBase. Let's
keep in mind, though, that SQL and user applications are not the only
consumers of this interface. A big motivation is allowing interop with the
other higher MR languages. *cough* Where are my Pig and Hive peeps in this
thread?

On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <jt...@salesforce.com> wrote:

> Maybe if we can keep nullability separate from the
> serialization/deserialization, we can come up with a solution that works?
> We're able to essentially infer that a column is null based on its value
> being missing or empty. So if an iterator through the row key bytes could
> detect/indicate that, then an application could "infer" the value is null.
>
> We're definitely planning on keeping byte[] accessors for use cases that
> need it. I'm curious on the geographic data case, though, could you use a
> fixed length long with a couple of new SQL built-ins to encode/decode the
> latitude/longitude?
>
>
> On 04/01/2013 11:29 PM, Jesse Yates wrote:
>
>> Actually, that isn't all that far-fetched of a format Matt - pretty common
>> anytime anyone wants to do sortable lat/long (*cough* three letter
>> agencies
>> cough*).
>>
>> Wouldn't we get the same by providing a simple set of libraries (ala
>> orderly + other HBase useful things) and then still give access to the
>> underlying byte array? Perhaps a nullable key type in that lib makes sense
>> if lots of people need it and it would be nice to have standard libraries
>> so tools could interop much more easily.
>> -------------------
>> Jesse Yates
>> @jesse_yates
>> jyates.github.com
>>
>>
>> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <mc...@hotpads.com> wrote:
>>
>>> Ah, I didn't even realize sql allowed null key parts.  Maybe a goal of the
>>> interfaces should be to provide first-class support for custom user types
>>> in addition to the standard ones included.  Part of the power of hbase's
>>> plain byte[] keys is that users can concoct the perfect key for their data
>>> type.  For example, I have a lot of geographic data where I interleave
>>> latitude/longitude bits into a sortable 64 bit value that would probably
>>> never be included in a standard library.
>>>
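
One common way to build such a key is a Z-order (Morton) curve. The sketch below is an assumption about the general technique, not the hotpads implementation; `GeoKey` and its methods are hypothetical names, and scaling each coordinate to 32 bits is a choice made here for illustration.

```java
import java.nio.ByteBuffer;

/** Hypothetical Z-order (Morton) key for lat/long; not taken from the hotpads code. */
final class GeoKey {
    /** Maps a coordinate in [min, max] onto an unsigned 32-bit integer, preserving order. */
    static long scale(double value, double min, double max) {
        return (long) ((value - min) / (max - min) * 0xFFFFFFFFL);
    }

    /** Spreads the low 32 bits of x so they occupy the even bit positions of a long. */
    static long spread(long x) {
        x &= 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8))  & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2))  & 0x3333333333333333L;
        x = (x | (x << 1))  & 0x5555555555555555L;
        return x;
    }

    /** Interleaves latitude and longitude bits, with latitude in the odd (higher) positions. */
    static long interleave(double lat, double lon) {
        return (spread(scale(lat, -90.0, 90.0)) << 1) | spread(scale(lon, -180.0, 180.0));
    }

    /** Big-endian bytes of the interleaved value, usable as (part of) a row key. */
    static byte[] toBytes(double lat, double lon) {
        return ByteBuffer.allocate(8).putLong(interleave(lat, lon)).array();
    }
}
```

The interleaved value has to be compared as an unsigned 64-bit number, which is what a lexicographic comparison of the big-endian bytes does. There is no spare bit pattern left over for null, which is why a type like this sits awkwardly with a nullable standard library.
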
>>>
>>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <en...@gmail.com> wrote:
>>>
>>>> I think having Int32, and NullableInt32 would support minimum overhead, as
>>>> well as allowing SQL semantics.
>>>>
>>>>
>>>> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <nd...@gmail.com> wrote:
>>>>
>>>>> Furthermore, is it more important to support null values than squeeze all
>>>>> representations into minimum size (4-bytes for int32, &c.)?
>>>>>
>>>>> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
>>>>>
>>>>>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jtaylor@salesforce.com>
>>>>>> wrote:
>>>>>>
>>>>>>> From the SQL perspective, handling null is important.
>>>>>>
>>>>>> From your perspective, it is critical to support NULLs, even at the
>>>>>> expense of fixed-width encodings at all or supporting representation of a
>>>>>> full range of values. That is, you'd rather be able to represent NULL than
>>>>>> -2^31?
>>>>>>
>>>>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
>>>>>>>
>>>>>>>> Thanks for the thoughtful response (and code!).
>>>>>>>>
>>>>>>>> I'm thinking I will press forward with a base implementation that does not
>>>>>>>> support nulls. The idea is to provide an extensible set of interfaces, so I
>>>>>>>> think this will not box us into a corner later. That is, a mirroring
>>>>>>>> package could be implemented that supports null values and accepts
>>>>>>>> the relevant trade-offs.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mc...@hotpads.com> wrote:
>>>>>>>>
>>>>>>>>> I spent some time this weekend extracting bits of our serialization code
>>>>>>>>> to a public github repo at http://github.com/hotpads/data-tools .
>>>>>>>>> Contributions are welcome - i'm sure we all have this stuff laying around.
>>>>>>>>>
>>>>>>>>> You can see I've bumped into the NULL problem in a few places:
>>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
>>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>>>>>>>>>
>>>>>>>>> Looking back, I think my latest opinion on the topic is to reject
>>>>>>>>> nullability as the rule since it can cause unexpected behavior and
>>>>>>>>> confusion.  It's cleaner to provide a wrapper class (so both LongArrayList
>>>>>>>>> plus NullableLongArrayList) that explicitly defines the behavior, and costs
>>>>>>>>> a little more in performance.  If the user can't find a pre-made wrapper
>>>>>>>>> class, it's not very difficult for each user to provide their own
>>>>>>>>> interpretation of null and check for it themselves.
>>>>>>>>>
>>>>>>>>> If you reject nullability, the question becomes what to do in situations
>>>>>>>>> where you're implementing existing interfaces that accept nullable params.
>>>>>>>>> The LongArrayList above implements List<Long> which requires an add(Long)
>>>>>>>>> method.  In the above implementation I chose to swap nulls with
>>>>>>>>> Long.MIN_VALUE, however I'm now thinking it best to force the user to make
>>>>>>>>> that swap and then throw IllegalArgumentException if they pass null.
>>>>>>>>>
>>>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <doug.meil@explorysmedical.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hmmm... good question.
>>>>>>>>>>
>>>>>>>>>> I think that fixed width support is important for a great many rowkey
>>>>>>>>>> constructs cases, so I'd rather see something like losing MIN_VALUE and
>>>>>>>>>> keeping fixed width.
>>>>>>>>>>
>>>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Heya,
>>>>>>>>>>>
>>>>>>>>>>> Thinking about data types and serialization. I think null support is an
>>>>>>>>>>> important characteristic for the serialized representations, especially
>>>>>>>>>>> when considering the compound type. However, doing so is directly
>>>>>>>>>>> incompatible with fixed-width representations for numerics. For instance,
>>>>>>>>>>> if we want to have a fixed-width signed long stored on 8-bytes, where do
>>>>>>>>>>> you put null? float and double types can cheat a little by folding negative
>>>>>>>>>>> and positive NaN's into a single representation (this isn't strictly
>>>>>>>>>>> correct!), leaving a place to represent null. In the long example case, the
>>>>>>>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This
>>>>>>>>>>> will allocate an additional encoding which can be used for null. My
>>>>>>>>>>> experience working with scientific data, however, makes me wince at the
>>>>>>>>>>> idea.
>>>>>>>>>>>
>>>>>>>>>>> The variable-width encodings have it a little easier. There's already
>>>>>>>>>>> enough going on that it's simpler to make room.
>>>>>>>>>>>
>>>>>>>>>>> Remember, the final goal is to support order-preserving serialization. This
>>>>>>>>>>> imposes some limitations on our encoding strategies. For instance, it's not
>>>>>>>>>>> enough to simply encode null, it really needs to be encoded as 0x00 so as
>>>>>>>>>>> to sort lexicographically earlier than any other value.
>>>>>>>>>>>
>>>>>>>>>>> What do you think? Any ideas, experiences, etc?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Nick
>

Re: HBase Types: Explicit Null Support

Posted by Nick Dimiduk <nd...@gmail.com>.

On Thu, Apr 4, 2013 at 7:18 PM, James Taylor <jt...@salesforce.com> wrote:

> Would it make sense to clean up the APIs a bit and post just the type
> system code somewhere to give us something to poke holes at?
>

That could be useful. I've been experimenting with implementations as I
update the spec doc and pushing as I go to
https://github.com/ndimiduk/serialization-play. I can make you a
collaborator or you can host your own repository, as you prefer.

> On 04/04/2013 06:49 PM, Nick Dimiduk wrote:
>
>> On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <jtaylor@salesforce.com> wrote:
>>
>>> Maybe if we can keep nullability separate from the
>>> serialization/deserialization, we can come up with a solution that works?
>>>
>>
>> I think implied null could work, but let's build out the matrix. I see two
>> kinds of types: fixed- and variable-width. These types are used in two
>> scenarios: on their own or as part of a compound type.
>>
>> A fixed-width type used standalone can infer null from absence of a value.
>> When used in a compound type, absence isn't enough to indicate null unless
>> it's the last value in the sequence. To support a null field in the middle
>> of the compound type, it is forced to explicitly mark the field as null.
>> The only solution I can think of (without sacrificing the full value range,
>> per my original question) is to write the full type width bytes, followed
>> by an isNull byte. Thus, for example, the INT type consumes 4 bytes when
>> serialized stand-alone, but 5 bytes when composed.
>>
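
The 4-byte versus 5-byte layout described above can be sketched as follows. `NullableIntField` is a hypothetical name, and flipping the sign bit so that byte order matches numeric order is an assumption added here, not something stated in the thread.

```java
import java.nio.ByteBuffer;

/** Hypothetical fixed-width INT field: 4 bytes standalone, 5 bytes inside a compound key. */
final class NullableIntField {
    static final int STANDALONE_WIDTH = 4;
    static final int COMPOSED_WIDTH = 5;

    /** Standalone form: value only; null would be expressed by the absence of the value. */
    static byte[] encodeStandalone(int v) {
        return ByteBuffer.allocate(STANDALONE_WIDTH).putInt(v ^ Integer.MIN_VALUE).array();
    }

    /** Composed form: the full-width value bytes followed by a one-byte isNull flag. */
    static byte[] encodeComposed(Integer v) {
        ByteBuffer buf = ByteBuffer.allocate(COMPOSED_WIDTH);
        buf.putInt(v == null ? 0 : v ^ Integer.MIN_VALUE);   // placeholder bytes when null
        buf.put((byte) (v == null ? 1 : 0));                 // isNull flag
        return buf.array();
    }

    /** Reads one composed field starting at offset; returns null when the flag is set. */
    static Integer decodeComposed(byte[] bytes, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(bytes, offset, COMPOSED_WIDTH);
        int raw = buf.getInt();
        boolean isNull = buf.get() != 0;
        return isNull ? null : Integer.valueOf(raw ^ Integer.MIN_VALUE);
    }
}
```

Because the flag trails the value bytes, a null in this layout sorts immediately after Integer.MIN_VALUE rather than before every value. Writing the flag first would sort nulls first at the cost of the same extra byte.
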
>> James, how does Phoenix handle a null fixed-width rowkey component? I don't
>> see that implemented in PDataType enum.
>>
>> Variable-width types used standalone are simple enough because HBase handles
>> arbitrary length byte[]'s everywhere. Variable-width in composite is a
>> problem. Phoenix forces these values to only appear as the last position in
>> the composite, as I understand it. Orderly provides explicit null and
>> termination bytes by taking advantage of a feature of UTF-8 encoding.
>> Support for bytes is equally ugly (but clever) in that byte digits are
>> encoded in BCD. Both of these approaches slightly bloat the serialized
>> representation over the natural representation, but they allow the
>> variable-length types to be used anywhere within the compound type. As an
>> added bonus regarding code maintainability, their serialization is entirely
>> self-contained within the type. That's in contrast to the fixed-width type
>> implementation described above, where null is explicitly encoded by the
>> compound type.
>>
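
The UTF-8 trick can be sketched like this. Valid UTF-8 never contains the bytes 0xFE or 0xFF, so every byte can be shifted up by 2 without overflow, which frees 0x00 for null and 0x01 for a terminator. This is a sketch of the general idea under that assumption, not Orderly's actual code; `ShiftedUtf8` is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical self-terminating, nullable, order-preserving string encoding. */
final class ShiftedUtf8 {
    static final byte NULL = 0x00;        // sorts before every non-null value
    static final byte TERMINATOR = 0x01;  // sorts before every shifted data byte (all >= 0x02)

    static byte[] encode(String s) {
        if (s == null) {
            return new byte[] { NULL };
        }
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[utf8.length + 1];
        for (int i = 0; i < utf8.length; i++) {
            out[i] = (byte) ((utf8[i] & 0xFF) + 2);   // shift up; cannot overflow for valid UTF-8
        }
        out[utf8.length] = TERMINATOR;
        return out;
    }

    /** Decodes one value starting at offset; returns null for the NULL marker. */
    static String decode(byte[] bytes, int offset) {
        if (bytes[offset] == NULL) {
            return null;
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (int i = offset; bytes[i] != TERMINATOR; i++) {
            buf.write((bytes[i] & 0xFF) - 2);         // undo the shift
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The cost is one extra byte per value, and in exchange the field carries its own null marker and end marker, so it can appear at any position in a compound key.
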
>> My opinion is the computational and storage overhead imposed by Orderly's
>> implementation are worth the trade-off in flexibility in user consumption.
>> Correct me if I'm wrong James, but you're saying, from your experience with
>> Phoenix, users are willing to work within that constraint?
>>
>> Thanks,
>> Nick
>>
>>> On 04/01/2013 11:29 PM, Jesse Yates wrote:
>>>
>>>> Actually, that isn't all that far-fetched of a format Matt - pretty common
>>>> anytime anyone wants to do sortable lat/long (*cough* three letter agencies
>>>> cough*).
>>>>
>>>> Wouldn't we get the same by providing a simple set of libraries (ala
>>>> orderly + other HBase useful things) and then still give access to the
>>>> underlying byte array? Perhaps a nullable key type in that lib makes sense
>>>> if lots of people need it and it would be nice to have standard libraries
>>>> so tools could interop much more easily.
>>>> -------------------
>>>> Jesse Yates
>>>> @jesse_yates
>>>> jyates.github.com
>>>>
>>>>
>>>> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <mc...@hotpads.com> wrote:
>>>>
>>>>> Ah, I didn't even realize sql allowed null key parts.  Maybe a goal of the
>>>>> interfaces should be to provide first-class support for custom user types
>>>>> in addition to the standard ones included.  Part of the power of hbase's
>>>>> plain byte[] keys is that users can concoct the perfect key for their data
>>>>> type.  For example, I have a lot of geographic data where I interleave
>>>>> latitude/longitude bits into a sortable 64 bit value that would probably
>>>>> never be included in a standard library.
>>>>>
>>>>>
>>>>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <en...@gmail.com> wrote:
>>>>>
>>>>>> I think having Int32, and NullableInt32 would support minimum overhead, as
>>>>>> well as allowing SQL semantics.
>>>>>>
>>>>>>
>>>>>> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <nd...@gmail.com> wrote:
>>>>>>
>>>>>>> Furthermore, is it more important to support null values than squeeze all
>>>>>>> representations into minimum size (4-bytes for int32, &c.)?
>>>>>>>
>>>>>>> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
>>>>>>>
>>>>>>>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jtaylor@salesforce.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> From the SQL perspective, handling null is important.
>>>>>>>>
>>>>>>>> From your perspective, it is critical to support NULLs, even at the
>>>>>>>> expense of fixed-width encodings at all or supporting representation of a
>>>>>>>> full range of values. That is, you'd rather be able to represent NULL than
>>>>>>>> -2^31?
>>>>>>>>
>>>>>>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the thoughtful response (and code!).
>>>>>>>>>>
>>>>>>>>>> I'm thinking I will press forward with a base implementation that does not
>>>>>>>>>> support nulls. The idea is to provide an extensible set of interfaces, so I
>>>>>>>>>> think this will not box us into a corner later. That is, a mirroring
>>>>>>>>>> package could be implemented that supports null values and accepts
>>>>>>>>>> the relevant trade-offs.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Nick
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mcorgan@hotpads.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I spent some time this weekend extracting bits of our serialization code
>>>>>>>>>>> to a public github repo at http://github.com/hotpads/data-tools .
>>>>>>>>>>> Contributions are welcome - i'm sure we all have this stuff laying around.
>>>>>>>>>>>
>>>>>>>>>>> You can see I've bumped into the NULL problem in a few places:
>>>>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
>>>>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>>>>>>>>>>>
>>>>>>>>>>> Looking back, I think my latest opinion on the topic is to reject
>>>>>>>>>>> nullability as the rule since it can cause unexpected behavior and
>>>>>>>>>>> confusion.  It's cleaner to provide a wrapper class (so both LongArrayList
>>>>>>>>>>> plus NullableLongArrayList) that explicitly defines the behavior, and costs
>>>>>>>>>>> a little more in performance.  If the user can't find a pre-made wrapper
>>>>>>>>>>> class, it's not very difficult for each user to provide their own
>>>>>>>>>>> interpretation of null and check for it themselves.
>>>>>>>>>>>
>>>>>>>>>>> If you reject nullability, the question becomes what to do in situations
>>>>>>>>>>> where you're implementing existing interfaces that accept nullable params.
>>>>>>>>>>> The LongArrayList above implements List<Long> which requires an add(Long)
>>>>>>>>>>> method.  In the above implementation I chose to swap nulls with
>>>>>>>>>>> Long.MIN_VALUE, however I'm now thinking it best to force the user to make
>>>>>>>>>>> that swap and then throw IllegalArgumentException if they pass null.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <doug.meil@explorysmedical.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hmmm... good question.
>>>>>>>>>>>>
>>>>>>>>>>>> I think that fixed width support is important for a great many rowkey
>>>>>>>>>>>> constructs cases, so I'd rather see something like losing MIN_VALUE and
>>>>>>>>>>>> keeping fixed width.
>>>>>>>>>>>>
>>>>>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Heya,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thinking about data types and serialization. I think null support is an
>>>>>>>>>>>>> important characteristic for the serialized representations, especially
>>>>>>>>>>>>> when considering the compound type. However, doing so is directly
>>>>>>>>>>>>> incompatible with fixed-width representations for numerics. For instance,
>>>>>>>>>>>>> if we want to have a fixed-width signed long stored on 8-bytes, where do
>>>>>>>>>>>>> you put null? float and double types can cheat a little by folding negative
>>>>>>>>>>>>> and positive NaN's into a single representation (this isn't strictly
>>>>>>>>>>>>> correct!), leaving a place to represent null. In the long example case, the
>>>>>>>>>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This
>>>>>>>>>>>>> will allocate an additional encoding which can be used for null. My
>>>>>>>>>>>>> experience working with scientific data, however, makes me wince at the
>>>>>>>>>>>>> idea.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The variable-width encodings have it a little easier. There's already
>>>>>>>>>>>>> enough going on that it's simpler to make room.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Remember, the final goal is to support order-preserving serialization. This
>>>>>>>>>>>>> imposes some limitations on our encoding strategies. For instance, it's not
>>>>>>>>>>>>> enough to simply encode null, it really needs to be encoded as 0x00 so as
>>>>>>>>>>>>> to sort lexicographically earlier than any other value.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What do you think? Any ideas, experiences, etc?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Nick
>

Re: HBase Types: Explicit Null Support

Posted by James Taylor <jt...@salesforce.com>.

With Phoenix, variable width types may be null in all cases (in the row 
key or as key values) and fixed width types may be null as key values or 
as the last row key column. We only allow a binary type in the row key 
as the last column. We haven't had any push back on these restrictions 
to date.
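
The restrictions above can be written down as a small layout check. This is only an illustration of the rules as stated in this message; `RowKeyRules`, `Column`, and `Width` are hypothetical names and not part of Phoenix's API.

```java
import java.util.List;

/** Hypothetical check of the row key restrictions described above; not Phoenix's API. */
final class RowKeyRules {
    enum Width { FIXED, VARIABLE, BINARY }

    static final class Column {
        final String name;
        final Width width;
        final boolean nullable;

        Column(String name, Width width, boolean nullable) {
            this.name = name;
            this.width = width;
            this.nullable = nullable;
        }
    }

    /** Returns true when the row key layout respects the stated restrictions. */
    static boolean isValid(List<Column> rowKey) {
        for (int i = 0; i < rowKey.size(); i++) {
            Column c = rowKey.get(i);
            boolean last = (i == rowKey.size() - 1);
            if (c.width == Width.BINARY && !last) {
                return false;   // a binary type is only allowed as the last column
            }
            if (c.width == Width.FIXED && c.nullable && !last) {
                return false;   // a fixed-width column may be null only in the last position
            }
            // variable-width columns may be null anywhere in the row key
        }
        return true;
    }
}
```
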

Would it make sense to clean up the APIs a bit and post just the type 
system code somewhere to give us something to poke holes at?

Thanks,

     James

On 04/04/2013 06:49 PM, Nick Dimiduk wrote:
> On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <jt...@salesforce.com> wrote:
>
>> Maybe if we can keep nullability separate from the
>> serialization/deserialization, we can come up with a solution that works?
>
> I think implied null could work, but let's build out the matrix. I see two
> kinds of types: fixed- and variable-width. These types are used in two
> scenarios: on their own or as part of a compound type.
>
> A fixed-width type used standalone can infer null from absence of a value.
> When used in a compound type, absence isn't enough to indicate null unless
> it's the last value in the sequence. To support a null field in the middle
> of the compound type, it is forced to explicitly mark the field as null.
> The only solution I can think of (without sacrificing the full value range,
> per my original question) is to write the full type width bytes, followed
> by an isNull byte. Thus, for example, the INT type consumes 4 bytes when
> serialized stand-alone, but 5 bytes when composed.
>
> James, how does Phoenix handle a null fixed-width rowkey component? I don't
> see that implemented in PDataType enum.
>
> Variable-width used standalone are simple enough because HBase handles
> arbitrary length byte[]'s everywhere. Variable-width in composite is a
> problem. Phoenix forces these values to only appear as the last position in
> the composite, as I understand it. Orderly provides explicit null and
> termination bytes by taking advantage of a feature of UTF-8 encoding.
> Support for bytes is equally ugly (but clever) in that byte digits are
> encoded in BCD. Both of these approaches bloat slightly the serialized
> representation over the natural representation, but they allow the
> variable-length types to be used anywhere within the compound type. As an
> added bonus regarding code maintainability, their serialization is entirely
> self-contained within the type. That's in contrast to the fixed-width type
> implementation described above, where null is explicitly encoded by the
> compound type.
>
> My opinion is the computational and storage overhead imposed by Orderly's
> implementation are worth the trade-off in flexibility in user consumption.
> Correct me if i'm wrong James, but you're saying, from your experience with
> Phoenix, users are willing to work within that constraint?
>
> Thanks,
> Nick
>

Re: HBase Types: Explicit Null Support

Posted by Nick Dimiduk <nd...@gmail.com>.

On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <jt...@salesforce.com>wrote:

> Maybe if we can keep nullability separate from the
> serialization/deserialization, we can come up with a solution that works?


I think implied null could work, but let's build out the matrix. I see two
kinds of types: fixed- and variable-width. These types are used in two
scenarios: on their own or as part of a compound type.

A fixed-width type used standalone can infer null from the absence of a value.
When used in a compound type, absence isn't enough to indicate null unless
it's the last value in the sequence. To support a null field in the middle
of the compound type, the encoding is forced to explicitly mark the field as
null. The only solution I can think of (without sacrificing the full value
range, per my original question) is to write the full type width bytes,
followed by an isNull byte. Thus, for example, the INT type consumes 4 bytes
when serialized stand-alone, but 5 bytes when composed.
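
To make the 4-versus-5-byte idea concrete, here is a rough sketch of a
nullable INT field as it might be written inside a compound key. Everything
in it is an assumption for illustration: the class name NullableIntField,
the choice of zero filler bytes for a null value, and the marker values
0x00 (null) and 0x01 (present) are hypothetical and not taken from Phoenix,
Orderly, or HBase. The sign bit is flipped so that big-endian bytes sort in
numeric order, and null sorts ahead of Integer.MIN_VALUE because the marker
byte breaks the tie.

```java
import java.nio.ByteBuffer;

/** Hypothetical 5-byte encoding of a nullable int for use inside a compound key. */
final class NullableIntField {
    static final int WIDTH = 5;               // 4 value bytes + 1 isNull byte
    private static final byte IS_NULL = 0x00; // assumed marker values
    private static final byte NOT_NULL = 0x01;

    /** Writes the value bytes first, then the isNull marker. */
    static byte[] encode(Integer value) {
        ByteBuffer buf = ByteBuffer.allocate(WIDTH); // big-endian by default
        if (value == null) {
            buf.putInt(0).put(IS_NULL);       // zero filler keeps null first in sort order
        } else {
            // Flipping the sign bit makes unsigned byte order match signed int order.
            buf.putInt(value ^ Integer.MIN_VALUE).put(NOT_NULL);
        }
        return buf.array();
    }

    /** Reads one field starting at offset; returns null if the marker says so. */
    static Integer decode(byte[] bytes, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(bytes, offset, WIDTH);
        int raw = buf.getInt();
        byte marker = buf.get();
        if (marker == IS_NULL) {
            return null;
        }
        return raw ^ Integer.MIN_VALUE;
    }

    /** Unsigned lexicographic comparison, the order HBase applies to row keys. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

The full int range survives, at the cost of the extra byte per field.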

James, how does Phoenix handle a null fixed-width rowkey component? I don't
see that implemented in PDataType enum.

Variable-width types used standalone are simple enough because HBase handles
arbitrary-length byte[]'s everywhere. Variable-width in a composite is a
problem. Phoenix forces these values to appear only in the last position of
the composite, as I understand it. Orderly provides explicit null and
termination bytes by taking advantage of a feature of UTF-8 encoding.
Support for bytes is equally ugly (but clever) in that byte digits are
encoded in BCD. Both of these approaches slightly bloat the serialized
representation over the natural representation, but they allow the
variable-length types to be used anywhere within the compound type. As an
added bonus regarding code maintainability, their serialization is entirely
self-contained within the type. That's in contrast to the fixed-width type
implementation described above, where null is explicitly encoded by the
compound type.
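
For comparison, here is a sketch of one way a variable-width value can carry
its own null marker and terminator so that it can sit anywhere in a compound
key. To be clear, this is not Orderly's UTF-8 or BCD scheme; it is a simpler
byte-escaping technique swapped in for illustration, and the class name
TerminatedBytes, the marker bytes, and the escape values are all assumptions.
Null is the single byte 0x00; a present value is 0x01, then the payload with
0x00 and 0x01 escaped, then a 0x00 terminator. Because the terminator is
smaller than any escaped payload byte, a value always sorts before any longer
value it prefixes, whatever field follows it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

/** Hypothetical self-delimiting, order-preserving encoding for nullable byte[]. */
final class TerminatedBytes {
    private static final byte NULL_MARK = 0x00;    // whole field is just this byte
    private static final byte PRESENT_MARK = 0x01; // precedes a non-null payload
    private static final byte TERMINATOR = 0x00;   // never appears inside a payload
    private static final byte ESCAPE = 0x01;       // 0x00 -> 01 01, 0x01 -> 01 02

    static byte[] encode(byte[] value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value == null) {
            out.write(NULL_MARK);
            return out.toByteArray();
        }
        out.write(PRESENT_MARK);
        for (byte b : value) {
            if (b == 0x00 || b == 0x01) {
                out.write(ESCAPE);
                out.write(b + 1);                  // keeps 0x00 < 0x01 < 0x02 ordering
            } else {
                out.write(b);
            }
        }
        out.write(TERMINATOR);
        return out.toByteArray();
    }

    /** Reads one field from the buffer's current position and advances past it. */
    static byte[] decode(ByteBuffer in) {
        if (in.get() == NULL_MARK) {
            return null;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte b = in.get(); b != TERMINATOR; b = in.get()) {
            out.write(b == ESCAPE ? in.get() - 1 : b);
        }
        return out.toByteArray();
    }

    /** Concatenates already-encoded fields into one compound key. */
    static byte[] concat(byte[]... fields) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] f : fields) {
            out.write(f, 0, f.length);
        }
        return out.toByteArray();
    }

    /** Unsigned lexicographic comparison, as applied to HBase row keys. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

The bloat is one marker byte, one terminator byte, and one extra byte for
each 0x00 or 0x01 in the payload.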

My opinion is that the computational and storage overhead imposed by Orderly's
implementation is worth the trade-off for the flexibility it gives users.
Correct me if I'm wrong, James, but you're saying, from your experience with
Phoenix, users are willing to work within that constraint?

Thanks,
Nick

Re: HBase Types: Explicit Null Support

Posted by James Taylor <jt...@salesforce.com>.

Maybe if we can keep nullability separate from the 
serialization/deserialization, we can come up with a solution that 
works? We're able to essentially infer that a column is null based on 
its value being missing or empty. So if an iterator through the row key 
bytes could detect/indicate that, then an application could "infer" the 
value is null.
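
As a rough illustration of what such an iterator could look like, the sketch
below walks a row key made of variable-width fields and reports null for any
field that is empty or missing. This is not Phoenix code: the class name
ImpliedNullRowKey, the 0x00 separator, and the assumption that field values
never contain 0x00 are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical reader that infers null from empty or absent row key fields. */
final class ImpliedNullRowKey {
    private static final byte SEPARATOR = 0x00; // assumed never to occur inside a value

    /**
     * Splits rowKey into exactly fieldCount entries. An empty segment, or a
     * trailing field that was never written, comes back as null.
     */
    static List<byte[]> fields(byte[] rowKey, int fieldCount) {
        List<byte[]> result = new ArrayList<>(fieldCount);
        int start = 0;
        for (int i = 0; i < fieldCount; i++) {
            if (start >= rowKey.length) {
                result.add(null);               // field is missing entirely
                continue;
            }
            int end = start;
            while (end < rowKey.length && rowKey[end] != SEPARATOR) {
                end++;
            }
            // A zero-length segment is treated as null rather than as an empty value.
            result.add(end == start ? null : Arrays.copyOfRange(rowKey, start, end));
            start = end + 1;                    // skip the separator
        }
        return result;
    }
}
```

Note the limitation this implies: an empty value and a null value are
indistinguishable, and nothing is written to the key to say which was meant.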

We're definitely planning on keeping byte[] accessors for use cases that
need it. I'm curious about the geographic data case, though: could you use
a fixed length long with a couple of new SQL built-ins to encode/decode
the latitude/longitude?
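
For reference, a pair of encode/decode functions along those lines might look
roughly like the sketch below. It is a guess at the general technique (bit
interleaving, also called a Z-order or Morton code), not Matt's actual code,
and the class name GeoKey, the 32-bit quantization, and the choice to put
latitude in the odd bits are assumptions. The resulting long is meant to be
compared as an unsigned value, which is what its big-endian bytes give you
in a row key.

```java
import java.nio.ByteBuffer;

/** Hypothetical lat/long to 64-bit Z-order key, compared as an unsigned long. */
final class GeoKey {
    private static final double MAX_CELL = 0xFFFFFFFFL; // 2^32 - 1 cells per axis

    static long encode(double lat, double lon) {
        long latCell = quantize(lat, -90.0, 90.0);
        long lonCell = quantize(lon, -180.0, 180.0);
        // Latitude bits land in odd positions, longitude bits in even positions.
        return (spread(latCell) << 1) | spread(lonCell);
    }

    /** Returns {lat, lon}, accurate to within half a quantization cell. */
    static double[] decode(long key) {
        double lat = compact(key >>> 1) / MAX_CELL * 180.0 - 90.0;
        double lon = compact(key) / MAX_CELL * 360.0 - 180.0;
        return new double[] {lat, lon};
    }

    /** Big-endian bytes, so unsigned lexicographic order equals unsigned long order. */
    static byte[] toBytes(long key) {
        return ByteBuffer.allocate(Long.BYTES).putLong(key).array();
    }

    /** Maps v in [min, max] onto an integer cell in [0, 2^32 - 1]. */
    private static long quantize(double v, double min, double max) {
        double clamped = Math.max(min, Math.min(max, v));
        return Math.round((clamped - min) / (max - min) * MAX_CELL);
    }

    /** Spreads the low 32 bits of x so that bit i moves to bit 2i. */
    private static long spread(long x) {
        x &= 0x00000000FFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8)) & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2)) & 0x3333333333333333L;
        x = (x | (x << 1)) & 0x5555555555555555L;
        return x;
    }

    /** Inverse of spread: gathers the even-position bits back into the low 32 bits. */
    private static long compact(long x) {
        x &= 0x5555555555555555L;
        x = (x | (x >>> 1)) & 0x3333333333333333L;
        x = (x | (x >>> 2)) & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x >>> 4)) & 0x00FF00FF00FF00FFL;
        x = (x | (x >>> 8)) & 0x0000FFFF0000FFFFL;
        x = (x | (x >>> 16)) & 0x00000000FFFFFFFFL;
        return x;
    }
}
```

A key like this uses every bit pattern of the long, so there is no spare
value left over to stand for null.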

On 04/01/2013 11:29 PM, Jesse Yates wrote:
> Actually, that isn't all that far-fetched of a format Matt - pretty common
> anytime anyone wants to do sortable lat/long (*cough* three letter agencies
> cough*).
>
> Wouldn't we get the same by providing a simple set of libraries (ala
> orderly + other HBase useful things) and then still give access to the
> underlying byte array? Perhaps a nullable key type in that lib makes sense
> if lots of people need it and it would be nice to have standard libraries
> so tools could interop much more easily.
> -------------------
> Jesse Yates
> @jesse_yates
> jyates.github.com
>
>
> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <mc...@hotpads.com> wrote:
>
>> Ah, I didn't even realize sql allowed null key parts.  Maybe a goal of the
>> interfaces should be to provide first-class support for custom user types
>> in addition to the standard ones included.  Part of the power of hbase's
>> plain byte[] keys is that users can concoct the perfect key for their data
>> type.  For example, I have a lot of geographic data where I interleave
>> latitude/longitude bits into a sortable 64 bit value that would probably
>> never be included in a standard library.
>>
>>
>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <en...@gmail.com> wrote:
>>
>>> I think having Int32, and NullableInt32 would support minimum overhead,
>> as
>>> well as allowing SQL semantics.
>>>
>>>
>>> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <nd...@gmail.com> wrote:
>>>
>>>> Furthermore, is it more important to support null values than squeeze all
>>>> representations into minimum size (4-bytes for int32, &c.)?
>>>> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
>>>>
>>>>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jtaylor@salesforce.com
>>>>> wrote:
>>>>>
>>>>>>  From the SQL perspective, handling null is important.
>>>>>
>>>>>  From your perspective, it is critical to support NULLs, even at the
>>>>> expense of fixed-width encodings at all or supporting representation
>>> of a
>>>>> full range of values. That is, you'd rather be able to represent NULL
>>>> than
>>>>> -2^31?
>>>>>
>>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
>>>>>>> Thanks for the thoughtful response (and code!).
>>>>>>>
>>>>>>> I'm thinking I will press forward with a base implementation that
>>> does
>>>>>>> not
>>>>>>> support nulls. The idea is to provide an extensible set of
>>> interfaces,
>>>>>>> so I
>>>>>>> think this will not box us into a corner later. That is, a
>> mirroring
>>>>>>> package could be implemented that supports null values and accepts
>>>>>>> the relevant trade-offs.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Nick
>>>>>>>
>>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mc...@hotpads.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>   I spent some time this weekend extracting bits of our
>> serialization
>>>>>>>> code to
>>>>>>>> a public github repo at http://github.com/hotpads/data-tools .
>>>>>>>>    Contributions are welcome - i'm sure we all have this stuff
>> laying
>>>>>>>> around.
>>>>>>>>
>>>>>>>> You can see I've bumped into the NULL problem in a few places:
>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>>>>>>>>
>>>>>>>> Looking back, I think my latest opinion on the topic is to reject
>>>>>>>> nullability as the rule since it can cause unexpected behavior and
>>>>>>>> confusion.  It's cleaner to provide a wrapper class (so both
>>>>>>>> LongArrayList
>>>>>>>> plus NullableLongArrayList) that explicitly defines the behavior,
>>> and
>>>>>>>> costs
>>>>>>>> a little more in performance.  If the user can't find a pre-made
>>>> wrapper
>>>>>>>> class, it's not very difficult for each user to provide their own
>>>>>>>> interpretation of null and check for it themselves.
>>>>>>>>
>>>>>>>> If you reject nullability, the question becomes what to do in
>>>> situations
>>>>>>>> where you're implementing existing interfaces that accept nullable
>>>>>>>> params.
>>>>>>>>    The LongArrayList above implements List<Long> which requires an
>>>>>>>> add(Long)
>>>>>>>> method.  In the above implementation I chose to swap nulls with
>>>>>>>> Long.MIN_VALUE, however I'm now thinking it best to force the user
>>> to
>>>>>>>> make
>>>>>>>> that swap and then throw IllegalArgumentException if they pass
>> null.
>>>>>>>>
>>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <
>>>>>>>> doug.meil@explorysmedical.com
>>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>> Hmmm... good question.
>>>>>>>>>
>>>>>>>>> I think that fixed width support is important for a great many
>>> rowkey
>>>>>>>>> constructs cases, so I'd rather see something like losing
>> MIN_VALUE
>>>> and
>>>>>>>>> keeping fixed width.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>   Heya,
>>>>>>>>>> Thinking about data types and serialization. I think null
>> support
>>> is
>>>>>>>>>> an
>>>>>>>>>> important characteristic for the serialized representations,
>>>>>>>>>> especially
>>>>>>>>>> when considering the compound type. However, doing so is directly
>>>>>>>>>> incompatible with fixed-width representations for numerics. For
>>>>>>>>>>
>>>>>>>>> instance,
>>>>>>>>> if we want to have a fixed-width signed long stored on 8-bytes,
>>> where
>>>>>>>>>> do
>>>>>>>>>> you put null? float and double types can cheat a little by
>> folding
>>>>>>>>>> negative
>>>>>>>>>> and positive NaN's into a single representation (this isn't
>>> strictly
>>>>>>>>>> correct!), leaving a place to represent null. In the long
>> example
>>>>>>>>>> case,
>>>>>>>>>> the
>>>>>>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by
>>> one.
>>>>>>>>>> This
>>>>>>>>>> will allocate an additional encoding which can be used for null.
>>> My
>>>>>>>>>> experience working with scientific data, however, makes me wince
>>> at
>>>>>>>>>> the
>>>>>>>>>> idea.
>>>>>>>>>>
>>>>>>>>>> The variable-width encodings have it a little easier. There's
>>>> already
>>>>>>>>>> enough going on that it's simpler to make room.
>>>>>>>>>>
>>>>>>>>>> Remember, the final goal is to support order-preserving
>>>> serialization.
>>>>>>>>>> This
>>>>>>>>>> imposes some limitations on our encoding strategies. For
>> instance,
>>>>>>>>>> it's
>>>>>>>>>> not
>>>>>>>>>> enough to simply encode null, it really needs to be encoded as
>>> 0x00
>>>> so
>>>>>>>>> as
>>>>>>>>> to sort lexicographically earlier than any other value.
>>>>>>>>>> What do you think? Any ideas, experiences, etc?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Nick
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>

Re: HBase Types: Explicit Null Support

Posted by Jesse Yates <je...@gmail.com>.

Actually, that isn't all that far-fetched of a format Matt - pretty common
anytime anyone wants to do sortable lat/long (*cough* three letter agencies
cough*).

Wouldn't we get the same by providing a simple set of libraries (ala
orderly + other HBase useful things) and then still give access to the
underlying byte array? Perhaps a nullable key type in that lib makes sense
if lots of people need it and it would be nice to have standard libraries
so tools could interop much more easily.
-------------------
Jesse Yates
@jesse_yates
jyates.github.com


On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <mc...@hotpads.com> wrote:

> Ah, I didn't even realize sql allowed null key parts.  Maybe a goal of the
> interfaces should be to provide first-class support for custom user types
> in addition to the standard ones included.  Part of the power of hbase's
> plain byte[] keys is that users can concoct the perfect key for their data
> type.  For example, I have a lot of geographic data where I interleave
> latitude/longitude bits into a sortable 64 bit value that would probably
> never be included in a standard library.
>
>
> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <en...@gmail.com> wrote:
>
> > I think having Int32, and NullableInt32 would support minimum overhead,
> as
> > well as allowing SQL semantics.
> >
> >
> > On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <nd...@gmail.com> wrote:
> >
> > > Furthermore, is it more important to support null values than squeeze all
> > > representations into minimum size (4-bytes for int32, &c.)?
> > > On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
> > >
> > > > On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jtaylor@salesforce.com
> > > >wrote:
> > > >
> > > >> From the SQL perspective, handling null is important.
> > > >
> > > >
> > > > From your perspective, it is critical to support NULLs, even at the
> > > > expense of fixed-width encodings at all or supporting representation
> > of a
> > > > full range of values. That is, you'd rather be able to represent NULL
> > > than
> > > > -2^31?
> > > >
> > > > On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
> > > >>
> > > >>> Thanks for the thoughtful response (and code!).
> > > >>>
> > > >>> I'm thinking I will press forward with a base implementation that
> > does
> > > >>> not
> > > >>> support nulls. The idea is to provide an extensible set of
> > interfaces,
> > > >>> so I
> > > >>> think this will not box us into a corner later. That is, a
> mirroring
> > > >>> package could be implemented that supports null values and accepts
> > > >>> the relevant trade-offs.
> > > >>>
> > > >>> Thanks,
> > > >>> Nick
> > > >>>
> > > >>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mc...@hotpads.com>
> > > >>> wrote:
> > > >>>
> > > >>>  I spent some time this weekend extracting bits of our
> serialization
> > > >>>> code to
> > > >>>> a public github repo at http://github.com/hotpads/data-tools .
> > > >>>>   Contributions are welcome - i'm sure we all have this stuff
> laying
> > > >>>> around.
> > > >>>>
> > > >>>> You can see I've bumped into the NULL problem in a few places:
> > > >>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
> > > >>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
> > > >>>>
> > > >>>> Looking back, I think my latest opinion on the topic is to reject
> > > >>>> nullability as the rule since it can cause unexpected behavior and
> > > >>>> confusion.  It's cleaner to provide a wrapper class (so both
> > > >>>> LongArrayList
> > > >>>> plus NullableLongArrayList) that explicitly defines the behavior,
> > and
> > > >>>> costs
> > > >>>> a little more in performance.  If the user can't find a pre-made
> > > wrapper
> > > >>>> class, it's not very difficult for each user to provide their own
> > > >>>> interpretation of null and check for it themselves.
> > > >>>>
> > > >>>> If you reject nullability, the question becomes what to do in
> > > situations
> > > >>>> where you're implementing existing interfaces that accept nullable
> > > >>>> params.
> > > >>>>   The LongArrayList above implements List<Long> which requires an
> > > >>>> add(Long)
> > > >>>> method.  In the above implementation I chose to swap nulls with
> > > >>>> Long.MIN_VALUE, however I'm now thinking it best to force the user
> > to
> > > >>>> make
> > > >>>> that swap and then throw IllegalArgumentException if they pass
> null.
> > > >>>>
> > > >>>>
> > > >>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <
> > > >>>> doug.meil@explorysmedical.com
> > > >>>>
> > > >>>>> wrote:
> > > > >>>>> Hmmm... good question.
> > > >>>>>
> > > >>>>> I think that fixed width support is important for a great many
> > rowkey
> > > >>>>> constructs cases, so I'd rather see something like losing
> MIN_VALUE
> > > and
> > > >>>>> keeping fixed width.
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
> > > >>>>>
> > > >>>>>  Heya,
> > > >>>>>>
> > > >>>>>> Thinking about data types and serialization. I think null
> support
> > is
> > > >>>>>> an
> > > >>>>>> important characteristic for the serialized representations,
> > > >>>>>> especially
> > > > >>>>>> when considering the compound type. However, doing so is directly
> > > > >>>>>> incompatible with fixed-width representations for numerics. For
> > > >>>>>>
> > > >>>>> instance,
> > > >>>>
> > > >>>>> if we want to have a fixed-width signed long stored on 8-bytes,
> > where
> > > >>>>>> do
> > > >>>>>> you put null? float and double types can cheat a little by
> folding
> > > >>>>>> negative
> > > >>>>>> and positive NaN's into a single representation (this isn't
> > strictly
> > > >>>>>> correct!), leaving a place to represent null. In the long
> example
> > > >>>>>> case,
> > > >>>>>> the
> > > >>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by
> > one.
> > > >>>>>> This
> > > >>>>>> will allocate an additional encoding which can be used for null.
> > My
> > > >>>>>> experience working with scientific data, however, makes me wince
> > at
> > > >>>>>> the
> > > >>>>>> idea.
> > > >>>>>>
> > > >>>>>> The variable-width encodings have it a little easier. There's
> > > already
> > > >>>>>> enough going on that it's simpler to make room.
> > > >>>>>>
> > > >>>>>> Remember, the final goal is to support order-preserving
> > > serialization.
> > > >>>>>> This
> > > >>>>>> imposes some limitations on our encoding strategies. For
> instance,
> > > >>>>>> it's
> > > >>>>>> not
> > > >>>>>> enough to simply encode null, it really needs to be encoded as
> > 0x00
> > > so
> > > >>>>>>
> > > >>>>> as
> > > >>>>
> > > >>>>> to sort lexicographically earlier than any other value.
> > > >>>>>>
> > > >>>>>> What do you think? Any ideas, experiences, etc?
> > > >>>>>>
> > > >>>>>> Thanks,
> > > >>>>>> Nick
> > > >>>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>
> > > >
> > >
> >
>

Re: HBase Types: Explicit Null Support

Posted by Matt Corgan <mc...@hotpads.com>.

Ah, I didn't even realize sql allowed null key parts.  Maybe a goal of the
interfaces should be to provide first-class support for custom user types
in addition to the standard ones included.  Part of the power of hbase's
plain byte[] keys is that users can concoct the perfect key for their data
type.  For example, I have a lot of geographic data where I interleave
latitude/longitude bits into a sortable 64 bit value that would probably
never be included in a standard library.
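
For readers unfamiliar with the trick, the bit interleaving mentioned above can be sketched roughly as follows. This is only an illustration, not code from data-tools or HBase: the class name GeoKey, its method names, and the choice to quantize each coordinate to 32 bits are all assumptions made for the example.

```java
import java.nio.ByteBuffer;

// Hypothetical helper: builds an 8-byte, byte-sortable key from a lat/lon pair.
final class GeoKey {
    private GeoKey() {}

    // Map a coordinate in [min, max] onto an unsigned 32-bit integer (held in a long).
    static long quantize(double value, double min, double max) {
        if (value < min || value > max) {
            throw new IllegalArgumentException("coordinate out of range: " + value);
        }
        double scaled = (value - min) / (max - min); // in [0, 1]
        return (long) (scaled * 0xFFFFFFFFL);
    }

    // Spread the low 32 bits of x so a zero bit sits between each original bit.
    static long spread(long x) {
        x &= 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8)) & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2)) & 0x3333333333333333L;
        x = (x | (x << 1)) & 0x5555555555555555L;
        return x;
    }

    // Latitude bits take the odd positions, longitude bits the even positions.
    static long interleave(double lat, double lon) {
        return (spread(quantize(lat, -90.0, 90.0)) << 1)
                | spread(quantize(lon, -180.0, 180.0));
    }

    // Big-endian bytes, so unsigned byte-wise comparison matches unsigned numeric order.
    static byte[] toKey(double lat, double lon) {
        return ByteBuffer.allocate(8).putLong(interleave(lat, lon)).array();
    }
}
```

Because the interleaved value is written big-endian and compared as unsigned bytes, plain byte[] comparison walks the keys along a Z-order curve, so points that are near each other on the map tend to end up near each other in the key space. A call such as GeoKey.toKey(37.77, -122.42) would produce the row key prefix.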


On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <en...@gmail.com> wrote:

> I think having Int32, and NullableInt32 would support minimum overhead, as
> well as allowing SQL semantics.
>
>
> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <nd...@gmail.com> wrote:
>
> > Furthermore, is it more important to support null values than squeeze all
> > representations into minimum size (4-bytes for int32, &c.)?
> > On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
> >
> > > On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jtaylor@salesforce.com
> > >wrote:
> > >
> > >> From the SQL perspective, handling null is important.
> > >
> > >
> > > From your perspective, it is critical to support NULLs, even at the
> > > expense of fixed-width encodings at all or supporting representation
> of a
> > > full range of values. That is, you'd rather be able to represent NULL
> > than
> > > -2^31?
> > >
> > > On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
> > >>
> > >>> Thanks for the thoughtful response (and code!).
> > >>>
> > >>> I'm thinking I will press forward with a base implementation that
> does
> > >>> not
> > >>> support nulls. The idea is to provide an extensible set of
> interfaces,
> > >>> so I
> > >>> think this will not box us into a corner later. That is, a mirroring
> > >>> package could be implemented that supports null values and accepts
> > >>> the relevant trade-offs.
> > >>>
> > >>> Thanks,
> > >>> Nick
> > >>>
> > >>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mc...@hotpads.com>
> > >>> wrote:
> > >>>
> > >>>  I spent some time this weekend extracting bits of our serialization
> > >>>> code to
> > >>>> a public github repo at http://github.com/hotpads/data-tools.
> > >>>>   Contributions are welcome - i'm sure we all have this stuff laying
> > >>>> around.
> > >>>>
> > >>>> You can see I've bumped into the NULL problem in a few places:
> > >>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
> > >>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
> > >>>>
> > >>>> Looking back, I think my latest opinion on the topic is to reject
> > >>>> nullability as the rule since it can cause unexpected behavior and
> > >>>> confusion.  It's cleaner to provide a wrapper class (so both
> > >>>> LongArrayList
> > >>>> plus NullableLongArrayList) that explicitly defines the behavior,
> and
> > >>>> costs
> > >>>> a little more in performance.  If the user can't find a pre-made
> > wrapper
> > >>>> class, it's not very difficult for each user to provide their own
> > >>>> interpretation of null and check for it themselves.
> > >>>>
> > >>>> If you reject nullability, the question becomes what to do in
> > situations
> > >>>> where you're implementing existing interfaces that accept nullable
> > >>>> params.
> > >>>>   The LongArrayList above implements List<Long> which requires an
> > >>>> add(Long)
> > >>>> method.  In the above implementation I chose to swap nulls with
> > >>>> Long.MIN_VALUE, however I'm now thinking it best to force the user
> to
> > >>>> make
> > >>>> that swap and then throw IllegalArgumentException if they pass null.
> > >>>>
> > >>>>
> > >>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <
> > >>>> doug.meil@explorysmedical.com
> > >>>>
> > >>>>> wrote:
> > >>>>> Hmmm... good question.
> > >>>>>
> > >>>>> I think that fixed width support is important for a great many
> rowkey
> > >>>>> constructs cases, so I'd rather see something like losing MIN_VALUE
> > and
> > >>>>> keeping fixed width.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
> > >>>>>
> > >>>>>  Heya,
> > >>>>>>
> > >>>>>> Thinking about data types and serialization. I think null support
> is
> > >>>>>> an
> > >>>>>> important characteristic for the serialized representations,
> > >>>>>> especially
> > >>>>>> when considering the compound type. However, doing so is directly
> > >>>>>> incompatible with fixed-width representations for numerics. For
> > >>>>>>
> > >>>>> instance,
> > >>>>
> > >>>>> if we want to have a fixed-width signed long stored on 8-bytes,
> where
> > >>>>>> do
> > >>>>>> you put null? float and double types can cheat a little by folding
> > >>>>>> negative
> > >>>>>> and positive NaN's into a single representation (this isn't
> strictly
> > >>>>>> correct!), leaving a place to represent null. In the long example
> > >>>>>> case,
> > >>>>>> the
> > >>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by
> one.
> > >>>>>> This
> > >>>>>> will allocate an additional encoding which can be used for null.
> My
> > >>>>>> experience working with scientific data, however, makes me wince
> at
> > >>>>>> the
> > >>>>>> idea.
> > >>>>>>
> > >>>>>> The variable-width encodings have it a little easier. There's
> > already
> > >>>>>> enough going on that it's simpler to make room.
> > >>>>>>
> > >>>>>> Remember, the final goal is to support order-preserving
> > serialization.
> > >>>>>> This
> > >>>>>> imposes some limitations on our encoding strategies. For instance,
> > >>>>>> it's
> > >>>>>> not
> > >>>>>> enough to simply encode null, it really needs to be encoded as
> 0x00
> > so
> > >>>>>>
> > >>>>> as
> > >>>>
> > >>>>> to sort lexicographically earlier than any other value.
> > >>>>>>
> > >>>>>> What do you think? Any ideas, experiences, etc?
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Nick
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>
> > >
> >
>

Re: HBase Types: Explicit Null Support

Posted by Enis Söztutar <en...@gmail.com>.

I think having Int32, and NullableInt32 would support minimum overhead, as
well as allowing SQL semantics.
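
As a rough sketch of what that split could look like on the wire (the class and method names here are made up for illustration and are not an existing HBase API): the plain type stays at 4 bytes with the sign bit flipped so bytes sort in numeric order, while the nullable variant spends one leading marker byte, with a lone 0x00 standing for null so that it sorts before every non-null value.

```java
import java.nio.ByteBuffer;

// Hypothetical codecs; names and layout are assumptions for this example only.
final class Int32Keys {
    private static final byte NULL_MARKER = 0x00;
    private static final byte VALUE_MARKER = 0x01;

    private Int32Keys() {}

    // Non-nullable: exactly 4 bytes, full int range, sign bit flipped for byte ordering.
    static byte[] encode(int v) {
        return ByteBuffer.allocate(4).putInt(v ^ Integer.MIN_VALUE).array();
    }

    // Nullable: 1 byte for null, otherwise a marker byte plus the 4-byte encoding.
    static byte[] encodeNullable(Integer v) {
        if (v == null) {
            return new byte[] { NULL_MARKER };
        }
        return ByteBuffer.allocate(5).put(VALUE_MARKER).putInt(v ^ Integer.MIN_VALUE).array();
    }

    static Integer decodeNullable(byte[] b) {
        if (b[0] == NULL_MARKER) {
            return null;
        }
        return ByteBuffer.wrap(b, 1, 4).getInt() ^ Integer.MIN_VALUE;
    }

    // Unsigned lexicographic comparison, the order in which byte[] row keys sort.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

The trade-off is visible in the lengths: the nullable form is no longer fixed-width (1 byte for null, 5 otherwise), which is the cost the non-nullable Int32 avoids while still covering the full range down to -2^31.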


On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <nd...@gmail.com> wrote:

> Furthermore, is it more important to support null values than squeeze all
> representations into minimum size (4-bytes for int32, &c.)?
> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
>
> > On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jtaylor@salesforce.com
> >wrote:
> >
> >> From the SQL perspective, handling null is important.
> >
> >
> > From your perspective, it is critical to support NULLs, even at the
> > expense of fixed-width encodings at all or supporting representation of a
> > full range of values. That is, you'd rather be able to represent NULL
> than
> > -2^31?
> >
> > On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
> >>
> >>> Thanks for the thoughtful response (and code!).
> >>>
> >>> I'm thinking I will press forward with a base implementation that does
> >>> not
> >>> support nulls. The idea is to provide an extensible set of interfaces,
> >>> so I
> >>> think this will not box us into a corner later. That is, a mirroring
> >>> package could be implemented that supports null values and accepts
> >>> the relevant trade-offs.
> >>>
> >>> Thanks,
> >>> Nick
> >>>
> >>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mc...@hotpads.com>
> >>> wrote:
> >>>
> >>>  I spent some time this weekend extracting bits of our serialization
> >>>> code to
> >>>> a public github repo at http://github.com/hotpads/data-tools.
> >>>>   Contributions are welcome - i'm sure we all have this stuff laying
> >>>> around.
> >>>>
> >>>> You can see I've bumped into the NULL problem in a few places:
> >>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
> >>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
> >>>>
> >>>> Looking back, I think my latest opinion on the topic is to reject
> >>>> nullability as the rule since it can cause unexpected behavior and
> >>>> confusion.  It's cleaner to provide a wrapper class (so both
> >>>> LongArrayList
> >>>> plus NullableLongArrayList) that explicitly defines the behavior, and
> >>>> costs
> >>>> a little more in performance.  If the user can't find a pre-made
> wrapper
> >>>> class, it's not very difficult for each user to provide their own
> >>>> interpretation of null and check for it themselves.
> >>>>
> >>>> If you reject nullability, the question becomes what to do in
> situations
> >>>> where you're implementing existing interfaces that accept nullable
> >>>> params.
> >>>>   The LongArrayList above implements List<Long> which requires an
> >>>> add(Long)
> >>>> method.  In the above implementation I chose to swap nulls with
> >>>> Long.MIN_VALUE, however I'm now thinking it best to force the user to
> >>>> make
> >>>> that swap and then throw IllegalArgumentException if they pass null.
> >>>>
> >>>>
> >>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <
> >>>> doug.meil@explorysmedical.com
> >>>>
> >>>>> wrote:
> >>>>> Hmmm... good question.
> >>>>>
> >>>>> I think that fixed width support is important for a great many rowkey
> >>>>> constructs cases, so I'd rather see something like losing MIN_VALUE
> and
> >>>>> keeping fixed width.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
> >>>>>
> >>>>>  Heya,
> >>>>>>
> >>>>>> Thinking about data types and serialization. I think null support is
> >>>>>> an
> >>>>>> important characteristic for the serialized representations,
> >>>>>> especially
> >>>>>> when considering the compound type. However, doing so is directly
> >>>>>> incompatible with fixed-width representations for numerics. For
> >>>>>>
> >>>>> instance,
> >>>>
> >>>>> if we want to have a fixed-width signed long stored on 8-bytes, where
> >>>>>> do
> >>>>>> you put null? float and double types can cheat a little by folding
> >>>>>> negative
> >>>>>> and positive NaN's into a single representation (this isn't strictly
> >>>>>> correct!), leaving a place to represent null. In the long example
> >>>>>> case,
> >>>>>> the
> >>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one.
> >>>>>> This
> >>>>>> will allocate an additional encoding which can be used for null. My
> >>>>>> experience working with scientific data, however, makes me wince at
> >>>>>> the
> >>>>>> idea.
> >>>>>>
> >>>>>> The variable-width encodings have it a little easier. There's
> already
> >>>>>> enough going on that it's simpler to make room.
> >>>>>>
> >>>>>> Remember, the final goal is to support order-preserving
> serialization.
> >>>>>> This
> >>>>>> imposes some limitations on our encoding strategies. For instance,
> >>>>>> it's
> >>>>>> not
> >>>>>> enough to simply encode null, it really needs to be encoded as 0x00
> so
> >>>>>>
> >>>>> as
> >>>>
> >>>>> to sort lexicographically earlier than any other value.
> >>>>>>
> >>>>>> What do you think? Any ideas, experiences, etc?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Nick
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>
> >
>

Re: HBase Types: Explicit Null Support

Posted by Nick Dimiduk <nd...@gmail.com>.

Furthermore, is it more important to support null values than squeeze all
representations into minimum size (4-bytes for int32, &c.)?
On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:

> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jt...@salesforce.com>wrote:
>
>> From the SQL perspective, handling null is important.
>
>
> From your perspective, it is critical to support NULLs, even at the
> expense of fixed-width encodings at all or supporting representation of a
> full range of values. That is, you'd rather be able to represent NULL than
> -2^31?
>
> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
>>
>>> Thanks for the thoughtful response (and code!).
>>>
>>> I'm thinking I will press forward with a base implementation that does
>>> not
>>> support nulls. The idea is to provide an extensible set of interfaces,
>>> so I
>>> think this will not box us into a corner later. That is, a mirroring
>>> package could be implemented that supports null values and accepts
>>> the relevant trade-offs.
>>>
>>> Thanks,
>>> Nick
>>>
>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mc...@hotpads.com>
>>> wrote:
>>>
>>>  I spent some time this weekend extracting bits of our serialization
>>>> code to
>>>> a public github repo at http://github.com/hotpads/data-tools.
>>>>   Contributions are welcome - i'm sure we all have this stuff laying
>>>> around.
>>>>
>>>> You can see I've bumped into the NULL problem in a few places:
>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>>>>
>>>> Looking back, I think my latest opinion on the topic is to reject
>>>> nullability as the rule since it can cause unexpected behavior and
>>>> confusion.  It's cleaner to provide a wrapper class (so both
>>>> LongArrayList
>>>> plus NullableLongArrayList) that explicitly defines the behavior, and
>>>> costs
>>>> a little more in performance.  If the user can't find a pre-made wrapper
>>>> class, it's not very difficult for each user to provide their own
>>>> interpretation of null and check for it themselves.
>>>>
>>>> If you reject nullability, the question becomes what to do in situations
>>>> where you're implementing existing interfaces that accept nullable
>>>> params.
>>>>   The LongArrayList above implements List<Long> which requires an
>>>> add(Long)
>>>> method.  In the above implementation I chose to swap nulls with
>>>> Long.MIN_VALUE, however I'm now thinking it best to force the user to
>>>> make
>>>> that swap and then throw IllegalArgumentException if they pass null.
>>>>
>>>>
>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <
>>>> doug.meil@explorysmedical.com
>>>>
>>>>> wrote:
>>>>> Hmmm... good question.
>>>>>
>>>>> I think that fixed width support is important for a great many rowkey
>>>>> constructs cases, so I'd rather see something like losing MIN_VALUE and
>>>>> keeping fixed width.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
>>>>>
>>>>>  Heya,
>>>>>>
>>>>>> Thinking about data types and serialization. I think null support is
>>>>>> an
>>>>>> important characteristic for the serialized representations,
>>>>>> especially
>>>>>> when considering the compound type. However, doing so in directly
>>>>>> incompatible with fixed-width representations for numerics. For
>>>>>>
>>>>> instance,
>>>>
>>>>> if we want to have a fixed-width signed long stored on 8-bytes, where
>>>>>> do
>>>>>> you put null? float and double types can cheat a little by folding
>>>>>> negative
>>>>>> and positive NaN's into a single representation (this isn't strictly
>>>>>> correct!), leaving a place to represent null. In the long example
>>>>>> case,
>>>>>> the
>>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one.
>>>>>> This
>>>>>> will allocate an additional encoding which can be used for null. My
>>>>>> experience working with scientific data, however, makes me wince at
>>>>>> the
>>>>>> idea.
>>>>>>
>>>>>> The variable-width encodings have it a little easier. There's already
>>>>>> enough going on that it's simpler to make room.
>>>>>>
>>>>>> Remember, the final goal is to support order-preserving serialization.
>>>>>> This
>>>>>> imposes some limitations on our encoding strategies. For instance,
>>>>>> it's
>>>>>> not
>>>>>> enough to simply encode null, it really needs to be encoded as 0x00 so
>>>>>>
>>>>> as
>>>>
>>>>> to sort lexicographically earlier than any other value.
>>>>>>
>>>>>> What do you think? Any ideas, experiences, etc?
>>>>>>
>>>>>> Thanks,
>>>>>> Nick
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>
>

Re: HBase Types: Explicit Null Support

Posted by James Taylor <jt...@salesforce.com>.

Since SQL allows null-valued composite key parts, we needed to support it.

On 04/01/2013 05:10 PM, Ted Yu wrote:
> bq. I create a dummy qualifier with a dummy value
>
> For any single application, the above can be done.
> For generic applications, how would we do this ?
>
> Thanks
>
>
> On Mon, Apr 1, 2013 at 5:07 PM, Matt Corgan <mc...@hotpads.com> wrote:
>
>> I generally don't allow nulls in my composite row keys.  Does SQL allow
>> nulls in the PK?  In the rare case I wanted to do that I might create a
>> separate format called NullableCInt32 with 5 bytes where the first one
>> determined null.  It's important to keep the pure types pure.
>>
>> I have lots of null *values* however, but they're represented by lack of a
>> qualifier in the Put.  If a row has all null values, I create a dummy
>> qualifier with a dummy value to make sure the row key gets inserted as it
>> would in sql.
>>
>>
>> On Mon, Apr 1, 2013 at 4:49 PM, James Taylor <jt...@salesforce.com>
>> wrote:
>>
>>> On 04/01/2013 04:41 PM, Nick Dimiduk wrote:
>>>
>>>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jt...@salesforce.com>
>>>> wrote:
>>>>
>>>>    From the SQL perspective, handling null is important.
>>>>   From your perspective, it is critical to support NULLs, even at the
>>>> expense
>>>> of fixed-width encodings at all or supporting representation of a full
>>>> range of values. That is, you'd rather be able to represent NULL than
>>>> -2^31?
>>>>
>>> We've been able to get away with supporting NULL through the absence of
>>> the value rather than restricting the data range. We haven't had any push
>>> back on not allowing a fixed width nullable leading row key column. Since
>>> our variable length DECIMAL supports null and is a superset of the fixed
>>> width numeric types, users have a reasonable alternative.
>>>
>>> I'd rather not restrict the range of values, since it doesn't seem like
>>> this would be necessary.
>>>
>>>
>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
>>>>
>>>>> Thanks for the thoughtful response (and code!).
>>>>>> I'm thinking I will press forward with a base implementation that does
>>>>>> not
>>>>>> support nulls. The idea is to provide an extensible set of interfaces,
>>>>>> so
>>>>>> I
>>>>>> think this will not box us into a corner later. That is, a mirroring
>>>>>> package could be implemented that supports null values and accepts
>>>>>> the relevant trade-offs.
>>>>>>
>>>>>> Thanks,
>>>>>> Nick
>>>>>>
>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mc...@hotpads.com>
>>>>>> wrote:
>>>>>>
>>>>>>    I spent some time this weekend extracting bits of our serialization
>>>>>> code
>>>>>>
>>>>>>> to
>>>>>>> a public github repo at http://github.com/hotpads/data-tools
>>>>>>> .
>>>>>>>     Contributions are welcome - i'm sure we all have this stuff laying
>>>>>>> around.
>>>>>>>
>>>>>>> You can see I've bumped into the NULL problem in a few places:
>>>>>>> *
>>>>>>>
>>>>>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
>>>>>>> *
>>>>>>>
>>>>>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>>>>>>>
>>>>>>> Looking back, I think my latest opinion on the topic is to reject
>>>>>>> nullability as the rule since it can cause unexpected behavior and
>>>>>>> confusion.  It's cleaner to provide a wrapper class (so both
>>>>>>> LongArrayList
>>>>>>> plus NullableLongArrayList) that explicitly defines the behavior, and
>>>>>>> costs
>>>>>>> a little more in performance.  If the user can't find a pre-made
>>>>>>> wrapper
>>>>>>> class, it's not very difficult for each user to provide their own
>>>>>>> interpretation of null and check for it themselves.
>>>>>>>
>>>>>>> If you reject nullability, the question becomes what to do in
>>>>>>> situations
>>>>>>> where you're implementing existing interfaces that accept nullable
>>>>>>> params.
>>>>>>>     The LongArrayList above implements List<Long> which requires an
>>>>>>> add(Long)
>>>>>>> method.  In the above implementation I chose to swap nulls with
>>>>>>> Long.MIN_VALUE, however I'm now thinking it best to force the user to
>>>>>>> make
>>>>>>> that swap and then throw IllegalArgumentException if they pass null.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <
>>>>>>> doug.meil@explorysmedical.com
>>>>>>>
>>>>>>>   wrote:
>>>>>>>> Hmmm... good question.
>>>>>>>>
>>>>>>>> I think that fixed width support is important for a great many
>> rowkey
>>>>>>>> constructs cases, so I'd rather see something like losing MIN_VALUE
>>>>>>>> and
>>>>>>>> keeping fixed width.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>    Heya,
>>>>>>>>
>>>>>>>>> Thinking about data types and serialization. I think null support
>> is
>>>>>>>>> an
>>>>>>>>> important characteristic for the serialized representations,
>>>>>>>>> especially
>>>>>>>>> when considering the compound type. However, doing so in directly
>>>>>>>>> incompatible with fixed-width representations for numerics. For
>>>>>>>>>
>>>>>>>>>   instance,
>>>>>>>> if we want to have a fixed-width signed long stored on 8-bytes,
>> where
>>>>>>>> do
>>>>>>>>
>>>>>>>>> you put null? float and double types can cheat a little by folding
>>>>>>>>> negative
>>>>>>>>> and positive NaN's into a single representation (this isn't
>> strictly
>>>>>>>>> correct!), leaving a place to represent null. In the long example
>>>>>>>>> case,
>>>>>>>>> the
>>>>>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one.
>>>>>>>>> This
>>>>>>>>> will allocate an additional encoding which can be used for null. My
>>>>>>>>> experience working with scientific data, however, makes me wince at
>>>>>>>>> the
>>>>>>>>> idea.
>>>>>>>>>
>>>>>>>>> The variable-width encodings have it a little easier. There's
>> already
>>>>>>>>> enough going on that it's simpler to make room.
>>>>>>>>>
>>>>>>>>> Remember, the final goal is to support order-preserving
>>>>>>>>> serialization.
>>>>>>>>> This
>>>>>>>>> imposes some limitations on our encoding strategies. For instance,
>>>>>>>>> it's
>>>>>>>>> not
>>>>>>>>> enough to simply encode null, it really needs to be encoded as 0x00
>>>>>>>>> so
>>>>>>>>>
>>>>>>>>>   as
>>>>>>>> to sort lexicographically earlier than any other value.
>>>>>>>>
>>>>>>>>> What do you think? Any ideas, experiences, etc?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Nick
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>

Re: HBase Types: Explicit Null Support

Posted by Ted Yu <yu...@gmail.com>.

bq. I create a dummy qualifier with a dummy value

For any single application, the above can be done.
For generic applications, how would we do this?

Thanks


On Mon, Apr 1, 2013 at 5:07 PM, Matt Corgan <mc...@hotpads.com> wrote:

> I generally don't allow nulls in my composite row keys.  Does SQL allow
> nulls in the PK?  In the rare case I wanted to do that I might create a
> separate format called NullableCInt32 with 5 bytes where the first one
> determined null.  It's important to keep the pure types pure.
>
> I have lots of null *values* however, but they're represented by lack of a
> qualifier in the Put.  If a row has all null values, I create a dummy
> qualifier with a dummy value to make sure the row key gets inserted as it
> would in sql.
>
>
> On Mon, Apr 1, 2013 at 4:49 PM, James Taylor <jt...@salesforce.com>
> wrote:
>
> > On 04/01/2013 04:41 PM, Nick Dimiduk wrote:
> >
> >> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jt...@salesforce.com>
> >> wrote:
> >>
> >>   From the SQL perspective, handling null is important.
> >>>
> >>
> >>  From your perspective, it is critical to support NULLs, even at the
> >> expense
> >> of fixed-width encodings at all or supporting representation of a full
> >> range of values. That is, you'd rather be able to represent NULL than
> >> -2^31?
> >>
> > We've been able to get away with supporting NULL through the absence of
> > the value rather than restricting the data range. We haven't had any push
> > back on not allowing a fixed width nullable leading row key column. Since
> > our variable length DECIMAL supports null and is a superset of the fixed
> > width numeric types, users have a reasonable alternative.
> >
> > I'd rather not restrict the range of values, since it doesn't seem like
> > this would be necessary.
> >
> >
> >> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
> >>
> >>> Thanks for the thoughtful response (and code!).
> >>>>
> >>>> I'm thinking I will press forward with a base implementation that does
> >>>> not
> >>>> support nulls. The idea is to provide an extensible set of interfaces,
> >>>> so
> >>>> I
> >>>> think this will not box us into a corner later. That is, a mirroring
> >>>> package could be implemented that supports null values and accepts
> >>>> the relevant trade-offs.
> >>>>
> >>>> Thanks,
> >>>> Nick
> >>>>
> >>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mc...@hotpads.com>
> >>>> wrote:
> >>>>
> >>>>   I spent some time this weekend extracting bits of our serialization
> >>>> code
> >>>>
> >>>>> to
> >>>>> a public github repo at http://github.com/hotpads/data-tools
> >>>>> .
> >>>>>    Contributions are welcome - i'm sure we all have this stuff laying
> >>>>> around.
> >>>>>
> >>>>> You can see I've bumped into the NULL problem in a few places:
> >>>>> *
> >>>>>
> >>>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
> >>>>> *
> >>>>>
> >>>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
> >>>>>
> >>>>> Looking back, I think my latest opinion on the topic is to reject
> >>>>> nullability as the rule since it can cause unexpected behavior and
> >>>>> confusion.  It's cleaner to provide a wrapper class (so both
> >>>>> LongArrayList
> >>>>> plus NullableLongArrayList) that explicitly defines the behavior, and
> >>>>> costs
> >>>>> a little more in performance.  If the user can't find a pre-made
> >>>>> wrapper
> >>>>> class, it's not very difficult for each user to provide their own
> >>>>> interpretation of null and check for it themselves.
> >>>>>
> >>>>> If you reject nullability, the question becomes what to do in
> >>>>> situations
> >>>>> where you're implementing existing interfaces that accept nullable
> >>>>> params.
> >>>>>    The LongArrayList above implements List<Long> which requires an
> >>>>> add(Long)
> >>>>> method.  In the above implementation I chose to swap nulls with
> >>>>> Long.MIN_VALUE, however I'm now thinking it best to force the user to
> >>>>> make
> >>>>> that swap and then throw IllegalArgumentException if they pass null.
> >>>>>
> >>>>>
> >>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <
> >>>>> doug.meil@explorysmedical.com
> >>>>>
> >>>>>  wrote:
> >>>>>> Hmmm... good question.
> >>>>>>
> >>>>>> I think that fixed width support is important for a great many
> rowkey
> >>>>>> constructs cases, so I'd rather see something like losing MIN_VALUE
> >>>>>> and
> >>>>>> keeping fixed width.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
> >>>>>>
> >>>>>>   Heya,
> >>>>>>
> >>>>>>> Thinking about data types and serialization. I think null support
> is
> >>>>>>> an
> >>>>>>> important characteristic for the serialized representations,
> >>>>>>> especially
> >>>>>>> when considering the compound type. However, doing so in directly
> >>>>>>> incompatible with fixed-width representations for numerics. For
> >>>>>>>
> >>>>>>>  instance,
> >>>>>> if we want to have a fixed-width signed long stored on 8-bytes,
> where
> >>>>>> do
> >>>>>>
> >>>>>>> you put null? float and double types can cheat a little by folding
> >>>>>>> negative
> >>>>>>> and positive NaN's into a single representation (this isn't
> strictly
> >>>>>>> correct!), leaving a place to represent null. In the long example
> >>>>>>> case,
> >>>>>>> the
> >>>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one.
> >>>>>>> This
> >>>>>>> will allocate an additional encoding which can be used for null. My
> >>>>>>> experience working with scientific data, however, makes me wince at
> >>>>>>> the
> >>>>>>> idea.
> >>>>>>>
> >>>>>>> The variable-width encodings have it a little easier. There's
> already
> >>>>>>> enough going on that it's simpler to make room.
> >>>>>>>
> >>>>>>> Remember, the final goal is to support order-preserving
> >>>>>>> serialization.
> >>>>>>> This
> >>>>>>> imposes some limitations on our encoding strategies. For instance,
> >>>>>>> it's
> >>>>>>> not
> >>>>>>> enough to simply encode null, it really needs to be encoded as 0x00
> >>>>>>> so
> >>>>>>>
> >>>>>>>  as
> >>>>>> to sort lexicographically earlier than any other value.
> >>>>>>
> >>>>>>> What do you think? Any ideas, experiences, etc?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Nick
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >
>

Re: HBase Types: Explicit Null Support

Posted by Matt Corgan <mc...@hotpads.com>.

I generally don't allow nulls in my composite row keys.  Does SQL allow
nulls in the PK?  In the rare case where I wanted to do that, I might create a
separate format called NullableCInt32 with 5 bytes, where the first byte
indicates null.  It's important to keep the pure types pure.
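
A minimal sketch of the 5-byte nullable format described above, assuming a
leading marker byte (0x00 for null, 0x01 for a present value) followed by the
int with its sign bit flipped, so that unsigned byte-wise comparison matches
numeric order and null sorts first. The class name NullableInt32, the marker
values, and the helper methods are hypothetical and are not taken from
data-tools or HBase.

```java
/** Hypothetical 5-byte nullable int32: one marker byte, then four value bytes. */
final class NullableInt32 {
    static final int WIDTH = 5;
    private static final byte NULL_MARKER = 0x00;    // assumed: sorts before any present value
    private static final byte PRESENT_MARKER = 0x01; // assumed marker for a non-null value

    /** Encodes value (or null) into exactly WIDTH bytes. */
    static byte[] encode(Integer value) {
        byte[] out = new byte[WIDTH];                // zero-filled, so null is 00 00 00 00 00
        if (value == null) {
            return out;
        }
        out[0] = PRESENT_MARKER;
        int flipped = value ^ Integer.MIN_VALUE;     // flip the sign bit: negatives sort before positives
        out[1] = (byte) (flipped >>> 24);
        out[2] = (byte) (flipped >>> 16);
        out[3] = (byte) (flipped >>> 8);
        out[4] = (byte) flipped;
        return out;
    }

    /** Decodes bytes produced by encode; returns null for the null marker. */
    static Integer decode(byte[] bytes) {
        if (bytes.length != WIDTH) {
            throw new IllegalArgumentException("expected " + WIDTH + " bytes");
        }
        if (bytes[0] == NULL_MARKER) {
            return null;
        }
        int flipped = ((bytes[1] & 0xFF) << 24)
                | ((bytes[2] & 0xFF) << 16)
                | ((bytes[3] & 0xFF) << 8)
                | (bytes[4] & 0xFF);
        return flipped ^ Integer.MIN_VALUE;          // undo the sign-bit flip
    }

    /** Lexicographic comparison of unsigned bytes, the order used for row keys. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

The extra byte keeps the full int range available, which is the trade-off
against reserving MIN_VALUE in a 4-byte format.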

I do have lots of null *values*, but they're represented by the lack of a
qualifier in the Put.  If a row has all null values, I create a dummy
qualifier with a dummy value to make sure the row key gets inserted as it
would in SQL.
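
The dummy-qualifier approach could look roughly like the following. This is a
toy in-memory model rather than the HBase client API; the class name
SparseRowWriter, the placeholder qualifier "_exists", and the one-byte
placeholder value are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of choosing which cells to write for one row. */
final class SparseRowWriter {
    static final String PLACEHOLDER_QUALIFIER = "_exists"; // hypothetical dummy qualifier
    static final byte[] PLACEHOLDER_VALUE = new byte[] {0}; // hypothetical dummy value

    /**
     * Returns the cells to store for a row. Columns whose value is null are
     * skipped entirely; a row with no non-null columns gets one placeholder
     * cell so that the row key is still written.
     */
    static Map<String, byte[]> cellsFor(Map<String, byte[]> columns) {
        Map<String, byte[]> cells = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> entry : columns.entrySet()) {
            if (entry.getValue() != null) {
                cells.put(entry.getKey(), entry.getValue());
            }
        }
        if (cells.isEmpty()) {
            cells.put(PLACEHOLDER_QUALIFIER, PLACEHOLDER_VALUE);
        }
        return cells;
    }
}
```

A reader would then treat a missing qualifier as null and ignore the
placeholder qualifier when mapping cells back to columns.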


On Mon, Apr 1, 2013 at 4:49 PM, James Taylor <jt...@salesforce.com> wrote:

> On 04/01/2013 04:41 PM, Nick Dimiduk wrote:
>
>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jt...@salesforce.com>
>> wrote:
>>
>>   From the SQL perspective, handling null is important.
>>>
>>
>>  From your perspective, it is critical to support NULLs, even at the
>> expense
>> of fixed-width encodings at all or supporting representation of a full
>> range of values. That is, you'd rather be able to represent NULL than
>> -2^31?
>>
> We've been able to get away with supporting NULL through the absence of
> the value rather than restricting the data range. We haven't had any push
> back on not allowing a fixed width nullable leading row key column. Since
> our variable length DECIMAL supports null and is a superset of the fixed
> width numeric types, users have a reasonable alternative.
>
> I'd rather not restrict the range of values, since it doesn't seem like
> this would be necessary.
>
>
>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
>>
>>> Thanks for the thoughtful response (and code!).
>>>>
>>>> I'm thinking I will press forward with a base implementation that does
>>>> not
>>>> support nulls. The idea is to provide an extensible set of interfaces,
>>>> so
>>>> I
>>>> think this will not box us into a corner later. That is, a mirroring
>>>> package could be implemented that supports null values and accepts
>>>> the relevant trade-offs.
>>>>
>>>> Thanks,
>>>> Nick
>>>>
>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mc...@hotpads.com>
>>>> wrote:
>>>>
>>>>   I spent some time this weekend extracting bits of our serialization
>>>> code
>>>>
>>>>> to
>>>>> a public github repo at http://github.com/hotpads/data-tools
>>>>> .
>>>>>    Contributions are welcome - i'm sure we all have this stuff laying
>>>>> around.
>>>>>
>>>>> You can see I've bumped into the NULL problem in a few places:
>>>>> *
>>>>>
>>>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
>>>>> *
>>>>>
>>>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>>>>>
>>>>> Looking back, I think my latest opinion on the topic is to reject
>>>>> nullability as the rule since it can cause unexpected behavior and
>>>>> confusion.  It's cleaner to provide a wrapper class (so both
>>>>> LongArrayList
>>>>> plus NullableLongArrayList) that explicitly defines the behavior, and
>>>>> costs
>>>>> a little more in performance.  If the user can't find a pre-made
>>>>> wrapper
>>>>> class, it's not very difficult for each user to provide their own
>>>>> interpretation of null and check for it themselves.
>>>>>
>>>>> If you reject nullability, the question becomes what to do in
>>>>> situations
>>>>> where you're implementing existing interfaces that accept nullable
>>>>> params.
>>>>>    The LongArrayList above implements List<Long> which requires an
>>>>> add(Long)
>>>>> method.  In the above implementation I chose to swap nulls with
>>>>> Long.MIN_VALUE, however I'm now thinking it best to force the user to
>>>>> make
>>>>> that swap and then throw IllegalArgumentException if they pass null.
>>>>>
>>>>>
>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <
>>>>> doug.meil@explorysmedical.com
>>>>>
>>>>>  wrote:
>>>>>> Hmmm... good question.
>>>>>>
>>>>>> I think that fixed width support is important for a great many rowkey
>>>>>> constructs cases, so I'd rather see something like losing MIN_VALUE
>>>>>> and
>>>>>> keeping fixed width.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
>>>>>>
>>>>>>   Heya,
>>>>>>
>>>>>>> Thinking about data types and serialization. I think null support is
>>>>>>> an
>>>>>>> important characteristic for the serialized representations,
>>>>>>> especially
>>>>>>> when considering the compound type. However, doing so in directly
>>>>>>> incompatible with fixed-width representations for numerics. For
>>>>>>>
>>>>>>>  instance,
>>>>>> if we want to have a fixed-width signed long stored on 8-bytes, where
>>>>>> do
>>>>>>
>>>>>>> you put null? float and double types can cheat a little by folding
>>>>>>> negative
>>>>>>> and positive NaN's into a single representation (this isn't strictly
>>>>>>> correct!), leaving a place to represent null. In the long example
>>>>>>> case,
>>>>>>> the
>>>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one.
>>>>>>> This
>>>>>>> will allocate an additional encoding which can be used for null. My
>>>>>>> experience working with scientific data, however, makes me wince at
>>>>>>> the
>>>>>>> idea.
>>>>>>>
>>>>>>> The variable-width encodings have it a little easier. There's already
>>>>>>> enough going on that it's simpler to make room.
>>>>>>>
>>>>>>> Remember, the final goal is to support order-preserving
>>>>>>> serialization.
>>>>>>> This
>>>>>>> imposes some limitations on our encoding strategies. For instance,
>>>>>>> it's
>>>>>>> not
>>>>>>> enough to simply encode null, it really needs to be encoded as 0x00
>>>>>>> so
>>>>>>>
>>>>>>>  as
>>>>>> to sort lexicographically earlier than any other value.
>>>>>>
>>>>>>> What do you think? Any ideas, experiences, etc?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Nick
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>

Re: HBase Types: Explicit Null Support

Posted by James Taylor <jt...@salesforce.com>.

On 04/01/2013 04:41 PM, Nick Dimiduk wrote:
> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jt...@salesforce.com> wrote:
>
>>  From the SQL perspective, handling null is important.
>
>  From your perspective, it is critical to support NULLs, even at the expense
> of fixed-width encodings at all or supporting representation of a full
> range of values. That is, you'd rather be able to represent NULL than -2^31?
We've been able to get away with supporting NULL through the absence of
the value rather than restricting the data range. We haven't had any
pushback on not allowing a fixed-width nullable leading row key column.
Since our variable-length DECIMAL supports null and is a superset of the
fixed-width numeric types, users have a reasonable alternative.

I'd rather not restrict the range of values, since it doesn't seem like 
this would be necessary.
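
One way to picture null as the absence of a value in a composite key is a
variable-length layout in which every field is followed by a 0x00 terminator
and a null field contributes no bytes of its own. The sketch below is an
assumption for illustration only and does not describe any existing
implementation; the class name CompositeKey and its methods are hypothetical.
It assumes field bytes never contain 0x00, and it cannot distinguish null
from an empty string.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical composite key of variable-length string fields. */
final class CompositeKey {
    private static final byte TERMINATOR = 0x00; // assumed field terminator

    /** Concatenates the fields, each followed by a terminator; null adds only the terminator. */
    static byte[] encode(String... parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String part : parts) {
            if (part != null) {
                byte[] bytes = part.getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    if (b == TERMINATOR) {
                        throw new IllegalArgumentException("field may not contain 0x00");
                    }
                }
                out.write(bytes, 0, bytes.length);
            }
            out.write(TERMINATOR);
        }
        return out.toByteArray();
    }

    /** Lexicographic comparison of unsigned bytes, the order used for row keys. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

Because the terminator is the smallest possible byte, a key with a null field
sorts before any key that has a value in the same position, and no value is
removed from the range of the field type.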
>
> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
>>> Thanks for the thoughtful response (and code!).
>>>
>>> I'm thinking I will press forward with a base implementation that does not
>>> support nulls. The idea is to provide an extensible set of interfaces, so
>>> I
>>> think this will not box us into a corner later. That is, a mirroring
>>> package could be implemented that supports null values and accepts
>>> the relevant trade-offs.
>>>
>>> Thanks,
>>> Nick
>>>
>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mc...@hotpads.com> wrote:
>>>
>>>   I spent some time this weekend extracting bits of our serialization code
>>>> to
>>>> a public github repo at http://github.com/hotpads/data-tools
>>>> .
>>>>    Contributions are welcome - i'm sure we all have this stuff laying
>>>> around.
>>>>
>>>> You can see I've bumped into the NULL problem in a few places:
>>>> *
>>>>
>>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
>>>> *
>>>>
>>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>>>>
>>>> Looking back, I think my latest opinion on the topic is to reject
>>>> nullability as the rule since it can cause unexpected behavior and
>>>> confusion.  It's cleaner to provide a wrapper class (so both
>>>> LongArrayList
>>>> plus NullableLongArrayList) that explicitly defines the behavior, and
>>>> costs
>>>> a little more in performance.  If the user can't find a pre-made wrapper
>>>> class, it's not very difficult for each user to provide their own
>>>> interpretation of null and check for it themselves.
>>>>
>>>> If you reject nullability, the question becomes what to do in situations
>>>> where you're implementing existing interfaces that accept nullable
>>>> params.
>>>>    The LongArrayList above implements List<Long> which requires an
>>>> add(Long)
>>>> method.  In the above implementation I chose to swap nulls with
>>>> Long.MIN_VALUE, however I'm now thinking it best to force the user to
>>>> make
>>>> that swap and then throw IllegalArgumentException if they pass null.
>>>>
>>>>
>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <doug.meil@explorysmedical.com> wrote:
>>>>
>>>>> Hmmm... good question.
>>>>>
>>>>> I think that fixed width support is important for a great many rowkey
>>>>> constructs cases, so I'd rather see something like losing MIN_VALUE and
>>>>> keeping fixed width.
>>>>>
>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
>>>>>
>>>>>> Heya,
>>>>>>
>>>>>> Thinking about data types and serialization. I think null support is an
>>>>>> important characteristic for the serialized representations, especially
>>>>>> when considering the compound type. However, doing so in directly
>>>>>> incompatible with fixed-width representations for numerics. For instance,
>>>>>> if we want to have a fixed-width signed long stored on 8-bytes, where do
>>>>>> you put null? float and double types can cheat a little by folding negative
>>>>>> and positive NaN's into a single representation (this isn't strictly
>>>>>> correct!), leaving a place to represent null. In the long example case, the
>>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This
>>>>>> will allocate an additional encoding which can be used for null. My
>>>>>> experience working with scientific data, however, makes me wince at the
>>>>>> idea.
>>>>>>
>>>>>> The variable-width encodings have it a little easier. There's already
>>>>>> enough going on that it's simpler to make room.
>>>>>>
>>>>>> Remember, the final goal is to support order-preserving serialization. This
>>>>>> imposes some limitations on our encoding strategies. For instance, it's not
>>>>>> enough to simply encode null, it really needs to be encoded as 0x00 so as
>>>>>> to sort lexicographically earlier than any other value.
>>>>>>
>>>>>> What do you think? Any ideas, experiences, etc?
>>>>>>
>>>>>> Thanks,
>>>>>> Nick

Re: HBase Types: Explicit Null Support

Posted by Nick Dimiduk <nd...@gmail.com>.

Furthermore, is it more important to support null values than to squeeze all
representations into their minimum size (4 bytes for int32, etc.)?
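
To make that trade-off concrete, here is a minimal sketch of a nullable int32 that gives up the 4-byte minimum and spends one extra byte on a null marker. It is an illustration only, not code from any existing HBase package; the class name NullableInt32, its methods, and the 0x00/0x01 marker values are all assumptions. Null is five zero bytes; a non-null value is 0x01 followed by the big-endian value with its sign bit flipped, so unsigned byte-wise comparison matches numeric order with null first, and the full int range survives.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: fixed-width (5-byte), order-preserving, nullable int32.
final class NullableInt32 {
    static final int WIDTH = 5;

    static byte[] encode(Integer v) {
        // ByteBuffer is big-endian by default and starts zero-filled,
        // so a null value is simply left as five 0x00 bytes.
        ByteBuffer buf = ByteBuffer.allocate(WIDTH);
        if (v != null) {
            buf.put((byte) 0x01);               // "value present" marker
            buf.putInt(v ^ Integer.MIN_VALUE);  // flip sign bit so negatives sort first
        }
        return buf.array();
    }

    static Integer decode(byte[] b) {
        if (b[0] == 0x00) {
            return null;
        }
        return ByteBuffer.wrap(b, 1, 4).getInt() ^ Integer.MIN_VALUE;
    }

    // Unsigned lexicographic comparison, the order a byte-sorted store would use.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }
}
```

The cost is 25% more bytes per int32; the benefit is that Integer.MIN_VALUE remains representable.
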
On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:

> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jt...@salesforce.com>wrote:
>
>> From the SQL perspective, handling null is important.
>
>
> From your perspective, it is critical to support NULLs, even at the
> expense of fixed-width encodings at all or supporting representation of a
> full range of values. That is, you'd rather be able to represent NULL than
> -2^31?
>
> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
>>
>>> Thanks for the thoughtful response (and code!).
>>>
>>> I'm thinking I will press forward with a base implementation that does
>>> not
>>> support nulls. The idea is to provide an extensible set of interfaces,
>>> so I
>>> think this will not box us into a corner later. That is, a mirroring
>>> package could be implemented that supports null values and accepts
>>> the relevant trade-offs.
>>>
>>> Thanks,
>>> Nick
>>>
>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mc...@hotpads.com>
>>> wrote:
>>>
>>>> I spent some time this weekend extracting bits of our serialization code to
>>>> a public github repo at http://github.com/hotpads/data-tools.
>>>> Contributions are welcome - i'm sure we all have this stuff laying around.
>>>>
>>>> You can see I've bumped into the NULL problem in a few places:
>>>> *
>>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
>>>> *
>>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>>>>
>>>> Looking back, I think my latest opinion on the topic is to reject
>>>> nullability as the rule since it can cause unexpected behavior and
>>>> confusion.  It's cleaner to provide a wrapper class (so both
>>>> LongArrayList
>>>> plus NullableLongArrayList) that explicitly defines the behavior, and
>>>> costs
>>>> a little more in performance.  If the user can't find a pre-made wrapper
>>>> class, it's not very difficult for each user to provide their own
>>>> interpretation of null and check for it themselves.
>>>>
>>>> If you reject nullability, the question becomes what to do in situations
>>>> where you're implementing existing interfaces that accept nullable
>>>> params.
>>>>   The LongArrayList above implements List<Long> which requires an
>>>> add(Long)
>>>> method.  In the above implementation I chose to swap nulls with
>>>> Long.MIN_VALUE, however I'm now thinking it best to force the user to
>>>> make
>>>> that swap and then throw IllegalArgumentException if they pass null.
>>>>
>>>>
>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <doug.meil@explorysmedical.com> wrote:
>>>>
>>>>> Hmmm... good question.
>>>>>
>>>>> I think that fixed width support is important for a great many rowkey
>>>>> constructs cases, so I'd rather see something like losing MIN_VALUE and
>>>>> keeping fixed width.
>>>>>
>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
>>>>>
>>>>>> Heya,
>>>>>>
>>>>>> Thinking about data types and serialization. I think null support is an
>>>>>> important characteristic for the serialized representations, especially
>>>>>> when considering the compound type. However, doing so in directly
>>>>>> incompatible with fixed-width representations for numerics. For instance,
>>>>>> if we want to have a fixed-width signed long stored on 8-bytes, where do
>>>>>> you put null? float and double types can cheat a little by folding negative
>>>>>> and positive NaN's into a single representation (this isn't strictly
>>>>>> correct!), leaving a place to represent null. In the long example case, the
>>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This
>>>>>> will allocate an additional encoding which can be used for null. My
>>>>>> experience working with scientific data, however, makes me wince at the
>>>>>> idea.
>>>>>>
>>>>>> The variable-width encodings have it a little easier. There's already
>>>>>> enough going on that it's simpler to make room.
>>>>>>
>>>>>> Remember, the final goal is to support order-preserving serialization. This
>>>>>> imposes some limitations on our encoding strategies. For instance, it's not
>>>>>> enough to simply encode null, it really needs to be encoded as 0x00 so as
>>>>>> to sort lexicographically earlier than any other value.
>>>>>>
>>>>>> What do you think? Any ideas, experiences, etc?
>>>>>>
>>>>>> Thanks,
>>>>>> Nick

Re: HBase Types: Explicit Null Support

Posted by Nick Dimiduk <nd...@gmail.com>.

On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jt...@salesforce.com> wrote:

> From the SQL perspective, handling null is important.


From your perspective, it is critical to support NULLs, even at the expense
of fixed-width encodings altogether or of representing the full range of
values. That is, you'd rather be able to represent NULL than -2^31?
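
For comparison, the alternative being asked about can be sketched as follows: keep the encoding at exactly 4 bytes and give up -2^31 so that its slot can hold NULL. This is an illustrative sketch under stated assumptions, not an existing API; the class name ReservedMinInt32 and the choice to throw on Integer.MIN_VALUE are hypothetical. With the sign bit flipped, Integer.MIN_VALUE would encode as four zero bytes, which is exactly the encoding that must sort first, so that pattern is reserved for null.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: 4-byte, order-preserving int32 where null takes the
// slot that Integer.MIN_VALUE (-2^31) would otherwise occupy.
final class ReservedMinInt32 {

    static byte[] encode(Integer v) {
        if (v == null) {
            return new byte[4]; // 0x00000000 sorts before every other encoding
        }
        if (v == Integer.MIN_VALUE) {
            // The price of nullability at fixed width: one value is lost.
            throw new IllegalArgumentException("-2^31 is reserved for null");
        }
        // Flip the sign bit so unsigned byte order matches numeric order.
        return ByteBuffer.allocate(4).putInt(v ^ Integer.MIN_VALUE).array();
    }

    static Integer decode(byte[] b) {
        int raw = ByteBuffer.wrap(b).getInt();
        if (raw == 0) {
            return null;
        }
        return raw ^ Integer.MIN_VALUE;
    }
}
```
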

On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
>
>> Thanks for the thoughtful response (and code!).
>>
>> I'm thinking I will press forward with a base implementation that does not
>> support nulls. The idea is to provide an extensible set of interfaces, so
>> I
>> think this will not box us into a corner later. That is, a mirroring
>> package could be implemented that supports null values and accepts
>> the relevant trade-offs.
>>
>> Thanks,
>> Nick
>>
>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mc...@hotpads.com> wrote:
>>
>>> I spent some time this weekend extracting bits of our serialization code to
>>> a public github repo at http://github.com/hotpads/data-tools.
>>> Contributions are welcome - i'm sure we all have this stuff laying around.
>>>
>>> You can see I've bumped into the NULL problem in a few places:
>>> *
>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
>>> *
>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>>>
>>> Looking back, I think my latest opinion on the topic is to reject
>>> nullability as the rule since it can cause unexpected behavior and
>>> confusion.  It's cleaner to provide a wrapper class (so both
>>> LongArrayList
>>> plus NullableLongArrayList) that explicitly defines the behavior, and
>>> costs
>>> a little more in performance.  If the user can't find a pre-made wrapper
>>> class, it's not very difficult for each user to provide their own
>>> interpretation of null and check for it themselves.
>>>
>>> If you reject nullability, the question becomes what to do in situations
>>> where you're implementing existing interfaces that accept nullable
>>> params.
>>>   The LongArrayList above implements List<Long> which requires an
>>> add(Long)
>>> method.  In the above implementation I chose to swap nulls with
>>> Long.MIN_VALUE, however I'm now thinking it best to force the user to
>>> make
>>> that swap and then throw IllegalArgumentException if they pass null.
>>>
>>>
>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <doug.meil@explorysmedical.com> wrote:
>>>
>>>> Hmmm... good question.
>>>>
>>>> I think that fixed width support is important for a great many rowkey
>>>> constructs cases, so I'd rather see something like losing MIN_VALUE and
>>>> keeping fixed width.
>>>>
>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
>>>>
>>>>> Heya,
>>>>>
>>>>> Thinking about data types and serialization. I think null support is an
>>>>> important characteristic for the serialized representations, especially
>>>>> when considering the compound type. However, doing so in directly
>>>>> incompatible with fixed-width representations for numerics. For instance,
>>>>> if we want to have a fixed-width signed long stored on 8-bytes, where do
>>>>> you put null? float and double types can cheat a little by folding negative
>>>>> and positive NaN's into a single representation (this isn't strictly
>>>>> correct!), leaving a place to represent null. In the long example case, the
>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This
>>>>> will allocate an additional encoding which can be used for null. My
>>>>> experience working with scientific data, however, makes me wince at the
>>>>> idea.
>>>>>
>>>>> The variable-width encodings have it a little easier. There's already
>>>>> enough going on that it's simpler to make room.
>>>>>
>>>>> Remember, the final goal is to support order-preserving serialization. This
>>>>> imposes some limitations on our encoding strategies. For instance, it's not
>>>>> enough to simply encode null, it really needs to be encoded as 0x00 so as
>>>>> to sort lexicographically earlier than any other value.
>>>>>
>>>>> What do you think? Any ideas, experiences, etc?
>>>>>
>>>>> Thanks,
>>>>> Nick

Re: HBase Types: Explicit Null Support

Posted by James Taylor <jt...@salesforce.com>.

From the SQL perspective, handling null is important. Phoenix supports
null in the following ways:
- the absence of a key value
- an empty value in a key value
- an empty value in a multi-part row key
   - for variable-length types (VARCHAR and DECIMAL), a null byte
     separator would be used if not the last column
   - for fixed-width types, only the last column is allowed to be null

As you mentioned, it's important to maintain the lexicographical sort 
order with nulls being first.
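
The multi-part row key case for variable-length columns can be sketched roughly as below. This is not Phoenix's actual implementation; the class VarLengthRowKey and its methods are hypothetical, and the sketch assumes string columns whose bytes never contain 0x00 and treats an empty value and null as the same thing. Each column except the last is followed by a 0x00 separator, and a null column contributes no bytes, so a key whose leading column is null starts with 0x00 and sorts before any key with a non-empty leading column.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a multi-part row key made of variable-length columns.
// Assumption: column bytes never contain 0x00, and empty is treated as null.
final class VarLengthRowKey {

    static byte[] encode(String... fields) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < fields.length; i++) {
            if (fields[i] != null) {
                byte[] b = fields[i].getBytes(StandardCharsets.UTF_8);
                out.write(b, 0, b.length);
            }
            if (i < fields.length - 1) {
                out.write(0x00); // separator after every column except the last
            }
        }
        return out.toByteArray();
    }

    static List<String> decode(byte[] key, int fieldCount) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < fieldCount; i++) {
            int end = start;
            if (i == fieldCount - 1) {
                end = key.length; // last column runs to the end of the key
            } else {
                while (end < key.length && key[end] != 0x00) {
                    end++;
                }
            }
            // Zero bytes between separators means the column was null.
            fields.add(end <= start
                    ? null
                    : new String(key, start, end - start, StandardCharsets.UTF_8));
            start = end + 1; // skip the separator
        }
        return fields;
    }
}
```

A fixed-width column cannot use this trick in the middle of the key, since an absent value would shift every following column; that matches the rule above that only the last fixed-width column may be null.
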

On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
> Thanks for the thoughtful response (and code!).
>
> I'm thinking I will press forward with a base implementation that does not
> support nulls. The idea is to provide an extensible set of interfaces, so I
> think this will not box us into a corner later. That is, a mirroring
> package could be implemented that supports null values and accepts
> the relevant trade-offs.
>
> Thanks,
> Nick
>
> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mc...@hotpads.com> wrote:
>
>> I spent some time this weekend extracting bits of our serialization code to
>> a public github repo at http://github.com/hotpads/data-tools.
>>   Contributions are welcome - i'm sure we all have this stuff laying around.
>>
>> You can see I've bumped into the NULL problem in a few places:
>> *
>>
>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
>> *
>>
>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>>
>> Looking back, I think my latest opinion on the topic is to reject
>> nullability as the rule since it can cause unexpected behavior and
>> confusion.  It's cleaner to provide a wrapper class (so both LongArrayList
>> plus NullableLongArrayList) that explicitly defines the behavior, and costs
>> a little more in performance.  If the user can't find a pre-made wrapper
>> class, it's not very difficult for each user to provide their own
>> interpretation of null and check for it themselves.
>>
>> If you reject nullability, the question becomes what to do in situations
>> where you're implementing existing interfaces that accept nullable params.
>>   The LongArrayList above implements List<Long> which requires an add(Long)
>> method.  In the above implementation I chose to swap nulls with
>> Long.MIN_VALUE, however I'm now thinking it best to force the user to make
>> that swap and then throw IllegalArgumentException if they pass null.
>>
>>
>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <doug.meil@explorysmedical.com> wrote:
>>> Hmmm... good question.
>>>
>>> I think that fixed width support is important for a great many rowkey
>>> constructs cases, so I'd rather see something like losing MIN_VALUE and
>>> keeping fixed width.
>>>
>>>
>>>
>>>
>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
>>>
>>>> Heya,
>>>>
>>>> Thinking about data types and serialization. I think null support is an
>>>> important characteristic for the serialized representations, especially
>>>> when considering the compound type. However, doing so in directly
>>>> incompatible with fixed-width representations for numerics. For instance,
>>>> if we want to have a fixed-width signed long stored on 8-bytes, where do
>>>> you put null? float and double types can cheat a little by folding
>>>> negative
>>>> and positive NaN's into a single representation (this isn't strictly
>>>> correct!), leaving a place to represent null. In the long example case,
>>>> the
>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This
>>>> will allocate an additional encoding which can be used for null. My
>>>> experience working with scientific data, however, makes me wince at the
>>>> idea.
>>>>
>>>> The variable-width encodings have it a little easier. There's already
>>>> enough going on that it's simpler to make room.
>>>>
>>>> Remember, the final goal is to support order-preserving serialization.
>>>> This
>>>> imposes some limitations on our encoding strategies. For instance, it's
>>>> not
>>>> enough to simply encode null, it really needs to be encoded as 0x00 so as
>>>> to sort lexicographically earlier than any other value.
>>>>
>>>> What do you think? Any ideas, experiences, etc?
>>>>
>>>> Thanks,
>>>> Nick
>>>
>>>
>>>

Re: HBase Types: Explicit Null Support

Posted by Nick Dimiduk <nd...@gmail.com>.

Thanks for the thoughtful response (and code!).

I'm thinking I will press forward with a base implementation that does not
support nulls. The idea is to provide an extensible set of interfaces, so I
think this will not box us into a corner later. That is, a mirroring
package could be implemented that supports null values and accepts
the relevant trade-offs.
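
As a rough illustration of that layering (a sketch only; Codec, LongCodec,
and NullableCodec are hypothetical names, not existing HBase interfaces), a
base codec can reject null while a mirroring wrapper adds a one-byte header,
0x00 for null and 0x01 for a present value. Null then still sorts first, and
the trade-off accepted is the loss of the fixed 8-byte width:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class NullableCodecSketch {
  /** Hypothetical base contract: implementations do not accept null. */
  interface Codec<T> {
    byte[] encode(T value);
    T decode(byte[] bytes);
  }

  /** Fixed 8-byte long codec; flipping the sign bit makes byte order match numeric order. */
  static final class LongCodec implements Codec<Long> {
    @Override
    public byte[] encode(Long value) {
      if (value == null) {
        throw new IllegalArgumentException("null is not supported by the base codec");
      }
      // ByteBuffer writes big-endian by default, which is what byte-wise sorting needs.
      return ByteBuffer.allocate(8).putLong(value ^ Long.MIN_VALUE).array();
    }

    @Override
    public Long decode(byte[] bytes) {
      return ByteBuffer.wrap(bytes).getLong() ^ Long.MIN_VALUE;
    }
  }

  /** Mirror wrapper: one header byte, 0x00 for null, 0x01 followed by the base encoding. */
  static final class NullableCodec<T> implements Codec<T> {
    private final Codec<T> base;

    NullableCodec(Codec<T> base) {
      this.base = base;
    }

    @Override
    public byte[] encode(T value) {
      if (value == null) {
        return new byte[] {0x00};
      }
      byte[] body = base.encode(value);
      byte[] out = new byte[body.length + 1];
      out[0] = 0x01;
      System.arraycopy(body, 0, out, 1, body.length);
      return out;
    }

    @Override
    public T decode(byte[] bytes) {
      if (bytes[0] == 0x00) {
        return null;
      }
      return base.decode(Arrays.copyOfRange(bytes, 1, bytes.length));
    }
  }
}
```

Callers who need nulls would wrap explicitly, e.g.
new NullableCodec<Long>(new LongCodec()), and everyone else keeps the plain
8-byte form.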

Thanks,
Nick

On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mc...@hotpads.com> wrote:

> I spent some time this weekend extracting bits of our serialization code to
> a public GitHub repo at http://github.com/hotpads/data-tools.
> Contributions are welcome - I'm sure we all have this stuff lying around.
>
> You can see I've bumped into the NULL problem in a few places:
> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>
> Looking back, I think my latest opinion on the topic is to reject
> nullability as the rule since it can cause unexpected behavior and
> confusion.  It's cleaner to provide a wrapper class (so both LongArrayList
> plus NullableLongArrayList) that explicitly defines the behavior, and costs
> a little more in performance.  If the user can't find a pre-made wrapper
> class, it's not very difficult for each user to provide their own
> interpretation of null and check for it themselves.
>
> If you reject nullability, the question becomes what to do in situations
> where you're implementing existing interfaces that accept nullable params.
> The LongArrayList above implements List<Long>, which requires an add(Long)
> method.  In the above implementation I chose to swap nulls with
> Long.MIN_VALUE; however, I'm now thinking it best to force the user to make
> that swap and to throw IllegalArgumentException if they pass null.
>
>
> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <doug.meil@explorysmedical.com> wrote:
>
> > Hmmm... good question.
> >
> > I think that fixed width support is important for a great many rowkey
> > constructs, so I'd rather see something like losing MIN_VALUE and
> > keeping fixed width.
> >
> >
> >
> >
> > On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
> >
> > >Heya,
> > >
> > >Thinking about data types and serialization. I think null support is an
> > >important characteristic for the serialized representations, especially
> > >when considering the compound type. However, doing so is directly
> > >incompatible with fixed-width representations for numerics. For instance,
> > >if we want to have a fixed-width signed long stored in 8 bytes, where do
> > >you put null? float and double types can cheat a little by folding negative
> > >and positive NaNs into a single representation (this isn't strictly
> > >correct!), leaving a place to represent null. In the long example case, the
> > >obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This
> > >will free up an additional encoding which can be used for null. My
> > >experience working with scientific data, however, makes me wince at the
> > >idea.
> > >
> > >The variable-width encodings have it a little easier. There's already
> > >enough going on that it's simpler to make room.
> > >
> > >Remember, the final goal is to support order-preserving serialization. This
> > >imposes some limitations on our encoding strategies. For instance, it's not
> > >enough to simply encode null; it really needs to be encoded as 0x00 so as
> > >to sort lexicographically earlier than any other value.
> > >
> > >What do you think? Any ideas, experiences, etc?
> > >
> > >Thanks,
> > >Nick
> >
> >
> >
> >
>

Re: HBase Types: Explicit Null Support

Posted by Matt Corgan <mc...@hotpads.com>.

I spent some time this weekend extracting bits of our serialization code to
a public GitHub repo at http://github.com/hotpads/data-tools.
Contributions are welcome - I'm sure we all have this stuff lying around.

You can see I've bumped into the NULL problem in a few places:
* https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
* https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java

Looking back, I think my latest opinion on the topic is to reject
nullability as the rule since it can cause unexpected behavior and
confusion.  It's cleaner to provide a wrapper class (so both LongArrayList
plus NullableLongArrayList) that explicitly defines the behavior, and costs
a little more in performance.  If the user can't find a pre-made wrapper
class, it's not very difficult for each user to provide their own
interpretation of null and check for it themselves.

If you reject nullability, the question becomes what to do in situations
where you're implementing existing interfaces that accept nullable params.
The LongArrayList above implements List<Long>, which requires an add(Long)
method.  In the above implementation I chose to swap nulls with
Long.MIN_VALUE; however, I'm now thinking it best to force the user to make
that swap and to throw IllegalArgumentException if they pass null.
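
To make the stricter behavior concrete, here is a minimal sketch of a
primitive-backed list whose add(Long) rejects null instead of substituting
Long.MIN_VALUE. StrictLongList is a hypothetical name for illustration; it
is not the LongArrayList in the data-tools repo.

```java
import java.util.AbstractList;
import java.util.Arrays;

/** Hypothetical sketch: a List<Long> backed by long[] that refuses null. */
final class StrictLongList extends AbstractList<Long> {
  private long[] values = new long[8];
  private int size;

  @Override
  public boolean add(Long value) {
    if (value == null) {
      // The caller must choose their own stand-in for null; we do not guess.
      throw new IllegalArgumentException("null is not supported by this list");
    }
    if (size == values.length) {
      values = Arrays.copyOf(values, size * 2);
    }
    values[size++] = value;
    modCount++; // keeps AbstractList's fail-fast iterators accurate
    return true;
  }

  @Override
  public Long get(int index) {
    if (index < 0 || index >= size) {
      throw new IndexOutOfBoundsException("index " + index + ", size " + size);
    }
    return values[index];
  }

  @Override
  public int size() {
    return size;
  }
}
```

A caller who wants the old behavior writes
list.add(value == null ? Long.MIN_VALUE : value) themselves, which keeps the
choice of sentinel visible at the call site.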

On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <do...@explorysmedical.com> wrote:

> Hmmm... good question.
>
> I think that fixed width support is important for a great many rowkey
> constructs, so I'd rather see something like losing MIN_VALUE and
> keeping fixed width.
>
>
>
>
> On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:
>
> >Heya,
> >
> >Thinking about data types and serialization. I think null support is an
> >important characteristic for the serialized representations, especially
> >when considering the compound type. However, doing so is directly
> >incompatible with fixed-width representations for numerics. For instance,
> >if we want to have a fixed-width signed long stored in 8 bytes, where do
> >you put null? float and double types can cheat a little by folding negative
> >and positive NaNs into a single representation (this isn't strictly
> >correct!), leaving a place to represent null. In the long example case, the
> >obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This
> >will free up an additional encoding which can be used for null. My
> >experience working with scientific data, however, makes me wince at the
> >idea.
> >
> >The variable-width encodings have it a little easier. There's already
> >enough going on that it's simpler to make room.
> >
> >Remember, the final goal is to support order-preserving serialization. This
> >imposes some limitations on our encoding strategies. For instance, it's not
> >enough to simply encode null; it really needs to be encoded as 0x00 so as
> >to sort lexicographically earlier than any other value.
> >
> >What do you think? Any ideas, experiences, etc?
> >
> >Thanks,
> >Nick
>
>
>
>

Re: HBase Types: Explicit Null Support

Posted by Doug Meil <do...@explorysmedical.com>.

Hmmm... good question.

I think that fixed width support is important for a great many rowkey
constructs, so I'd rather see something like losing MIN_VALUE and
keeping fixed width.
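
A minimal sketch of that trade, assuming a big-endian encoding with the sign
bit flipped so that unsigned byte order matches numeric order
(FixedWidthNullableLong and its methods are hypothetical names). Under that
assumption Long.MIN_VALUE would encode to eight 0x00 bytes, so giving it up
hands null exactly the encoding that sorts before every other value:

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: 8-byte order-preserving long with MIN_VALUE given up for null. */
final class FixedWidthNullableLong {
  static final int WIDTH = 8;

  static byte[] encode(Long value) {
    if (value == null) {
      return new byte[WIDTH]; // all 0x00, sorts before every real value
    }
    if (value == Long.MIN_VALUE) {
      throw new IllegalArgumentException("Long.MIN_VALUE is reserved to represent null");
    }
    // Flip the sign bit so negatives sort before positives byte-wise;
    // ByteBuffer writes big-endian by default.
    return ByteBuffer.allocate(WIDTH).putLong(value ^ Long.MIN_VALUE).array();
  }

  static Long decode(byte[] bytes) {
    long raw = ByteBuffer.wrap(bytes).getLong();
    if (raw == 0L) {
      return null;
    }
    return raw ^ Long.MIN_VALUE;
  }
}
```

Every value stays at 8 bytes, which keeps compound rowkeys easy to slice at
fixed offsets; the cost is that one legal long can no longer be stored.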




On 4/1/13 2:00 PM, "Nick Dimiduk" <nd...@gmail.com> wrote:

>Heya,
>
>Thinking about data types and serialization. I think null support is an
>important characteristic for the serialized representations, especially
>when considering the compound type. However, doing so is directly
>incompatible with fixed-width representations for numerics. For instance,
>if we want to have a fixed-width signed long stored in 8 bytes, where do
>you put null? float and double types can cheat a little by folding negative
>and positive NaNs into a single representation (this isn't strictly
>correct!), leaving a place to represent null. In the long example case, the
>obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This
>will free up an additional encoding which can be used for null. My
>experience working with scientific data, however, makes me wince at the
>idea.
>
>The variable-width encodings have it a little easier. There's already
>enough going on that it's simpler to make room.
>
>Remember, the final goal is to support order-preserving serialization. This
>imposes some limitations on our encoding strategies. For instance, it's not
>enough to simply encode null; it really needs to be encoded as 0x00 so as
>to sort lexicographically earlier than any other value.
>
>What do you think? Any ideas, experiences, etc?
>
>Thanks,
>Nick
