Posted to user@hbase.apache.org by TuX RaceR <tu...@gmail.com> on 2010/03/12 20:48:17 UTC

worth choosing the shortest possible column names/keys?

Hello HBase Users List,

In the SQL world, you can choose column names that clearly describe a
field (i.e. long names).
I believe it is different in HBase.
Is it worth choosing the shortest possible column names and keys, e.g.:

c1234:fn:John,ln:Doe

instead of

customer_1234:FirstName:John,LastName:Doe

?
Will I save a lot of space (especially if I have many small columns)?

Thanks
TuX

Re: worth choosing the shortest possible column names/keys?

Posted by Tim Robertson <ti...@gmail.com>.

Thanks Stack and others from me also.


On Sun, Mar 14, 2010 at 10:22 AM, TuX RaceR <tu...@gmail.com> wrote:

> Thank you guys for your answers. I'll map descriptive names to short names
> too ;)
> Cheers
> TuX
>
>
> Lars Francke wrote:
>
>> Will I save a lot of space (especially if I have many small columns)?
>>>
>>>
>>
>> I don't have any hard numbers for you but I tested it and I remember
>> that on a dataset of about 10-20 GB I could save about 200-500 MB
>> (this was with compression enabled) just by not using descriptive
>> string qualifiers that weren't data themselves. A lot of small columns
>> for me too (mostly counters). I use a simple mapping of short byte
>> arrays to strings so that it is still very easy to use in the client.
>>
>> I asked that very same question a few months ago on IRC but I think
>> nobody answered so I'd be interested in what others do as well.
>>
>> Cheers,
>> Lars
>>
>>
>
>

Re: worth choosing the shortest possible column names/keys?

Posted by TuX RaceR <tu...@gmail.com>.

Thank you guys for your answers. I'll map descriptive names to short
names too ;)
Cheers
TuX
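
A minimal sketch of the mapping idea described in Lars's reply, assuming a
client-side enum. The name CustomerColumn and the short codes "fn"/"ln" are
illustrative (taken from the example in the first message), not part of any
HBase API. Application code uses the descriptive constant; only the short
byte array would be used as the column qualifier.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical client-side mapping: descriptive names in code, short qualifiers in storage.
enum CustomerColumn {
    FIRST_NAME("fn"),
    LAST_NAME("ln");

    // Reverse index from short code to constant, built once when the enum loads.
    private static final Map<String, CustomerColumn> BY_CODE = new HashMap<>();
    static {
        for (CustomerColumn c : values()) {
            BY_CODE.put(c.code, c);
        }
    }

    private final String code;

    CustomerColumn(String code) {
        this.code = code;
    }

    // Bytes to use as the column qualifier when writing or reading a cell.
    byte[] qualifier() {
        return code.getBytes(StandardCharsets.UTF_8);
    }

    // Reverse lookup for qualifiers read back from the table; null if unknown.
    static CustomerColumn fromQualifier(byte[] qualifier) {
        return BY_CODE.get(new String(qualifier, StandardCharsets.UTF_8));
    }
}
```

Since the qualifier is stored with every cell, "fn" (2 bytes) instead of
"FirstName" (9 bytes) saves a few bytes per stored value before compression,
which is where savings like the ones Lars reports would come from.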

Lars Francke wrote:
>> Will I save a lot of space (especially if I have many small columns)?
>>     
>
> I don't have any hard numbers for you but I tested it and I remember
> that on a dataset of about 10-20 GB I could save about 200-500 MB
> (this was with compression enabled) just by not using descriptive
> string qualifiers that weren't data themselves. A lot of small columns
> for me too (mostly counters). I use a simple mapping of short byte
> arrays to strings so that it is still very easy to use in the client.
>
> I asked that very same question a few months ago on IRC but I think
> nobody answered so I'd be interested in what others do as well.
>
> Cheers,
> Lars
>

Re: UUID as key wuz: RE: worth choosing the shortest possible column names/keys?

Posted by Ryan Rawson <ry...@gmail.com>.

Everything you say is totally true.

One last comment: if your update rate is lowish, and the IDs might
have some meaning, you might be better served by a counter, e.g.
user IDs (max value = 6 billion ;-)), or something else that might end up
needing to be semi-human-readable.

-ryan

On Mon, Mar 15, 2010 at 4:11 PM, Michael Segel
<mi...@hotmail.com> wrote:
>
>
>
>> Date: Mon, 15 Mar 2010 08:15:10 +0100
>> Subject: Re: UUID as key wuz: RE: worth choosing the shortest possible column names/keys?
>> From: timrobertson100@gmail.com
>> To: hbase-user@hadoop.apache.org
>
>>
>> Sure, understood.  UUID aims to be globally unique, whereas I am only
>> looking for in cluster uniqueness across a couple billion items, but an
>> algorithm that allows ID minting by machines in parallel.
>>
> And if you use a serial counter, you have a single counter and a single point of failure, or a point of contention.
> If you're running a Hadoop/MapReduce job and each node inserts into HBase as it runs, then you have to coordinate counter access.
>
> Using UUID, you don't have that problem. Of course, you also don't get the sequence that you would have with a counter.
>
>

RE: UUID as key wuz: RE: worth choosing the shortest possible column names/keys?

Posted by Michael Segel <mi...@hotmail.com>.



> Date: Mon, 15 Mar 2010 08:15:10 +0100
> Subject: Re: UUID as key wuz: RE: worth choosing the shortest possible column names/keys?
> From: timrobertson100@gmail.com
> To: hbase-user@hadoop.apache.org

> 
> Sure, understood.  UUID aims to be globally unique, whereas I am only
> looking for in cluster uniqueness across a couple billion items, but an
> algorithm that allows ID minting by machines in parallel.
> 
And if you use a serial counter, you have a single counter and a single point of failure, or a point of contention.
If you're running a Hadoop/MapReduce job and each node inserts into HBase as it runs, then you have to coordinate counter access.

Using UUID, you don't have that problem. Of course, you also don't get the sequence that you would have with a counter.
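
One way to soften the counter contention described above, sketched under
assumptions that are not from this thread: each worker reserves a whole block
of IDs with a single increment of the shared counter and then hands them out
locally. Here an AtomicLong stands in for the shared counter; the class name
BlockIdAllocator and the block size are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: reserves IDs in blocks so workers rarely touch the shared counter.
final class BlockIdAllocator {
    private final AtomicLong sharedCounter; // stand-in for the shared counter cell
    private final long blockSize;
    private long next;  // next ID to hand out from the current block
    private long limit; // exclusive end of the current block

    BlockIdAllocator(AtomicLong sharedCounter, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.sharedCounter = sharedCounter;
        this.blockSize = blockSize;
    }

    synchronized long nextId() {
        if (next == limit) {
            // One atomic increment reserves [limit - blockSize, limit) for this worker.
            limit = sharedCounter.addAndGet(blockSize);
            next = limit - blockSize;
        }
        return next++;
    }
}
```

The trade-off is that IDs stay unique but are no longer strictly ordered
across workers, and a worker that dies mid-block leaves a gap in the sequence.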

 		 	   		  

Re: UUID as key wuz: RE: worth choosing the shortest possible column names/keys?

Posted by tsuna <ts...@gmail.com>.

On Mon, Mar 15, 2010 at 12:21 AM, Tim Robertson
<ti...@gmail.com> wrote:
> How do you use incrementColumnValue
> To generate a row key please?

You need a "special" row to act as a counter.  This row will typically
contain only a single cell, which stores the counter.  I like to use
the row key { 0 } (a byte array made of a single zeroed byte) for this
special row.  Then you just do an incrementColumnValue on that cell,
convert the long you get back to a byte array, and there you go,
you've got your row key.  The counter row is only special in the
sense that it doesn't store actual data, only the small piece of
metadata that is the counter.
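
The "long to byte array" step can be sketched as follows. This is a
stand-alone illustration, not HBase client code: an AtomicLong stands in for
the counter cell that incrementColumnValue would update, and the class name
CounterRowKeys is made up for the example.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: turns successive counter values into fixed-width row keys.
final class CounterRowKeys {
    // Stand-in for the counter cell; in HBase the increment would happen server-side.
    private final AtomicLong counter = new AtomicLong();

    // Atomically takes the next counter value and encodes it as an 8-byte big-endian key.
    byte[] nextRowKey() {
        long id = counter.incrementAndGet();
        return ByteBuffer.allocate(Long.BYTES).putLong(id).array();
    }

    // Decodes a key produced by nextRowKey() back into its counter value.
    static long toLong(byte[] rowKey) {
        return ByteBuffer.wrap(rowKey).getLong();
    }
}
```

The fixed 8-byte big-endian encoding keeps keys the same length, so for
non-negative counter values byte order matches numeric order.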

Beware that this means all new rows are always appended at the end
of the same region.  If your workload's performance depends on
your ability to create a large number of rows per second, then this
technique may prove inefficient, as you may create a hot spot on the
one region that is being written to.
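
To see why sequential keys always land at the end, here is a small
self-contained check. It assumes keys are 8-byte big-endian longs compared as
unsigned bytes, which is how row keys are ordered; the class and method names
are illustrative.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative check that counter-derived keys are always added at the end of the key space.
final class SequentialKeyOrder {
    static byte[] encode(long id) {
        return ByteBuffer.allocate(Long.BYTES).putLong(id).array();
    }

    // True if every key sorts strictly after the previous one under unsigned
    // lexicographic byte comparison.
    static boolean alwaysAppends(long firstId, int count) {
        byte[] previous = encode(firstId);
        for (int i = 1; i < count; i++) {
            byte[] current = encode(firstId + i);
            if (Arrays.compareUnsigned(previous, current) >= 0) {
                return false;
            }
            previous = current;
        }
        return true;
    }
}
```

For non-negative IDs the check holds, so every insert targets whichever
region currently holds the highest keys.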

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

Re: UUID as key wuz: RE: worth choosing the shortest possible column names/keys?

Posted by Ryan Rawson <ry...@gmail.com>.

You certainly can use this call to generate a row id - it works just
like a sequence object (from Oracle/SQL land).

I think some people around are using it to generate row ids. The code
should ensure that every number is unique and monotonically
increasing.

-ryan

On Mon, Mar 15, 2010 at 1:21 AM, Tim Robertson
<ti...@gmail.com> wrote:
> Thanks Ryan, sounds ideal
>
> How do you use
> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/client/HTable.html#incrementColumnValue(byte[],%20byte[],%20byte[],%20long)
>
> To generate a row key please?
>
> Thanks
> Tim
>
>
>
>
>
> On Mon, Mar 15, 2010 at 9:12 AM, Ryan Rawson <ry...@gmail.com> wrote:
>
>> You can use incrementColumnValue to generate sequential numbers.  The
>> call is atomic and fast.  It supports thousands of calls/second in my
>> testing.
>>
>> -ryan
>>
>> On Mon, Mar 15, 2010 at 12:15 AM, Tim Robertson
>> <ti...@gmail.com> wrote:
>> >>
>> >> Maybe I'm missing something but the UUID is an artificial key, its used
>> to
>> >> guarantee uniqueness and in this case you're using it as part of a
>> key,value
>> >> pair.
>> >>
>> >
>> > Sure, understood.  UUID aims to be globally unique, whereas I am only
>> > looking for in cluster uniqueness across a couple billion items, but an
>> > algorithm that allows ID minting by machines in parallel.
>> >
>> >
>> >> So why are you storing it in a Lucene index as the value?
>> >>
>> >
>> > Because I have various search indexes to the row using combinations of
>> > fields from the row.  I want the whole row accessible in the search
>> results,
>> > so I store the row key only (the row content is way to big for Lucene).
>> >  Lucene handles the search providing the Keys, and then the rows are
>> pulled
>> > and transformed while streaming out in the results.
>> >
>> >
>> >> Look, the benefits of using the UUID definitely outweigh wrapping your
>> own
>> >> solution in 8bytes, even in memory caches.
>> >> (Are you only storing values that are 16 bytes in length, or something
>> much
>> >> larger?)
>> >
>> >
>> > The values are much much larger (100s - 1000s bytes) but they aren't
>> going
>> > in to any in-memory structures.
>> >
>> >
>> >
>> >> > Date: Sun, 14 Mar 2010 19:09:48 +0100
>> >> > Subject: Re: UUID as key wuz: RE: worth choosing the shortest possible
>> >> column         names/keys?
>> >> > From: timrobertson100@gmail.com
>> >> > To: hbase-user@hadoop.apache.org
>> >> >
>> >> > Well I could well be wrong, but my understanding is that there are
>> memory
>> >> > mapped index files using the key, so key choice would come in to play
>> for
>> >> > memory requirements here.  For secondary indexes, it has to be a
>> factor
>> >> for
>> >> > memory requirements- halving the size of the data you need to get in
>> >> memory
>> >> > must be a good thing.  I am also building Lucene indexes storing only
>> >> this
>> >> > key, so it influences their size a fair amount too.
>> >> >
>> >> > I know for sure Mysql (Myisam) btree index size is greatly affected by
>> >> the
>> >> > size of the Numeric types.  They are more complicated that my
>> >> understanding
>> >> > of HBase indexing, but the same principles apply (if it ain't in
>> memory
>> >> then
>> >> > you're into disk seeking).
>> >> >
>> >> >
>> >> >
>> >> > On Sun, Mar 14, 2010 at 6:41 PM, Michael Segel <
>> >> michael_segel@hotmail.com>wrote:
>> >> >
>> >> > >
>> >> > >
>> >> > > UUID overkill?
>> >> > > Uhm uuid is a 128bit key. That's what 16 bytes in length? Definitely
>> >> not
>> >> > > 'overkill' if all you want the key to do is to guarantee uniqueness.
>> >> > >
>> >> > > Very easy to generate and extremely easy to use. You can even hash
>> it
>> >> and
>> >> > > create version 5 UUIDs.
>> >> > >
>> >> > > I don't understand why you'd want to try and generate an 8 byte (you
>> >> said 8
>> >> > > character, assuming you meant latin-1 characterset), when you have a
>> >> > > standard way of doing it already. 8 byte vs 16 byte?
>> C'mon....really?
>> >> > >
>> >> > > JMHO
>> >> > >
>> >> > > -Mike
>> >> > >
>> >> > > > Date: Sat, 13 Mar 2010 09:01:38 +0100
>> >> > > > Subject: Re: worth choosing the shortest possible column
>> names/keys?
>> >> > > > From: timrobertson100@gmail.com
>> >> > > > To: hbase-user@hadoop.apache.org
>> >> > > >
>> >> > > > Along similar lines... (sorry for hijacking thread)
>> >> > > >
>> >> > > > I assume that this is even more applicable for key choice given
>> the
>> >> way
>> >> > > keys
>> >> > > > participate in indexes?  I have been using UUID, but it is way
>> >> overkill
>> >> > > for
>> >> > > > my needs.  What are others using?  Is there convenient way of
>> doing
>> >> > > (e.g.) 8
>> >> > > > characters strings?
>> >> > > >
>> >> > >
>> >> > >
>> >> > >
>> >>
>> >>
>> >
>>
>

Re: UUID as key wuz: RE: worth choosing the shortest possible column names/keys?

Posted by Tim Robertson <ti...@gmail.com>.

Thanks Ryan, sounds ideal

How do you use
http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/client/HTable.html#incrementColumnValue(byte[],%20byte[],%20byte[],%20long)

To generate a row key please?

Thanks
Tim





On Mon, Mar 15, 2010 at 9:12 AM, Ryan Rawson <ry...@gmail.com> wrote:

> You can use incrementColumnValue to generate sequential numbers.  The
> call is atomic and fast.  It supports thousands of calls/second in my
> testing.
>
> -ryan
>
> On Mon, Mar 15, 2010 at 12:15 AM, Tim Robertson
> <ti...@gmail.com> wrote:
> >>
> >> Maybe I'm missing something but the UUID is an artificial key, its used
> to
> >> guarantee uniqueness and in this case you're using it as part of a
> key,value
> >> pair.
> >>
> >
> > Sure, understood.  UUID aims to be globally unique, whereas I am only
> > looking for in cluster uniqueness across a couple billion items, but an
> > algorithm that allows ID minting by machines in parallel.
> >
> >
> >> So why are you storing it in a Lucene index as the value?
> >>
> >
> > Because I have various search indexes to the row using combinations of
> > fields from the row.  I want the whole row accessible in the search
> results,
> > so I store the row key only (the row content is way to big for Lucene).
> >  Lucene handles the search providing the Keys, and then the rows are
> pulled
> > and transformed while streaming out in the results.
> >
> >
> >> Look, the benefits of using the UUID definitely outweigh wrapping your
> own
> >> solution in 8bytes, even in memory caches.
> >> (Are you only storing values that are 16 bytes in length, or something
> much
> >> larger?)
> >
> >
> > The values are much much larger (100s - 1000s bytes) but they aren't
> going
> > in to any in-memory structures.
> >
> >
> >
> >> > Date: Sun, 14 Mar 2010 19:09:48 +0100
> >> > Subject: Re: UUID as key wuz: RE: worth choosing the shortest possible
> >> column         names/keys?
> >> > From: timrobertson100@gmail.com
> >> > To: hbase-user@hadoop.apache.org
> >> >
> >> > Well I could well be wrong, but my understanding is that there are
> memory
> >> > mapped index files using the key, so key choice would come in to play
> for
> >> > memory requirements here.  For secondary indexes, it has to be a
> factor
> >> for
> >> > memory requirements- halving the size of the data you need to get in
> >> memory
> >> > must be a good thing.  I am also building Lucene indexes storing only
> >> this
> >> > key, so it influences their size a fair amount too.
> >> >
> >> > I know for sure Mysql (Myisam) btree index size is greatly affected by
> >> the
> >> > size of the Numeric types.  They are more complicated that my
> >> understanding
> >> > of HBase indexing, but the same principles apply (if it ain't in
> memory
> >> then
> >> > you're into disk seeking).
> >> >
> >> >
> >> >
> >> > On Sun, Mar 14, 2010 at 6:41 PM, Michael Segel <
> >> michael_segel@hotmail.com>wrote:
> >> >
> >> > >
> >> > >
> >> > > UUID overkill?
> >> > > Uhm uuid is a 128bit key. That's what 16 bytes in length? Definitely
> >> not
> >> > > 'overkill' if all you want the key to do is to guarantee uniqueness.
> >> > >
> >> > > Very easy to generate and extremely easy to use. You can even hash
> it
> >> and
> >> > > create version 5 UUIDs.
> >> > >
> >> > > I don't understand why you'd want to try and generate an 8 byte (you
> >> said 8
> >> > > character, assuming you meant latin-1 characterset), when you have a
> >> > > standard way of doing it already. 8 byte vs 16 byte?
> C'mon....really?
> >> > >
> >> > > JMHO
> >> > >
> >> > > -Mike
> >> > >
> >> > > > Date: Sat, 13 Mar 2010 09:01:38 +0100
> >> > > > Subject: Re: worth choosing the shortest possible column
> names/keys?
> >> > > > From: timrobertson100@gmail.com
> >> > > > To: hbase-user@hadoop.apache.org
> >> > > >
> >> > > > Along similar lines... (sorry for hijacking thread)
> >> > > >
> >> > > > I assume that this is even more applicable for key choice given
> the
> >> way
> >> > > keys
> >> > > > participate in indexes?  I have been using UUID, but it is way
> >> overkill
> >> > > for
> >> > > > my needs.  What are others using?  Is there convenient way of
> doing
> >> > > (e.g.) 8
> >> > > > characters strings?
> >> > > >
> >> > >
> >> > >
> >> > >
> >>
> >>
> >
>

Re: UUID as key wuz: RE: worth choosing the shortest possible column names/keys?

Posted by Ryan Rawson <ry...@gmail.com>.

You can use incrementColumnValue to generate sequential numbers.  The
call is atomic and fast.  It supports thousands of calls/second in my
testing.

-ryan

On Mon, Mar 15, 2010 at 12:15 AM, Tim Robertson
<ti...@gmail.com> wrote:
>>
>> Maybe I'm missing something but the UUID is an artificial key, its used to
>> guarantee uniqueness and in this case you're using it as part of a key,value
>> pair.
>>
>
> Sure, understood.  UUID aims to be globally unique, whereas I am only
> looking for in cluster uniqueness across a couple billion items, but an
> algorithm that allows ID minting by machines in parallel.
>
>
>> So why are you storing it in a Lucene index as the value?
>>
>
> Because I have various search indexes to the row using combinations of
> fields from the row.  I want the whole row accessible in the search results,
> so I store the row key only (the row content is way to big for Lucene).
>  Lucene handles the search providing the Keys, and then the rows are pulled
> and transformed while streaming out in the results.
>
>
>> Look, the benefits of using the UUID definitely outweigh wrapping your own
>> solution in 8bytes, even in memory caches.
>> (Are you only storing values that are 16 bytes in length, or something much
>> larger?)
>
>
> The values are much much larger (100s - 1000s bytes) but they aren't going
> in to any in-memory structures.
>
>
>
>> > Date: Sun, 14 Mar 2010 19:09:48 +0100
>> > Subject: Re: UUID as key wuz: RE: worth choosing the shortest possible
>> column         names/keys?
>> > From: timrobertson100@gmail.com
>> > To: hbase-user@hadoop.apache.org
>> >
>> > Well I could well be wrong, but my understanding is that there are memory
>> > mapped index files using the key, so key choice would come in to play for
>> > memory requirements here.  For secondary indexes, it has to be a factor
>> for
>> > memory requirements- halving the size of the data you need to get in
>> memory
>> > must be a good thing.  I am also building Lucene indexes storing only
>> this
>> > key, so it influences their size a fair amount too.
>> >
>> > I know for sure Mysql (Myisam) btree index size is greatly affected by
>> the
>> > size of the Numeric types.  They are more complicated that my
>> understanding
>> > of HBase indexing, but the same principles apply (if it ain't in memory
>> then
>> > you're into disk seeking).
>> >
>> >
>> >
>> > On Sun, Mar 14, 2010 at 6:41 PM, Michael Segel <
>> michael_segel@hotmail.com>wrote:
>> >
>> > >
>> > >
>> > > UUID overkill?
>> > > Uhm uuid is a 128bit key. That's what 16 bytes in length? Definitely
>> not
>> > > 'overkill' if all you want the key to do is to guarantee uniqueness.
>> > >
>> > > Very easy to generate and extremely easy to use. You can even hash it
>> and
>> > > create version 5 UUIDs.
>> > >
>> > > I don't understand why you'd want to try and generate an 8 byte (you
>> said 8
>> > > character, assuming you meant latin-1 characterset), when you have a
>> > > standard way of doing it already. 8 byte vs 16 byte? C'mon....really?
>> > >
>> > > JMHO
>> > >
>> > > -Mike
>> > >
>> > > > Date: Sat, 13 Mar 2010 09:01:38 +0100
>> > > > Subject: Re: worth choosing the shortest possible column names/keys?
>> > > > From: timrobertson100@gmail.com
>> > > > To: hbase-user@hadoop.apache.org
>> > > >
>> > > > Along similar lines... (sorry for hijacking thread)
>> > > >
>> > > > I assume that this is even more applicable for key choice given the
>> way
>> > > keys
>> > > > participate in indexes?  I have been using UUID, but it is way
>> overkill
>> > > for
>> > > > my needs.  What are others using?  Is there convenient way of doing
>> > > (e.g.) 8
>> > > > characters strings?
>> > > >
>> > >
>> > >
>> > >
>>
>>
>

Re: UUID as key wuz: RE: worth choosing the shortest possible column names/keys?

Posted by Tim Robertson <ti...@gmail.com>.

>
> Maybe I'm missing something but the UUID is an artificial key, its used to
> guarantee uniqueness and in this case you're using it as part of a key,value
> pair.
>

Sure, understood.  UUID aims to be globally unique, whereas I am only
looking for in-cluster uniqueness across a couple billion items, with an
algorithm that allows ID minting by machines in parallel.


> So why are you storing it in a Lucene index as the value?
>

Because I have various search indexes to the row using combinations of
fields from the row.  I want the whole row accessible in the search results,
so I store the row key only (the row content is way too big for Lucene).
Lucene handles the search, providing the keys, and then the rows are pulled
and transformed while streaming out the results.


> Look, the benefits of using the UUID definitely outweigh wrapping your own
> solution in 8bytes, even in memory caches.
> (Are you only storing values that are 16 bytes in length, or something much
> larger?)


The values are much, much larger (100s to 1000s of bytes) but they aren't going
into any in-memory structures.



> > Date: Sun, 14 Mar 2010 19:09:48 +0100
> > Subject: Re: UUID as key wuz: RE: worth choosing the shortest possible
> column         names/keys?
> > From: timrobertson100@gmail.com
> > To: hbase-user@hadoop.apache.org
> >
> > Well I could well be wrong, but my understanding is that there are memory
> > mapped index files using the key, so key choice would come in to play for
> > memory requirements here.  For secondary indexes, it has to be a factor
> for
> > memory requirements- halving the size of the data you need to get in
> memory
> > must be a good thing.  I am also building Lucene indexes storing only
> this
> > key, so it influences their size a fair amount too.
> >
> > I know for sure Mysql (Myisam) btree index size is greatly affected by
> the
> > size of the Numeric types.  They are more complicated that my
> understanding
> > of HBase indexing, but the same principles apply (if it ain't in memory
> then
> > you're into disk seeking).
> >
> >
> >
> > On Sun, Mar 14, 2010 at 6:41 PM, Michael Segel <
> michael_segel@hotmail.com>wrote:
> >
> > >
> > >
> > > UUID overkill?
> > > Uhm uuid is a 128bit key. That's what 16 bytes in length? Definitely
> not
> > > 'overkill' if all you want the key to do is to guarantee uniqueness.
> > >
> > > Very easy to generate and extremely easy to use. You can even hash it
> and
> > > create version 5 UUIDs.
> > >
> > > I don't understand why you'd want to try and generate an 8 byte (you
> said 8
> > > character, assuming you meant latin-1 characterset), when you have a
> > > standard way of doing it already. 8 byte vs 16 byte? C'mon....really?
> > >
> > > JMHO
> > >
> > > -Mike
> > >
> > > > Date: Sat, 13 Mar 2010 09:01:38 +0100
> > > > Subject: Re: worth choosing the shortest possible column names/keys?
> > > > From: timrobertson100@gmail.com
> > > > To: hbase-user@hadoop.apache.org
> > > >
> > > > Along similar lines... (sorry for hijacking thread)
> > > >
> > > > I assume that this is even more applicable for key choice given the
> way
> > > keys
> > > > participate in indexes?  I have been using UUID, but it is way
> overkill
> > > for
> > > > my needs.  What are others using?  Is there convenient way of doing
> > > (e.g.) 8
> > > > characters strings?
> > > >
> > >
> > >
> > >
>
>

RE: UUID as key wuz: RE: worth choosing the shortest possible column names/keys?

Posted by Michael Segel <mi...@hotmail.com>.


Maybe I'm missing something, but the UUID is an artificial key; it's used to guarantee uniqueness, and in this case you're using it as part of a key/value pair.

So why are you storing it in a Lucene index as the value?

Look, the benefits of using the UUID definitely outweigh rolling your own solution in 8 bytes, even in memory caches.
(Are you only storing values that are 16 bytes in length, or something much larger?)


> Date: Sun, 14 Mar 2010 19:09:48 +0100
> Subject: Re: UUID as key wuz: RE: worth choosing the shortest possible column names/keys?
> From: timrobertson100@gmail.com
> To: hbase-user@hadoop.apache.org
> 
> Well I could well be wrong, but my understanding is that there are memory
> mapped index files using the key, so key choice would come in to play for
> memory requirements here.  For secondary indexes, it has to be a factor for
> memory requirements- halving the size of the data you need to get in memory
> must be a good thing.  I am also building Lucene indexes storing only this
> key, so it influences their size a fair amount too.
> 
> I know for sure Mysql (Myisam) btree index size is greatly affected by the
> size of the Numeric types.  They are more complicated that my understanding
> of HBase indexing, but the same principles apply (if it ain't in memory then
> you're into disk seeking).
> 
> 
> 
> On Sun, Mar 14, 2010 at 6:41 PM, Michael Segel <mi...@hotmail.com>wrote:
> 
> >
> >
> > UUID overkill?
> > Uhm uuid is a 128bit key. That's what 16 bytes in length? Definitely not
> > 'overkill' if all you want the key to do is to guarantee uniqueness.
> >
> > Very easy to generate and extremely easy to use. You can even hash it and
> > create version 5 UUIDs.
> >
> > I don't understand why you'd want to try and generate an 8 byte (you said 8
> > character, assuming you meant latin-1 characterset), when you have a
> > standard way of doing it already. 8 byte vs 16 byte? C'mon....really?
> >
> > JMHO
> >
> > -Mike
> >
> > > Date: Sat, 13 Mar 2010 09:01:38 +0100
> > > Subject: Re: worth choosing the shortest possible column names/keys?
> > > From: timrobertson100@gmail.com
> > > To: hbase-user@hadoop.apache.org
> > >
> > > Along similar lines... (sorry for hijacking thread)
> > >
> > > I assume that this is even more applicable for key choice given the way
> > keys
> > > participate in indexes?  I have been using UUID, but it is way overkill
> > for
> > > my needs.  What are others using?  Is there convenient way of doing
> > (e.g.) 8
> > > characters strings?
> > >
> >
> >
> >
 		 	   		  

Re: UUID as key wuz: RE: worth choosing the shortest possible column names/keys?

Posted by Tim Robertson <ti...@gmail.com>.

Well, I could well be wrong, but my understanding is that there are
memory-mapped index files using the key, so key choice would come into play
for memory requirements here.  For secondary indexes, it has to be a factor
for memory requirements: halving the size of the data you need to get in
memory must be a good thing.  I am also building Lucene indexes storing only
this key, so it influences their size a fair amount too.

I know for sure MySQL (MyISAM) B-tree index size is greatly affected by the
size of the numeric types.  They are more complicated than my understanding
of HBase indexing, but the same principles apply (if it ain't in memory then
you're into disk seeking).

On Sun, Mar 14, 2010 at 6:41 PM, Michael Segel <mi...@hotmail.com>wrote:

>
>
> UUID overkill?
> Uhm uuid is a 128bit key. That's what 16 bytes in length? Definitely not
> 'overkill' if all you want the key to do is to guarantee uniqueness.
>
> Very easy to generate and extremely easy to use. You can even hash it and
> create version 5 UUIDs.
>
> I don't understand why you'd want to try and generate an 8 byte (you said 8
> character, assuming you meant latin-1 characterset), when you have a
> standard way of doing it already. 8 byte vs 16 byte? C'mon....really?
>
> JMHO
>
> -Mike
>
> > Date: Sat, 13 Mar 2010 09:01:38 +0100
> > Subject: Re: worth choosing the shortest possible column names/keys?
> > From: timrobertson100@gmail.com
> > To: hbase-user@hadoop.apache.org
> >
> > Along similar lines... (sorry for hijacking thread)
> >
> > I assume that this is even more applicable for key choice given the way
> keys
> > participate in indexes?  I have been using UUID, but it is way overkill
> for
> > my needs.  What are others using?  Is there convenient way of doing
> (e.g.) 8
> > characters strings?
> >
>
>
>

UUID as key wuz: RE: worth choosing the shortest possible column names/keys?

Posted by Michael Segel <mi...@hotmail.com>.


UUID overkill?
Uhm, a UUID is a 128-bit key. That's what, 16 bytes in length? Definitely not 'overkill' if all you want the key to do is to guarantee uniqueness.

Very easy to generate and extremely easy to use. You can even hash it and create version 5 UUIDs.

I don't understand why you'd want to try and generate an 8-byte key (you said 8 characters; I'm assuming you meant the Latin-1 character set) when you have a standard way of doing it already. 8 bytes vs 16 bytes? C'mon....really?
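
For reference, the two key sizes being compared can be produced like this. A
minimal sketch using only the JDK; the class and method names are
illustrative.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Illustrative helpers showing the raw byte size of each kind of key.
final class KeySizes {
    // Packs a UUID into its raw 16-byte form (two big-endian longs).
    static byte[] uuidToBytes(UUID uuid) {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    // Packs a long ID into 8 bytes.
    static byte[] longToBytes(long id) {
        return ByteBuffer.allocate(Long.BYTES).putLong(id).array();
    }
}
```

Note that the 16-byte figure only applies to the binary form: the usual
textual form from UUID.toString() is 36 characters, so a UUID stored as a
string costs more than twice as much per key.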

JMHO

-Mike
 
> Date: Sat, 13 Mar 2010 09:01:38 +0100
> Subject: Re: worth choosing the shortest possible column names/keys?
> From: timrobertson100@gmail.com
> To: hbase-user@hadoop.apache.org
> 
> Along similar lines... (sorry for hijacking thread)
> 
> I assume that this is even more applicable for key choice given the way keys
> participate in indexes?  I have been using UUID, but it is way overkill for
> my needs.  What are others using?  Is there convenient way of doing (e.g.) 8
> characters strings?
> 

 		 	   		  

Re: worth choosing the shortest possible column names/keys?

Posted by Stack <st...@duboce.net>.

Have you looked at the MurmurHash implementation that is in HBase, Tim?  It
has good characteristics -- faster than Jenkins, with a 32-bit or 64-bit
result.  See http://sites.google.com/site/murmurhash/.  The conversion to
Java was done by Andrzej.  Way cheaper than UUID'ing and much smaller.
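
Unlike a counter, a hash of existing data is only probabilistically unique,
so it may be worth estimating the collision risk before using one as a row
key. A rough sketch using the standard birthday approximation; the class and
method names are illustrative, and it assumes hash values are uniformly
distributed.

```java
// Illustrative estimate of how likely hashed keys are to collide.
final class HashKeyCollisions {
    // Birthday approximation: probability that at least two of n uniformly
    // distributed hash values of the given bit width collide,
    // p = 1 - exp(-n(n-1) / (2 * 2^bits)).
    static double collisionProbability(double n, int bits) {
        double space = Math.pow(2.0, bits);
        return -Math.expm1(-n * (n - 1.0) / (2.0 * space));
    }
}
```

For the "couple billion items" mentioned elsewhere in this thread,
collisionProbability(2e9, 64) comes out at roughly 10%, and with a 32-bit
hash a collision is practically certain, so a hashed key would still need a
plan for handling duplicates.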

St.Ack

On Sat, Mar 13, 2010 at 12:01 AM, Tim Robertson
<ti...@gmail.com> wrote:
> Along similar lines... (sorry for hijacking thread)
>
> I assume that this is even more applicable for key choice given the way keys
> participate in indexes?  I have been using UUID, but it is way overkill for
> my needs.  What are others using?  Is there a convenient way of doing (e.g.)
> 8-character strings?

Re: worth choosing the shortest possible column names/keys?

Posted by Tim Robertson <ti...@gmail.com>.

Along similar lines... (sorry for hijacking thread)

I assume that this is even more applicable for key choice given the way keys
participate in indexes?  I have been using UUID, but it is way overkill for
my needs.  What are others using?  Is there a convenient way of doing (e.g.)
8-character strings?




On Fri, Mar 12, 2010 at 9:15 PM, Kay Kay <ka...@gmail.com> wrote:

> Some of our current experience goes along similar lines: we saw
> ~20-30% RAM savings by using abbreviations in the key space.
>
> But the biggest advantage actually came from defining the right schema and
> column families for the query pattern of the jobs. We keep the number of
> column families to no more than 5 and have relatively *thin* columns, but we
> revisit the schema and add more tables if that gets stretched, as applicable
> of course.

Re: worth choosing the shortest possible column names/keys?

Posted by Kay Kay <ka...@gmail.com>.

Some of our current experience goes along similar lines: we saw 
~20-30% RAM savings by using abbreviations in the key space.

But the biggest advantage actually came from defining the right schema 
and column families for the query pattern of the jobs. We keep the 
number of column families to no more than 5 and have relatively *thin* 
columns, but we revisit the schema and add more tables if that gets 
stretched, as applicable of course.
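
To see where savings like these can come from, here is a back-of-the-envelope sketch of the uncompressed size of a single cell. It assumes the commonly documented KeyValue layout, in which every cell repeats the row key, family and qualifier alongside fixed-size length, timestamp and type fields; the layout constants are my assumption, not something stated in this thread. The class CellSizeSketch and the one-letter family "d" are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical estimator for the per-cell size under an assumed KeyValue layout.
final class CellSizeSketch {
    private CellSizeSketch() {}

    // Assumed fixed bytes per cell: key length (4) + value length (4)
    // + row length (2) + family length (1) + timestamp (8) + key type (1).
    static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1;

    /** Estimated uncompressed bytes for one cell. */
    static int cellSize(String row, String family, String qualifier, String value) {
        return FIXED_OVERHEAD + len(row) + len(family) + len(qualifier) + len(value);
    }

    private static int len(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Using the example from the start of this thread, cellSize("c1234", "d", "fn", "John") comes to 32 bytes, while cellSize("customer_1234", "d", "FirstName", "John") comes to 47. Compression will narrow that gap on disk, so treat the figures as an upper bound on the difference rather than a measurement.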

On 3/12/10 12:02 PM, Lars Francke wrote:
>> Will I save a lot of space (especially if I have many small columns)?
>>      
> I don't have any hard numbers for you but I tested it and I remember
> that on a dataset of about 10-20 GB I could save about 200-500 MB
> (this was with compression enabled) just by not using descriptive
> string qualifiers that weren't data in themselves. A lot of small columns
> for me too (mostly counters). I use a simple mapping of short byte
> arrays to strings so that it is still very easy to use in the client.
>
> I asked that very same question a few months ago on IRC but I think
> nobody answered so I'd be interested in what others do as well.
>
> Cheers,
> Lars
>

Re: worth choosing the shortest possible column names/keys?

Posted by Lars Francke <la...@gmail.com>.

> Will I save a lot of space (especially if I have many small columns)?

I don't have any hard numbers for you but I tested it and I remember
that on a dataset of about 10-20 GB I could save about 200-500 MB
(this was with compression enabled) just by not using descriptive
string qualifiers that weren't data in themselves. A lot of small columns
for me too (mostly counters). I use a simple mapping of short byte
arrays to strings so that it is still very easy to use in the client.
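
The mapping described above might look something like the following sketch: an enum that keeps the descriptive name in client code and the short byte array on the wire. This is a guess at the general shape, not the actual code in use; the enum Col and the codes "fn", "ln" and "lc" are hypothetical (the first two are borrowed from the example that opened this thread).

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical mapping between descriptive column names and short qualifiers.
enum Col {
    FIRST_NAME("fn"),
    LAST_NAME("ln"),
    LOGIN_COUNT("lc");

    // Reverse lookup table, filled once after the constants are created.
    private static final Map<String, Col> BY_CODE = new HashMap<>();
    static {
        for (Col c : values()) {
            BY_CODE.put(c.code, c);
        }
    }

    private final String code;

    Col(String code) {
        this.code = code;
    }

    /** Short qualifier bytes to store in the table. */
    byte[] qualifier() {
        return code.getBytes(StandardCharsets.US_ASCII);
    }

    /** Maps stored qualifier bytes back to the descriptive constant. */
    static Col fromQualifier(byte[] qualifier) {
        Col c = BY_CODE.get(new String(qualifier, StandardCharsets.US_ASCII));
        if (c == null) {
            throw new IllegalArgumentException("unknown qualifier");
        }
        return c;
    }
}
```

Client code then reads Col.FIRST_NAME.qualifier() where it would otherwise spell out a long string, so readability stays in the source while only two bytes per cell go into the table.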

I asked that very same question a few months ago on IRC but I think
nobody answered so I'd be interested in what others do as well.

Cheers,
Lars