Posted to common-user@hadoop.apache.org by edward choi <mp...@gmail.com> on 2011/10/04 07:58:30 UTC

Adjusting column value size.

Hi,

I have a question about performance and column value size.
I need to store several million integers per row ("several million" is the
important part).
I was wondering which of these two methods would perform better.

1) Store each integer in its own column, so that reading a row fetches
several million columns. The client would then map each column value into
some kind of container (e.g. vector, ArrayList).
2) Store, for example, a thousand integers in a single column (by
concatenating them), so that reading a row fetches only several thousand
columns. The client would have to split each column value into 4-byte
chunks and map the resulting integers into some kind of container (e.g.
vector, ArrayList).

I am curious which approach would be better. 1) fetches several million
columns but needs no additional processing. 2) fetches only several
thousand columns but needs additional processing on the client.
Any advice would be appreciated.

Ed

Re: Adjusting column value size.

Posted by edward choi <mp...@gmail.com>.
Yes, I need all of those ints at the same time. And no, there is no
streaming.

I have decided to pack 1024 ints into one cell, so that each cell value is
4 KB in size.
I am already using LZO on my tables.
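
For readers following along, here is a minimal sketch of the packing step
described above: 1024 ints concatenated into one 4096-byte cell value, and
split back into ints on read. It uses only the JDK. The class and method
names (IntPacking, pack, packAll, unpack) are hypothetical and are not from
this thread or from HBase, and writing the resulting byte arrays to a table
is left out.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase: packs ints into cell-sized byte arrays.
final class IntPacking {
    // 1024 ints * 4 bytes each = 4096 bytes per cell value.
    static final int INTS_PER_CELL = 1024;

    // Concatenates `count` ints, starting at `offset`, into one byte array.
    // ByteBuffer writes big-endian by default; unpack() reads the same order.
    static byte[] pack(int[] values, int offset, int count) {
        ByteBuffer buf = ByteBuffer.allocate(count * Integer.BYTES);
        for (int i = 0; i < count; i++) {
            buf.putInt(values[offset + i]);
        }
        return buf.array();
    }

    // Splits a long int array into one byte array per cell.
    // The last cell is shorter when the length is not a multiple of 1024.
    static List<byte[]> packAll(int[] values) {
        List<byte[]> cells = new ArrayList<>();
        for (int offset = 0; offset < values.length; offset += INTS_PER_CELL) {
            int count = Math.min(INTS_PER_CELL, values.length - offset);
            cells.add(pack(values, offset, count));
        }
        return cells;
    }

    // Splits one cell value back into its 4-byte ints.
    static int[] unpack(byte[] cellValue) {
        if (cellValue.length % Integer.BYTES != 0) {
            throw new IllegalArgumentException("cell length is not a multiple of 4");
        }
        ByteBuffer buf = ByteBuffer.wrap(cellValue);
        int[] out = new int[cellValue.length / Integer.BYTES];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getInt();
        }
        return out;
    }
}
```

Returning an int[] rather than a list of boxed Integers keeps the extra
processing on read small, which matters when a row holds several million
values.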

I'll run some experiments once I finish implementing both approaches.
I'll post a follow-up thread with the results when I am done.
Thanks for the advice.

Ed.

2011/10/7 Jean-Daniel Cryans <jd...@apache.org>

> (BCC'd common-user@ since this seems strictly HBase related)
>
> Interesting question... And you probably need all those ints at the same
> time right? No streaming? I'll assume no.
>
> So the second solution seems better due to the overhead of storing each
> cell. Basically, storing one int per cell you would end up storing more
> keys than values (size wise).
>
> Another thing is that if you pack enough ints together and there's some
> sort of repetition, you might be able to use LZO compression on that
> table.
>
> I'd love to hear about your experimentations once you've done them.
>
> J-D
>
> On Mon, Oct 3, 2011 at 10:58 PM, edward choi <mp...@gmail.com> wrote:
>
> > Hi,
> >
> > I have a question regarding the performance and column value size.
> > I need to store per row several million integers. ("Several million" is
> > important here)
> > I was wondering which method would be more beneficial performance wise.
> >
> > 1) Store each integer to a single column so that when a row is called,
> > several million columns will also be called. And the user would map each
> > column values to some kind of container (ex: vector, arrayList)
> > 2) Store, for example, a thousand integers into a single column (by
> > concatenating them) so that when a row is called, only several thousand
> > columns will be called along. The user would have to split the column
> > value into 4 bytes and map the split integer to some kind of container
> > (ex: vector, arrayList)
> >
> > I am curious which approach would be better. 1) would call several
> > millions of columns but no additional process is needed. 2) would call
> > only several thousands of columns but additional process is needed.
> > Any advice would be appreciated.
> >
> > Ed
> >
>

Re: Adjusting column value size.

Posted by Jean-Daniel Cryans <jd...@apache.org>.
(BCC'd common-user@ since this seems strictly HBase related)

Interesting question... And you probably need all those ints at the same
time, right? No streaming? I'll assume no.

So the second solution seems better due to the overhead of storing each
cell. Basically, with one int per cell you would end up storing more key
bytes than value bytes, since every cell carries its full key.
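
To put rough numbers on that overhead, here is a back-of-the-envelope
sketch. It assumes the classic KeyValue layout, where every cell stores its
row key, family, qualifier, timestamp, and type next to the value. The row,
family, and qualifier lengths are made-up inputs, the class and method names
are hypothetical, and block indexes, compression, and other file overhead
are ignored.

```java
// Hypothetical estimator, not part of HBase.
final class CellOverhead {
    // Fixed bytes per cell under the assumed KeyValue layout:
    // key length (4) + value length (4) + row length (2)
    // + family length (1) + timestamp (8) + key type (1).
    static final int FIXED_BYTES = 4 + 4 + 2 + 1 + 8 + 1;

    // Non-value bytes stored with every cell.
    static long overheadBytes(int rowLen, int familyLen, int qualifierLen) {
        return FIXED_BYTES + rowLen + familyLen + qualifierLen;
    }

    // Uncompressed bytes needed to store intCount ints, intsPerCell per cell.
    static long totalBytes(long intCount, int intsPerCell,
                           int rowLen, int familyLen, int qualifierLen) {
        long cells = (intCount + intsPerCell - 1) / intsPerCell; // round up
        return cells * overheadBytes(rowLen, familyLen, qualifierLen)
                + intCount * Integer.BYTES;
    }
}
```

With a 16-byte row key, a 1-byte family, and a 4-byte qualifier (all
assumed), each cell carries 41 bytes besides its value. For 4,000,000 ints
that comes to 180,000,000 bytes at one int per cell, versus 16,160,187 bytes
at 1024 ints per cell.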

Another thing is that if you pack enough ints together and there's some sort
of repetition, you might be able to use LZO compression on that table.
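
As a rough illustration of why repetition helps, the sketch below packs
1024 ints into a 4096-byte array and measures how small it compresses. LZO
is not in the JDK, so this uses java.util.zip.Deflater (DEFLATE) purely as a
stand-in; the ratios will differ from LZO. HBase also compresses whole file
blocks rather than single cells, so this only shows the effect of
repetition, not what a table would actually achieve. The class and method
names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.zip.Deflater;

// Hypothetical demo, not part of HBase; DEFLATE stands in for LZO.
final class PackedCellCompression {
    // Concatenates ints into one byte array, 4 big-endian bytes each.
    static byte[] pack(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    // Returns the number of bytes the input occupies after DEFLATE.
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] chunk = new byte[1024];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(chunk);
            }
            return total;
        } finally {
            deflater.end(); // release native resources
        }
    }
}
```

Calling compressedSize(pack(values)) on a cell whose ints repeat (for
example, values cycling through a small set) should return far fewer than
4096 bytes, while a cell of random ints will barely shrink.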

I'd love to hear about your experiments once you've done them.

J-D

On Mon, Oct 3, 2011 at 10:58 PM, edward choi <mp...@gmail.com> wrote:

> Hi,
>
> I have a question regarding the performance and column value size.
> I need to store per row several million integers. ("Several million" is
> important here)
> I was wondering which method would be more beneficial performance wise.
>
> 1) Store each integer to a single column so that when a row is called,
> several million columns will also be called. And the user would map each
> column values to some kind of container (ex: vector, arrayList)
> 2) Store, for example, a thousand integers into a single column (by
> concatenating them) so that when a row is called, only several thousand
> columns will be called along. The user would have to split the column value
> into 4 bytes and map the split integer to some kind of container (ex:
> vector, arrayList)
>
> I am curious which approach would be better. 1) would call several millions
> of columns but no additional process is needed. 2) would call only several
> thousands of columns but additional process is needed.
> Any advice would be appreciated.
>
> Ed
>
