Posted to user@hbase.apache.org by Derek Wollenstein <de...@klout.com> on 2012/04/02 22:10:47 UTC

Key Design Question for list data

We're looking at how to store a large amount of (per-user) list data in
hbase, and we were trying to figure out what kind of access pattern made
the most sense.  One option is to store the majority of the data in a key, so
we could have something like
<FixedWidthUserName><FixedWidthValueId1>:"" (no value)
<FixedWidthUserName><FixedWidthValueId2>:"" (no value)
<FixedWidthUserName><FixedWidthValueId3>:"" (no value)

The other option we had was to do this entirely using
<FixedWidthUserName><FixedWidthPageNum0>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
<FixedWidthUserName><FixedWidthPageNum1>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...

where each row would contain multiple values.
So in one case reading the first thirty values would be
scan { STARTROW => 'FixedWidthUserName', LIMIT => 30 }
And in the second case it would be
get 'FixedWidthUserName\x00\x00\x00\x00'
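For concreteness, the two key layouts can be sketched in a few lines (an illustration only; the 8-byte field widths and the big-endian 4-byte page number are assumptions, not anything HBase requires):

```python
import struct

USER_W = 8   # assumed fixed width for user names
VALUE_W = 8  # assumed fixed width for value ids

def pad(s: str, width: int) -> bytes:
    """NUL-pad to a fixed width so keys align and sort predictably."""
    b = s.encode("utf-8")
    assert len(b) <= width
    return b + b"\x00" * (width - len(b))

# Option 1 ("tall"): one row per value; read by scanning from the user's prefix.
def tall_row_key(user: str, value_id: str) -> bytes:
    return pad(user, USER_W) + pad(value_id, VALUE_W)

# Option 2 (paginated): one row per page; read page 0 with a single get.
def paged_row_key(user: str, page: int) -> bytes:
    return pad(user, USER_W) + struct.pack(">I", page)  # 4-byte page number

scan_start = pad("user123", USER_W)        # STARTROW for the tall scan
first_page = paged_row_key("user123", 0)   # the get '...\x00\x00\x00\x00' key
```

With page 0 encoded as four zero bytes, the paginated get key ends in `\x00\x00\x00\x00`, matching the example above.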

The general usage pattern would be to read only the first 30 values of
these lists, with infrequent access reading deeper into the lists.  Some
users would have <= 30 total values in these lists, and some users would
have millions (i.e., a power-law distribution).

   The single-value format seems like it would take up more space on hbase,
but would offer some improved retrieval / pagination flexibility.  Would
there be any significant performance advantages to be able to paginate via
gets vs paginating with scans?
    My initial understanding was that doing a scan should be faster if our
paging size is unknown (and caching is set appropriately), but that gets
should be faster if we'll always need the same page size.  I've ended up
hearing different people tell me opposite things about performance.  I
assume the page sizes would be relatively consistent, so for most use cases
we could guarantee that we only wanted one page of data in the
fixed-page-length case.  I would also assume that we would have infrequent
updates, but may have inserts into the middle of these lists (meaning we'd
need to update all subsequent rows).
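The cost of those mid-list inserts in the paged layout can be illustrated with a toy model (pure Python; the page size of 100 is an assumption):

```python
import bisect

PAGE_SIZE = 100  # assumed page size

def insert_value(pages: list[list[str]], value: str) -> int:
    """Insert `value` in sorted order across fixed-size pages.

    Returns how many pages had to be rewritten: every page from the
    insertion point onward shifts by one value.
    """
    flat = sorted(v for page in pages for v in page)
    pos = bisect.bisect_left(flat, value)
    flat.insert(pos, value)
    first_dirty = pos // PAGE_SIZE
    pages[:] = [flat[i:i + PAGE_SIZE] for i in range(0, len(flat), PAGE_SIZE)]
    return len(pages) - first_dirty

pages = [[f"v{i:04d}" for i in range(j, j + PAGE_SIZE)]
         for j in range(0, 1000, PAGE_SIZE)]
tail_cost = insert_value(pages, "v9999z")  # lands in the last page: cheap
head_cost = insert_value(pages, "v0000a")  # lands in the first page: rewrites every page
```

An insert near the head of a long list touches every subsequent page, which is the update amplification being worried about here.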

Thanks for help / suggestions / followup questions

--Derek

Re: Key Design Question for list data

Posted by Derek Wollenstein <de...@klout.com>.
Ian --
  Thanks again for the clarifications.  When I mentioned the pagination
idea, I was proposing to put my own encoding format inside of the value, so
it would look like the following (using JSON just to make the format more
writable)
user123:0 ->
 {
   "nextPage" = 1,
   "values" = [ "val1","val2","val3"/*, ... */ ]
 }

So in that case, user123 would occur once, vs
user123:val1 -> ""
user123:val2 -> ""
user123:val3 -> ""

In our case the number of values will be somewhere from the hundreds to the
millions with a power-law distribution, so I think the savings could be
significant if we batch this in pages of 100.
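A page value along the lines of the `<FixedWidthLength><FixedIdNextPageNum><ValueId>...` sketch from the original post could be encoded roughly like this (a hedged illustration; the 8-byte value width and 4-byte header fields are assumptions):

```python
import struct

VALUE_W = 8  # assumed fixed width of each value id

def encode_page(next_page: int, value_ids: list[bytes]) -> bytes:
    """Pack <count><next-page><valueid>... as fixed-width big-endian fields."""
    assert all(len(v) == VALUE_W for v in value_ids)
    return struct.pack(">II", len(value_ids), next_page) + b"".join(value_ids)

def decode_page(blob: bytes) -> tuple[int, list[bytes]]:
    """Unpack a page back into (next_page, value_ids)."""
    count, next_page = struct.unpack_from(">II", blob, 0)
    ids = [blob[8 + i * VALUE_W : 8 + (i + 1) * VALUE_W] for i in range(count)]
    return next_page, ids

blob = encode_page(1, [b"val00001", b"val00002", b"val00003"])
nxt, ids = decode_page(blob)
```

Fixed widths keep decoding trivial, at the cost of padding short ids.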

That said, I've also heard about the upcoming prefix compression work in
HBase 0.94.  It's one of the things pushing me towards the simpler format.
On Wed, Apr 4, 2012 at 11:15 AM, Ian Varley <iv...@salesforce.com> wrote:


Re: Key Design Question for list data

Posted by Ian Varley <iv...@salesforce.com>.
Great, glad to help Derek! One clarification, on space usage: on disk, the row key (and column family) is repeated for every cell. So if you have rows like:

user123:  columnfamily1: col1:val1, col2:val2, col3:val3
user234:  columnfamily1: col1:val1, col2:val2, col3:val3

What you really get on disk is:

user123:columnfamily1:col1=val1
user123:columnfamily1:col2=val2
user123:columnfamily1:col3=val3
user234:columnfamily1:col1=val1
user234:columnfamily1:col2=val2
user234:columnfamily1:col3=val3

So, it really is roughly the same order of storage, whether you make it tall or wide. The key difference is that you get atomicity on a single row, so you get the benefits (and detriments) of that if you choose that approach.
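That "roughly the same" claim can be checked with a back-of-the-envelope estimate (byte counts are illustrative; timestamps and per-cell overhead are ignored, and the names are made up):

```python
def cell_bytes(row_key: str, family: str, qualifier: str, value: str) -> int:
    """Bytes of identifying data stored with one cell (the full key repeats per cell)."""
    return len(row_key) + len(family) + len(qualifier) + len(value)

N = 100  # values for one user

# Tall: the value id lives in the row key; one fixed qualifier "v".
tall = sum(cell_bytes("user123" + f"val{i:05d}", "f", "v", "") for i in range(N))

# Wide: one shared row key; the value id moves into the qualifier.
wide = sum(cell_bytes("user123", "f", f"val{i:05d}", "") for i in range(N))

# The identifying bytes mostly shift between row key and qualifier,
# so the two layouts land close to each other.
```

The difference is only the repeated short qualifier versus the repeated user prefix, which is exactly the part that key/prefix compression shrinks further.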

Note also that storefile compression *heavily* mitigates this repeating on disk (in my experience, you end up getting space usage on par with an RDBMS (or better) if you use compression). And prefix compression in memory is on its way, too; so it's important to try it out in practice and see how those things change the storage / cpu picture for you. :)

Ian

On Apr 4, 2012, at 1:01 PM, Derek Wollenstein wrote:


Re: Key Design Question for list data

Posted by Derek Wollenstein <de...@klout.com>.
Ian --
   Thanks for the summary -- I think this concurs with the kind of
questions I had.  It does seem like we would still save *some* space with a
paginated implementation (user123 is repeated fewer times, even if
everything else is stored the same number of times), but I agree that the
savings may not be that large.  I was just trying to see if there were any
strong experiences either way.  It's true that we can start simple, and
then optimize.  My main concern is that reloading everything if we make a
bad choice is somewhat expensive as well.  Thanks for the excellent summary
of the issue.  I'll need to go ahead and talk with our other engineers
about reverting to a simpler implementation.

--Derek

On Tue, Apr 3, 2012 at 12:59 PM, Ian Varley <iv...@salesforce.com> wrote:


Re: Key Design Question for list data

Posted by Jacques <wh...@gmail.com>.
Great summary!  The one other thing I would add is that your
consistency requirements need to be considered.  If you want to make atomic
list changes, you'll need to go wide.
On Apr 3, 2012 1:00 PM, "Ian Varley" <iv...@salesforce.com> wrote:


Re: Key Design Question for list data

Posted by Ian Varley <iv...@salesforce.com>.
Hi Derek,

If I understand you correctly, you're ultimately trying to store triples in the form "user, valueid, value", right? E.g., something like:

"user123, firstname, Paul",
"user234, lastname, Smith"

(But the usernames are fixed width, and the valueids are fixed width).

And, your access pattern is along the lines of: "for user X, list the next 30 values, starting with valueid Y". Is that right? And these values should be returned sorted by valueid?

The tl;dr version is that you should probably go with one row per user+value, and not build a complicated intra-row pagination scheme on your own unless you're really sure it is needed.

Your two options mirror a common question people have when designing HBase schemas: should I go "tall" or "wide"? Your first schema is "tall": each row represents one value for one user, and so there are many rows in the table for each user; the row key is user + valueid, and there would be (presumably) a single column qualifier that means "the value". This is great if you want to scan over rows in sorted order by row key (thus my question above, about whether these ids are sorted correctly). You can start a scan at any user+valueid, read the next 30, and be done. What you're giving up is the ability to have transactional guarantees around all the rows for one user, but it doesn't sound like you need that. Doing it this way is generally recommended (see here<http://hbase.apache.org/book.html#schema.smackdown>).
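The "start a scan at any user+valueid, read the next 30" access can be mimicked over any sorted collection (a toy stand-in for HBase's sorted row keys, not actual client code; key formats are assumed):

```python
from bisect import bisect_left

# Toy sorted "table": row key = user + fixed-width value id.
rows = sorted(f"user123val{i:05d}" for i in range(1000))

def scan_next(rows: list[str], user: str, start_value: str,
              limit: int = 30) -> list[str]:
    """Start at user+start_value and take up to `limit` rows for that user."""
    i = bisect_left(rows, user + start_value)
    out = []
    for key in rows[i : i + limit]:
        if not key.startswith(user):
            break  # ran past this user's prefix
        out.append(key)
    return out

page = scan_next(rows, "user123", "val00040")
```

Because row keys sort lexicographically, any page of 30 is one seek plus a short sequential read.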

Your second option is "wide": you store a bunch of values in one row, using different qualifiers (where the qualifier is the valueid). The simple way to do that would be to just store ALL values for one user in a single row. I'm guessing you jumped to the "paginated" version because you're assuming that storing millions of columns in a single row would be bad for performance, which may or may not be true; as long as you're not trying to do too much in a single request, or do things like scanning over and returning all of the cells in the row, it shouldn't be fundamentally worse. The client has methods that allow you to get specific slices of columns.
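In the "wide" layout the pagination moves into the qualifiers. The HBase client can do this server-side (e.g. with ColumnPaginationFilter, or a qualifier range on a Get), but the semantics are easy to show with a plain-Python stand-in; the 4-byte qualifier width and the dict-as-row representation here are illustrative assumptions:

```python
from bisect import bisect_left

def column_slice(row: dict, start_qualifier: bytes, limit: int) -> dict:
    # Qualifiers within an HBase row are stored in sorted order; emulate that.
    quals = sorted(row)
    i = bisect_left(quals, start_qualifier)
    return {q: row[q] for q in quals[i:i + limit]}

# One wide row for one user: 100 value-id qualifiers, empty values.
wide_row = {v.to_bytes(4, "big"): b"" for v in range(100)}

first_page = column_slice(wide_row, (0).to_bytes(4, "big"), 30)
next_page = column_slice(wide_row, (30).to_bytes(4, "big"), 30)
```

The key point is that you never have to materialize the whole row to read a page: a single Get against one row key, with a column filter, returns just the 30 cells you asked for.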

Note that neither case fundamentally uses more disk space than the other; you're just "shifting" part of the identifying information for a value either to the left (into the row key, in option one) or to the right (into the column qualifiers in option 2). Under the covers, every key/value still stores the whole row key, and column family name. (If this is a bit confusing, take an hour and watch Lars George's excellent video about understanding HBase schema design: http://www.youtube.com/watch?v=_HLoH_PgrLk).
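A quick back-of-the-envelope check of that "shifting" claim, using assumed widths (8-byte user, 8-byte valueid, 1-byte family and tall-schema qualifier) and ignoring the per-cell timestamp/type overhead that both layouts pay equally:

```python
USER_W, VALUE_W, FAMILY_W = 8, 8, 1  # assumed widths; family name "f" is 1 byte

# Tall: the row key carries user+valueid; one short qualifier ("v") per cell.
tall_per_cell = (USER_W + VALUE_W) + FAMILY_W + 1

# Wide: the row key carries only the user; the valueid moves into the qualifier.
wide_per_cell = USER_W + FAMILY_W + VALUE_W

# The identifying bytes stored per cell are essentially the same either way.
```

The valueid bytes are stored once per cell in both layouts; they just live in a different part of the KeyValue.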

A manually paginated version has lots more complexities, as you note, like having to keep track of how many things are in each page, re-shuffling if new values are inserted, etc. That seems significantly more complex. It might have some slight speed advantages (or disadvantages!) at extremely high throughput, and the only way to really know that would be to try it out. If you don't have time to build it both ways and compare, my advice would be to start with the simplest option (one row per user+value). Start simple & iterate! :)

(Let me know if I've misunderstood your situation.)

Ian

On Apr 2, 2012, at 3:10 PM, Derek Wollenstein wrote:

We're looking at how to store a large amount of (per-user) list data in
hbase, and we were trying to figure out what kind of access pattern made
the most sense.  One option is store the majority of the data in a key, so
we could have something like
<FixedWidthUserName><FixedWidthValueId1>:"" (no value)
<FixedWidthUserName><FixedWidthValueId2>:"" (no value)
<FixedWidthUserName><FixedWidthValueId3>:"" (no value)

The other option we had was to do this entirely using
<FixedWidthUserName><FixedWidthPageNum0>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
<FixedWidthUserName><FixedWidthPageNum1>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...

where each row would contain multiple values.
So in one case reading the first thirty values would be
scan { STARTROW => 'FixedWidthUserName', LIMIT => 30 }
And in the second case it would be
get 'FixedWidthUserName\x00\x00\x00\x00'

The general usage pattern would be to read only the first 30 values of
these lists, with infrequent access reading deeper into the lists.  Some
users would have <= 30 total values in these lists, and some users would
have millions (i.e. power-law distribution)

  The single-value format seems like it would take up more space on hbase,
but would offer some improved retrieval / pagination flexibility.  Would
there be any significant performance advantages to be able to paginate via
gets vs paginating with scans?
   My initial understanding was that doing a scan should be faster if our
paging size is unknown (and caching is set appropriately), but that gets
should be faster if we'll always need the same page size.  I've ended up
hearing different people tell me opposite things about performance.  I
assume the page sizes would be relatively consistent, so for most use cases
we could guarantee that we only wanted one page of data in the
fixed-page-length case.  I would also assume that we would have infrequent
updates, but may have inserts into the middle of these lists (meaning we'd
need to update all subsequent rows).

Thanks for help / suggestions / followup questions

--Derek