Posted to user@hbase.apache.org by Vaibhav Puranik <vp...@gmail.com> on 2010/06/17 20:14:26 UTC

Sorting columns

Hi all,

We have an HBase table with one column family. Every row can have millions of
columns in that family. Thus the structure is:

rowKey: {data: {string1:0, string2:5, string3:7, string4:89,
string5:56.......}}

Here data is the column family, string1, string2... are the column names,
and the number after each name is the column value.

Is there any way to get the columns sorted by their values? We want to
fetch only the first few columns every time we scan the table. The rest of
the data is needed only occasionally, and we want to avoid shipping it to
the client because it makes our MapReduce job run out of memory.

I couldn't find a way to sort columns by value in the API. We are using
the latest version, 0.20.4, with Hadoop 0.20.2.

Regards,
Vaibhav
GumGum
http://whynosql.com
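
HBase orders the columns of a row by qualifier, not by value, so picking the
few columns with the smallest values has to happen while reading or be built
into the schema. The following is only a minimal client-side sketch, assuming
the columns arrive as a stream of (name, value) pairs. The class and method
names are hypothetical and no HBase API is used. It keeps at most n entries
in memory, however wide the row is:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper, not part of HBase: picks the n columns with the
// smallest values from a stream of (column name, value) pairs.
class TopColumnsByValue {

    static List<Map.Entry<String, Long>> smallest(
            Iterator<Map.Entry<String, Long>> columns, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Max-heap on value: the head is the largest value currently kept,
        // so it is the one to evict when a smaller value shows up.
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(
                Map.Entry.<String, Long>comparingByValue().reversed());
        while (columns.hasNext()) {
            Map.Entry<String, Long> column = columns.next();
            if (heap.size() < n) {
                heap.add(column);
            } else if (column.getValue() < heap.peek().getValue()) {
                heap.poll();
                heap.add(column);
            }
        }
        // Return the survivors in ascending order of value.
        List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
        result.sort(Map.Entry.<String, Long>comparingByValue());
        return result;
    }
}
```

With the example row above and n = 2, this would keep string1 (0) and
string2 (5). The whole row still has to be read once; only the memory held
on the client is bounded.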

Re: Sorting columns

Posted by Andrey Stepachev <oc...@gmail.com>.

2010/6/21 Jonathan Gray <jg...@facebook.com>

> Yes, when using Scan, even on 0.20, everything will be sorted.
>

Good. And is this also the case for intra-row scans? (As I understand it,
sorting is achieved by a merge scan of the stores.)


>
> Re: OOM, you'll need more memory or you'll need to break stuff up across
> rows.  Not much else to be done about that :)
>

But with an intra-row scan I can avoid the OOM (and it works) :). My
question was about the ordering during intra-row scanning.



RE: Sorting columns

Posted by Jonathan Gray <jg...@facebook.com>.

There will be a development release sometime next week but that will not be recommended for production usage.

There is no release date for the full version but I think we're hoping to have a release candidate before the end of July.

> -----Original Message-----
> From: Vaibhav Puranik [mailto:vpuranik@gmail.com]
> Sent: Monday, June 21, 2010 9:48 AM
> To: user@hbase.apache.org
> Subject: Re: Sorting columns
> 
> Jon, Stack,
> 
> Is there a tentative date when this version (with column scanner) is
> coming
> out?
> 
> Vaibhav
> 

Re: Sorting columns

Posted by Vaibhav Puranik <vp...@gmail.com>.

Jon, Stack,

Is there a tentative date when this version (with column scanner) is coming
out?

Vaibhav


RE: Sorting columns

Posted by Jonathan Gray <jg...@facebook.com>.

Yes, when using Scan, even on 0.20, everything will be sorted.

Re: OOM, you'll need more memory or you'll need to break stuff up across rows.  Not much else to be done about that :)


Re: Sorting columns

Posted by Andrey Stepachev <oc...@gmail.com>.

2010/6/19 Jonathan Gray <jg...@facebook.com>

> So there is no confusion, everything is sorted in HBase.  All columns in
> each family are sorted, always.
>

That's good news! Thanks. I had no time (and not enough knowledge of HBase)
to check this myself. Now it's clear (and I always use Scan for now).


>
> There are optimizations for Get queries (in 0.20 but gone in trunk) that
> make it so that what gets returned to the client is not completely sorted
> though it would be mostly sorted.

Is it true that if I use Scan (even when the scan is really a get) in 0.20,
I'll get everything sorted?


> Are you returning millions of columns at once?  Otherwise it shouldn't be
> too expensive to do the sorted() call in the client.
>
I got an OOM when I tried to build an index (I have one index key that points
to 5 million other keys, so I got the OOM on the server). With intra-row
scanning I can scan these columns (mostly in an MR job) to do some work.
After I got the OOM, I changed the schema to use compound keys. It is a bit
complicated to build such keys (instead of a simple LongWritable and friends).
Maybe Avro can help, but I haven't tried it yet. With intra-row scanning the
Result handling gets slightly more complicated (I need to detect the real
key change), but this way is less complicated than compound keys.
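
The key-change detection mentioned above could look roughly like this. It is
a sketch only: RowChunk is a hypothetical stand-in for one batch returned by
an intra-row scan (a row key plus a slice of that row's columns), not an
HBase class, and the example simply counts columns per logical row while
holding one chunk at a time:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one batch of an intra-row scan.
class RowChunk {
    final String rowKey;
    final List<String> qualifiers;

    RowChunk(String rowKey, List<String> qualifiers) {
        this.rowKey = rowKey;
        this.qualifiers = qualifiers;
    }
}

class ChunkedRowReader {

    // Counts columns per logical row; consecutive chunks with the same
    // row key belong to the same row.
    static Map<String, Integer> countColumns(Iterator<RowChunk> chunks) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String currentRow = null;
        int currentCount = 0;
        while (chunks.hasNext()) {
            RowChunk chunk = chunks.next();
            if (currentRow != null && !currentRow.equals(chunk.rowKey)) {
                // The row key changed: the previous logical row is complete.
                counts.put(currentRow, currentCount);
                currentCount = 0;
            }
            currentRow = chunk.rowKey;
            currentCount += chunk.qualifiers.size();
        }
        if (currentRow != null) {
            counts.put(currentRow, currentCount); // flush the last row
        }
        return counts;
    }
}
```

The extra state (the current row key and the final flush) is the added
complexity compared with a scan that returns one complete row per result.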




RE: Sorting columns

Posted by Jonathan Gray <jg...@facebook.com>.

So there is no confusion, everything is sorted in HBase.  All columns in each family are sorted, always.

There are optimizations for Get queries (in 0.20 but gone in trunk) that make it so that what gets returned to the client is not completely sorted though it would be mostly sorted.  Are you returning millions of columns at once?  Otherwise it shouldn't be too expensive to do the sorted() call in the client.


Re: Sorting columns

Posted by Stack <st...@duboce.net>.

On Sat, Jun 19, 2010 at 5:44 AM, Andrey Stepachev <oc...@gmail.com> wrote:
> 2010/6/19 Stack <st...@duboce.net>
> ...  But I should warn, that with this patch
> 0.20.5 will be not wire compatible with 0.20.4 (because this patch adds
> additional field in Scan, and this make Scan binary incompatible).
>

OK.  Thanks for the warning (both you and Dave Latham).  I should have
caught this.  I killed the RC.  Am making a new one.
St.Ack

Re: Sorting columns

Posted by Andrey Stepachev <oc...@gmail.com>.

2010/6/19 Stack <st...@duboce.net>

> On Thu, Jun 17, 2010 at 12:18 PM, Andrey Stepachev <oc...@gmail.com>
> wrote:
> > As i see in sources there no place, where kv sorted (except client
> > Result.sorted() method). So we can get keyvalues from store and from
> > memstore (and in this case we can get 1 3 5 from stores and 4 from
> > memstore) in incorrect order.
> >
> > Or I miss something?
> >
>
> Data is sorted in hbase.  Scanning, we'll be running a scanner against
> each data store element -- memstore and one for each store file -- and
> we'll pop off the elements in order.  Thats the general story.  There
> may once have been a legitimate reason for the client-side sort --
> perhaps when our Get and Scan code paths differed it was needed -- but
> as to whether it still required, I'm not sure.  I'd have to dig.  Any
> one else?
>

It would be very interesting to know whether HBase guarantees the ordering
of columns. Very wide rows are not very useful without sorting (and of
course one should be aware of the partitioning problem with wide rows).
Suppose we want to work with time-series data: we can use dates as
qualifiers and expect the data back in sorted order. We can't sort it
somewhere else, because we would lose most of HBase's advantage.
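
To illustrate the time-data case: HBase compares qualifiers as raw bytes, so
dates only come back in chronological order if their byte encoding sorts the
same way. A minimal sketch, assuming non-negative millisecond timestamps
encoded as 8 big-endian bytes. The class is hypothetical, uses only the JDK
(Arrays.compareUnsigned needs Java 9 or later), and a TreeMap stands in for
the sorted store:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration: timestamps used as column qualifiers.
class TimeQualifiers {

    // 8-byte big-endian encoding. For non-negative values, unsigned
    // lexicographic byte order equals numeric order.
    static byte[] encode(long timestampMillis) {
        return ByteBuffer.allocate(Long.BYTES).putLong(timestampMillis).array();
    }

    static long decode(byte[] qualifier) {
        return ByteBuffer.wrap(qualifier).getLong();
    }

    // Returns the timestamps in the order a byte-wise sorted store
    // would yield them.
    static List<Long> inStoredOrder(long... timestamps) {
        Comparator<byte[]> byBytes = Arrays::compareUnsigned;
        TreeMap<byte[], Long> columns = new TreeMap<>(byBytes);
        for (long ts : timestamps) {
            columns.put(encode(ts), ts);
        }
        return new ArrayList<>(columns.values());
    }
}
```

Dates written as decimal strings of varying length, or negative longs, would
not sort chronologically under the same byte-wise comparison.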

>
> >
> >> > The rest of the data needs to be accessed occasionally. We want to
> >> > avoid getting it shipped to the client as it makes our map reduce
> >> > job go out of memory.
> >> >
> >>
> >> You are not using incremental get on a row?  You should be able to get
> >> your big rows piecemeal.
> >>
> > This scanner api changes was not included in 0.20.4 :( (infra row
> > scanner).
> >
>
> Oh.
>
> Sorry about that Andrey.  Somehow we missed your backport of
> HBASE-1537.  I just applied it.  It'll appear in the 0.20.5RC4 I'm
> rolling now.  Please excuse our bungling.
>

Not a problem. I'll wait for 0.20.5. But I should warn that with this patch
0.20.5 will not be wire compatible with 0.20.4 (because this patch adds an
additional field to Scan, which makes Scan binary incompatible).
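
A minimal sketch of why one extra field breaks wire compatibility, assuming
a message that is serialized field by field with java.io.DataOutput. The
class, the field names and the layout are hypothetical and are not the real
Scan format; the point is only that an old reader decodes the new field as
if it were the field that used to sit at that position.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical message layout, NOT the real Scan serialization.
final class WireCompatSketch {

    // "Old" writer: length-prefixed start row, then maxVersions.
    static byte[] writeOld(String startRow, int maxVersions) {
        return write(startRow, null, maxVersions);
    }

    // "New" writer: the same fields, plus a batch field before maxVersions.
    static byte[] writeNew(String startRow, int batch, int maxVersions) {
        return write(startRow, batch, maxVersions);
    }

    private static byte[] write(String startRow, Integer batch, int maxVersions) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            byte[] row = startRow.getBytes(StandardCharsets.UTF_8);
            out.writeInt(row.length);
            out.write(row);
            if (batch != null) {
                out.writeInt(batch); // the additional field
            }
            out.writeInt(maxVersions);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // "Old" reader: knows nothing about batch and reads maxVersions
    // from whatever four bytes follow the start row.
    static int readMaxVersionsOld(byte[] wire) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
            byte[] row = new byte[in.readInt()];
            in.readFully(row);
            return in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Old writer, old reader: maxVersions round-trips.
        System.out.println(readMaxVersionsOld(writeOld("row1", 3)));
        // New writer, old reader: the batch value is misread as maxVersions.
        System.out.println(readMaxVersionsOld(writeNew("row1", 100, 3)));
    }
}
```

In this sketch the old reader does not fail; it silently returns the wrong
value, which is why mixing the two versions on the wire is unsafe.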

Personally, I'm not using the intra-row scanner right now because of the
unknown ordering; I use compound keys instead.
Moreover, intra-row scanning should have a separate API (giving Result the
ability to fetch additional KVs for a given row) to be more usable.

RE: Sorting columns

Posted by Jonathan Gray <jg...@facebook.com>.

You're right, Stack.  In trunk, we no longer need Result.sorted().  Opened HBASE-2753.

With Get queries it used to be possible to have results returned not in fully-sorted order.  Now that gets are scans, results always get returned to the client fully-sorted.

Every file is sorted on disk and the MemStore is sorted in memory. At read time we merge all the files and the MemStore, process the result according to the query parameters in your Get/Scan, and return a sorted list of KVs to the client.
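
A minimal sketch of that read-time merge, assuming each source (store file
or MemStore) is already sorted. It uses plain strings and a heap; the class
and method names are hypothetical, and the real comparison is on KeyValues
rather than on strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustration only: a k-way merge over already-sorted sources.
final class MergeScanSketch {

    // The current head element of one source plus the iterator behind it.
    private static final class Head {
        final String key;
        final Iterator<String> rest;

        Head(String key, Iterator<String> rest) {
            this.key = key;
            this.rest = rest;
        }
    }

    static List<String> merge(List<List<String>> sortedSources) {
        PriorityQueue<Head> heap = new PriorityQueue<>((a, b) -> a.key.compareTo(b.key));
        for (List<String> source : sortedSources) {
            Iterator<String> it = source.iterator();
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            // Pop the smallest head, then refill the heap from the same source.
            Head smallest = heap.poll();
            merged.add(smallest.key);
            if (smallest.rest.hasNext()) {
                heap.add(new Head(smallest.rest.next(), smallest.rest));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> storeFile1 = Arrays.asList("col1", "col5");
        List<String> storeFile2 = Arrays.asList("col3");
        List<String> memstore = Arrays.asList("col4");
        // prints [col1, col3, col4, col5]
        System.out.println(merge(Arrays.asList(storeFile1, storeFile2, memstore)));
    }
}
```

This is the "1 3 5 from stores and 4 from memstore" case from earlier in the
thread: the merged output is in order even though no single source holds
all of the columns.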

JG

> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> Stack
> Sent: Friday, June 18, 2010 11:23 PM
> To: user@hbase.apache.org
> Subject: Re: Sorting columns
> 
> On Thu, Jun 17, 2010 at 12:18 PM, Andrey Stepachev <oc...@gmail.com>
> wrote:
> > As far as I can see in the sources, there is no place where KVs are
> > sorted (except the client-side Result.sorted() method). So we can get
> > KeyValues from the store files and from the memstore (for example, 1 3 5
> > from the store files and 4 from the memstore) in the wrong order.
> >
> > Or am I missing something?
> >
> 
> Data is sorted in HBase.  When scanning, we run a scanner against
> each data store element -- one for the memstore and one for each store
> file -- and pop off the elements in order.  That's the general story.
> There may once have been a legitimate reason for the client-side sort --
> perhaps it was needed when our Get and Scan code paths differed -- but
> as to whether it is still required, I'm not sure.  I'd have to dig.
> Anyone else?
> 
> >
> >> > The rest of the data needs to be accessed occasionally. We want to
> >> > avoid getting it shipped to the client as it makes our map reduce job
> >> > go out of memory.
> >> >
> >>
> >> You are not using incremental get on a row?  You should be able to get
> >> your big rows piecemeal.
> >>
> > These scanner API changes were not included in 0.20.4 :( (intra-row
> > scanner).
> >
> 
> Oh.
> 
> Sorry about that Andrey.  Somehow we missed your backport of
> HBASE-1537.  I just applied it.  It'll appear in the 0.20.5RC4 I'm
> rolling now.  Please excuse our bungling.
> 
> Yours,
> St.Ack

Re: Sorting columns

Posted by Stack <st...@duboce.net>.

On Thu, Jun 17, 2010 at 12:18 PM, Andrey Stepachev <oc...@gmail.com> wrote:
> As far as I can see in the sources, there is no place where KVs are
> sorted (except the client-side Result.sorted() method). So we can get
> KeyValues from the store files and from the memstore (for example, 1 3 5
> from the store files and 4 from the memstore) in the wrong order.
>
> Or am I missing something?
>

Data is sorted in HBase.  When scanning, we run a scanner against
each data store element -- one for the memstore and one for each store
file -- and pop off the elements in order.  That's the general story.
There may once have been a legitimate reason for the client-side sort --
perhaps it was needed when our Get and Scan code paths differed -- but
as to whether it is still required, I'm not sure.  I'd have to dig.
Anyone else?

>
>> > The rest of the data needs to be accessed occasionally. We want to avoid
>> > getting it shipped to the client as it makes our map reduce job go out of
>> > memory.
>> >
>>
>> You are not using incremental get on a row?  You should be able to get
>> your big rows piecemeal.
>>
> These scanner API changes were not included in 0.20.4 :( (intra-row scanner).
>

Oh.

Sorry about that Andrey.  Somehow we missed your backport of
HBASE-1537.  I just applied it.  It'll appear in the 0.20.5RC4 I'm
rolling now.  Please excuse our bungling.

Yours,
St.Ack

Re: Sorting columns

Posted by Andrey Stepachev <oc...@gmail.com>.

2010/6/17 Stack <st...@duboce.net>

> On Thu, Jun 17, 2010 at 11:14 AM, Vaibhav Puranik <vp...@gmail.com>
> wrote:
> > Is there any way to get the columns in the sorted order of its values?
>
> Not by value.
>
> > We want to get the first few columns only every time we scan the table.
>
> Sorted by value?
>
> HBase orders by row, family, qualifier.  Could you put the value into
> the qualifier as a qualifier prefix?  Then HBase will take care of the
> sort for you.
>

As far as I can see in the sources, there is no place where KVs are
sorted (except the client-side Result.sorted() method). So we can get
KeyValues from the store files and from the memstore (for example, 1 3 5
from the store files and 4 from the memstore) in the wrong order.

Or am I missing something?


> > The rest of the data needs to be accessed occasionally. We want to avoid
> > getting it shipped to the client as it makes our map reduce job go out of
> > memory.
> >
>
> You are not using incremental get on a row?  You should be able to get
> your big rows piecemeal.
>
These scanner API changes were not included in 0.20.4 :( (intra-row scanner).

RE: Sorting columns

Posted by Jonathan Gray <jg...@facebook.com>.

Don't confuse the Result API with the Get/Scan APIs.

If you only ask for columns A, B, and C with your Get query, getting the entire familyMap from the Result will only ever include columns A, B, and C since that's all you asked for.

> -----Original Message-----
> From: Vaibhav Puranik [mailto:vpuranik@gmail.com]
> Sent: Thursday, June 17, 2010 12:03 PM
> To: user@hbase.apache.org
> Subject: Re: Sorting columns
> 
> Stack,
> 
> Incremental sounds good. How do you do that? Can you please point me to a
> method/class?
> I only saw methods to get the entire family map.
> 
> Also, is there any way to specify the qualifier sort order - ascending or
> descending?
> 
> Regards,
> Vaibhav
> 
> On Thu, Jun 17, 2010 at 11:35 AM, Stack <st...@duboce.net> wrote:
> 
> > On Thu, Jun 17, 2010 at 11:14 AM, Vaibhav Puranik <vp...@gmail.com>
> > wrote:
> > > Is there any way to get the columns in the sorted order of its
> > > values?
> >
> > Not by value.
> >
> > > We want to get the first few columns only every time we scan the table.
> >
> > Sorted by value?
> >
> > HBase orders by row, family, qualifier.  Could you put the value into
> > the qualifier as a qualifier prefix?  Then HBase will take care of the
> > sort for you.
> >
> > > The rest of the data needs to be accessed occasionally. We want to
> > > avoid getting it shipped to the client as it makes our map reduce job
> > > go out of memory.
> > >
> >
> > You are not using incremental get on a row?  You should be able to get
> > your big rows piecemeal.
> >
> > Good on you Vaibhav,
> > St.Ack
> >

Re: Sorting columns

Posted by Vaibhav Puranik <vp...@gmail.com>.

Stack,

Incremental sounds good. How do you do that? Can you please point me to a
method/class?
I only saw methods to get the entire family map.

Also, is there any way to specify the qualifier sort order - ascending or
descending?

Regards,
Vaibhav

On Thu, Jun 17, 2010 at 11:35 AM, Stack <st...@duboce.net> wrote:

> On Thu, Jun 17, 2010 at 11:14 AM, Vaibhav Puranik <vp...@gmail.com>
> wrote:
> > Is there any way to get the columns in the sorted order of its values?
>
> Not by value.
>
> > We want to get the first few columns only every time we scan the table.
>
> Sorted by value?
>
> HBase orders by row, family, qualifier.  Could you put the value into
> the qualifier as a qualifier prefix?  Then HBase will take care of the
> sort for you.
>
> > The rest of the data needs to be accessed occasionally. We want to avoid
> > getting it shipped to the client as it makes our map reduce job go out of
> > memory.
> >
>
> You are not using incremental get on a row?  You should be able to get
> your big rows piecemeal.
>
> Good on you Vaibhav,
> St.Ack
>

Re: Sorting columns

Posted by Stack <st...@duboce.net>.

On Thu, Jun 17, 2010 at 11:14 AM, Vaibhav Puranik <vp...@gmail.com> wrote:
> Is there any way to get the columns in the sorted order of its values?

Not by value.

> We want to get the first few columns only every time we scan the table.

Sorted by value?

HBase orders by row, family, qualifier.  Could you put the value into
the qualifier as a qualifier prefix?  Then HBase will take care of the
sort for you.
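
A minimal sketch of the qualifier-prefix idea, assuming non-negative values
with a known upper bound so they can be zero-padded to a fixed width. The
class name, the ':' separator and the bound are hypothetical. A TreeSet
stands in for the sorted qualifiers of one row. Subtracting the value from
the bound gives a descending order, since the sort direction itself cannot
be chosen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Illustration only: encode the value into the qualifier so that the
// lexicographic qualifier order matches the numeric value order.
final class ValuePrefixedQualifierSketch {

    // Assumed upper bound for values; ten digits after zero-padding.
    static final long MAX_VALUE = 9_999_999_999L;

    static String encode(long value, String name, boolean descending) {
        if (value < 0 || value > MAX_VALUE) {
            throw new IllegalArgumentException("value out of range: " + value);
        }
        // For descending order, store the distance from the upper bound.
        long sortKey = descending ? MAX_VALUE - value : value;
        return String.format(Locale.ROOT, "%010d:%s", sortKey, name);
    }

    // Returns the names of the first n columns in encoded-qualifier order.
    static List<String> firstNames(Map<String, Long> columns, int n, boolean descending) {
        TreeSet<String> qualifiers = new TreeSet<>();
        for (Map.Entry<String, Long> column : columns.entrySet()) {
            qualifiers.add(encode(column.getValue(), column.getKey(), descending));
        }
        List<String> names = new ArrayList<>();
        for (String qualifier : qualifiers) {
            if (names.size() == n) {
                break;
            }
            // Strip the fixed-width prefix and the separator.
            names.add(qualifier.substring(qualifier.indexOf(':') + 1));
        }
        return names;
    }
}
```

With the values from the original question (string1:0, string2:5, string3:7,
string4:89, string5:56), the first two names are string1 and string2 in
ascending order, and string4 and string5 in descending order. One cost of
this layout is that a column's qualifier changes whenever its value changes,
so an update has to delete the old column and write a new one.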

> The rest of the data needs to be accessed occasionally. We want to avoid
> getting it shipped to the client as it makes our map reduce job go out of
> memory.
>

You are not using incremental get on a row?  You should be able to get
your big rows piecemeal.
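
A minimal sketch of what reading a wide row piecemeal means, using a sorted
in-memory map as a stand-in for one row. The class and method names are
hypothetical and this is not the HBase client API; in releases that include
HBASE-1537 the corresponding client setting is Scan.setBatch(int), which
caps how many columns come back per Result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustration only: page through the sorted columns of one wide row.
final class PiecemealRowSketch {

    // Returns up to batchSize qualifiers that sort strictly after lastSeen.
    // Pass null for lastSeen to start at the beginning of the row.
    static List<String> nextBatch(NavigableMap<String, Long> row, String lastSeen, int batchSize) {
        NavigableMap<String, Long> remaining =
                lastSeen == null ? row : row.tailMap(lastSeen, false);
        List<String> batch = new ArrayList<>();
        for (String qualifier : remaining.keySet()) {
            if (batch.size() == batchSize) {
                break;
            }
            batch.add(qualifier);
        }
        return batch;
    }

    public static void main(String[] args) {
        NavigableMap<String, Long> row = new TreeMap<>();
        for (int i = 1; i <= 5; i++) {
            row.put("string" + i, (long) i);
        }
        String lastSeen = null;
        List<String> batch;
        // Only one batch is held at a time, never the whole row.
        while (!(batch = nextBatch(row, lastSeen, 2)).isEmpty()) {
            System.out.println(batch);
            lastSeen = batch.get(batch.size() - 1);
        }
    }
}
```

Because the columns come back in sorted order, the last qualifier of one
batch is enough to know where the next batch starts.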

Good on you Vaibhav,
St.Ack