Posted to user@hbase.apache.org by Ric Wang <wq...@gmail.com> on 2009/06/09 23:10:01 UTC

scanner on a given column: whole table scan or just the rows that have values

Hi,

My HBase table has millions of rows, and on a given column (e.g.
familyA:labelB), only a couple of thousand rows really have values (sparse).
Now my task is to find the set of row keys whose column value of
"familyA:labelB" satisfies some kind of condition.

For that task, I am getting a scanner on the column "familyA:labelB";
looping over the values of that column (I guess I'd be better off using some
kind of filter instead, but regardless...); if the value matches my
condition, I get the corresponding row key and add it into the result set.

My questions are:

1. When the scanner loops over the column, is it scanning the whole table of
millions of rows, or mostly just the ones that really have values for that
particular column? My guess is that it's NOT scanning the whole table, per my
very limited understanding of how column-based databases work; it seems that'd
be awfully inefficient. Can someone please let me know?

2. If, in the unfortunate case, a whole table scan does have to happen,
any suggestions on how I could change my table design (adding index..?) to
avoid the performance hit?

Thanks very much for your help!
Ric

Re: scanner on a given column: whole table scan or just the rows that have values

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
You might look into the API for these packages:
org.apache.hadoop.hbase.regionserver.tableindexed
org.apache.hadoop.hbase.client.tableindexed
http://hadoop.apache.org/hbase/docs/r0.19.3/api/index.html

I don't know much about them since I have never used them, but I think they
allow an index on columns.
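
To make the index idea concrete, here is a minimal sketch in plain Python. It does not use the tableindexed packages mentioned above or any HBase API; the table contents, the `build_index` helper, and the index layout (column value mapped to row keys) are all hypothetical and only illustrate why a lookup through an index avoids touching every row.

```python
from collections import defaultdict

# Hypothetical main table: row key -> {column: value}. Most rows lack
# "familyA:labelB", as in the sparse table described in this thread.
main_table = {
    "row1": {"familyA:labelA": "x"},
    "row2": {"familyA:labelA": "y", "familyA:labelB": "red"},
    "row3": {"familyA:labelA": "z"},
    "row4": {"familyA:labelB": "blue"},
    "row5": {"familyA:labelB": "red"},
}

def build_index(table, column):
    """Build a value -> [row keys] mapping for one column (assumed layout)."""
    index = defaultdict(list)
    for row_key, cells in table.items():
        if column in cells:
            index[cells[column]].append(row_key)
    return index

# The index is built once (or maintained on write); afterwards a lookup
# by value does not need to walk the main table at all.
index = build_index(main_table, "familyA:labelB")
rows_with_red = sorted(index.get("red", []))
print(rows_with_red)
```

The cost moves to write time: every put on the indexed column would also have to update the index, which is the usual trade-off with secondary indexes.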

Billy


"Naveen Koorakula" <na...@gmail.com> wrote 
in message 
news:5b9fff10906100150m5a549d65h3ca440af3a37e2d5@mail.gmail.com...
> That's correct - if you meant "it will have to scan EACH row in that 
> column
> family with atleast one non-empty cell".
>
> From http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture:
> "Each column family in a region is managed by an *HStore*. Each HStore may
> have one or more *MapFiles* (a Hadoop HDFS file type) that is very similar
> to a Google *SSTable*. Like SSTables, MapFiles are immutable once closed.
> MapFiles are stored in the Hadoop HDFS."
>
> The way to think of this would be that each column family in the table has
> its own file. The entries in the file look like:
> key:family:label:timestamp value
>
> Since only non-empty table cells are stored in this file, when you're
> scanning, you only are looking at all the rows that have non-empty values
> for atleast one column label in the column family in question.
>
> For eg: assuming a column family "cf", the Mapfile for column family "cf"
> might look like
>
> rowkey1 cf:label1 timestamp1 value1
> rowkey1 cf:label2 timestamp2 value2
> rowkey2 cf:label1 timestamp3 value3
> rowkey4 cf:label3 timestamp4 value4
>
> Even if the scanner is looking for "cf:label2", it will still have to go
> over the entire Mapfile to find these entries. That means it still has to
> scan through and discard all the cf:label1 and cf:label3 entries to get to
> the cf:label2 entries. (Note that in the above example, rowkey3 did not 
> have
> a cf:labelX entry, therefore the scanner did not have to scan through that
> row, even if rowkey3 did have values for other columns in the table)
>
> I would recommend reading through the Bigtable paper to understand the 
> data
> model. (Caveat: HBase does deviate slightly from the Bigtable data model -
> no access groups)
>
> Naveen
>
> On Tue, Jun 9, 2009 at 11:22 PM, Ric Wang 
> <wq...@gmail.com> wrote:
>
>> Billy,
>>
>> Thank you, it's clearer to me now. But WITHIN the one family where the
>> column-label that needs to be scanned over lives (since I only have one
>> family for the entire table), it will still have to scan EVERY row in 
>> that
>> family no matter if each cell on that column-label has value or not?
>>
>> -Ric
>>
>>
>> On Wed, Jun 10, 2009 at 1:03 AM, Billy Pearson
>> <sa...@pearsonwholesale.com>wrote:
>>
>> > It will not scan every row if there is more then one column family only
>> the
>> > rows that have data for that column.
>> >
>> > You do have parallelism when scanning large tables the mr job should be
>> > splitting the job in to one mapper per region
>> > if coded setup correctly. New patches in dev set for 0.20 will allow 
>> > more
>> > mappers per region speeding up this in some cases.
>> >
>> > Row-based database can have index but they do not scale well index
>> require
>> > more memory
>> > Hbase is designed to be Distributed parallel fault tolerant that scales
>> > easy from 1 to hundreds to thousands of servers
>> >
>> > Billy
>> >
>> >
>> >
>> > "Ric Wang" <wq...@gmail.com> wrote in 
>> > message
>> > news:21224f560906092144o703e9292o1587a74cceae2a3@mail.gmail.com...
>> >
>> >  Hi,
>> >>
>> >> Thanks. But if it is still scanning EVERY row in the entire table, how
>> >> does
>> >> HBase achieve better scan performance, compared to a row-based 
>> >> database?
>> >>
>> >> Thanks,
>> >> Ric
>> >>
>> >>
>> >>
>> >> On Tue, Jun 9, 2009 at 9:35 PM, Ryan Rawson 
>> >> <ry...@gmail.com> wrote:
>> >>
>> >>  Without the use of indexes, there is no easy way to get the info
>> without
>> >>> touching every row.
>> >>>
>> >>> So yes you'll be scanning every row.  But hbase has good bulk scan
>> perf.
>> >>>
>> >>> On Jun 9, 2009 7:24 PM, "Ric Wang" 
>> >>> <wq...@gmail.com> wrote:
>> >>>
>> >>> How does the scanner know how to get ONLY the "relevant" rows, 
>> >>> without
>> a
>> >>> whole table scan?
>> >>>
>> >>> Thanks!
>> >>> Ric
>> >>>
>> >>> On Tue, Jun 9, 2009 at 4:31 PM, Naveen Koorakula 
>> >>> <na...@gmail.com>
>> >>> wrote:
>> >>> > The scanner only s...
>> >>> --
>> >>>
>> >>> Ric Wang wqt.work@gmail.com
>> >>>
>> >>>
>> >>
>> >>
>> >> --
>> >> Ric Wang
>> >> wqt.work@gmail.com
>> >>
>> >>
>> >
>> >
>>
>>
>> --
>> Ric Wang
>> wqt.work@gmail.com
>>
> 



Re: scanner on a given column: whole table scan or just the rows that have values

Posted by Naveen Koorakula <na...@gmail.com>.
That's correct - if you meant "it will have to scan EACH row in that column
family with at least one non-empty cell".

From http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture:
"Each column family in a region is managed by an *HStore*. Each HStore may
have one or more *MapFiles* (a Hadoop HDFS file type) that is very similar
to a Google *SSTable*. Like SSTables, MapFiles are immutable once closed.
MapFiles are stored in the Hadoop HDFS."

The way to think of this would be that each column family in the table has
its own file. The entries in the file look like:
key:family:label:timestamp value

Since only non-empty table cells are stored in this file, when you're
scanning, you are only looking at the rows that have non-empty values
for at least one column label in the column family in question.

For example, assuming a column family "cf", the MapFile for column family "cf"
might look like:

rowkey1 cf:label1 timestamp1 value1
rowkey1 cf:label2 timestamp2 value2
rowkey2 cf:label1 timestamp3 value3
rowkey4 cf:label3 timestamp4 value4

Even if the scanner is looking for "cf:label2", it will still have to go
over the entire MapFile to find these entries. That means it still has to
scan through and discard all the cf:label1 and cf:label3 entries to get to
the cf:label2 entries. (Note that in the above example, rowkey3 did not have
a cf:labelX entry, so the scanner did not have to scan through that row,
even if rowkey3 did have values for other columns in the table.)
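
The behaviour described above can be sketched in a few lines of plain Python. This is only a model of the idea, not HBase code: the `Cell` tuple, the `CF_STORE` list, and `scan_column` are made-up names, and the list stands in for the sorted entries of one column family's file.

```python
from typing import NamedTuple

class Cell(NamedTuple):
    row: str
    column: str      # "family:label"
    timestamp: int
    value: str

# Hypothetical contents of the file for family "cf", sorted by (row, column),
# mirroring the four entries in the example above. rowkey3 has no entry
# because it has no non-empty cell in this family.
CF_STORE = [
    Cell("rowkey1", "cf:label1", 1, "value1"),
    Cell("rowkey1", "cf:label2", 2, "value2"),
    Cell("rowkey2", "cf:label1", 3, "value3"),
    Cell("rowkey4", "cf:label3", 4, "value4"),
]

def scan_column(store, column):
    """Return (matching cells, number of entries examined)."""
    examined = 0
    matches = []
    for cell in store:
        examined += 1              # every stored entry in the family is read
        if cell.column == column:  # entries for other labels are discarded
            matches.append(cell)
    return matches, examined

matches, examined = scan_column(CF_STORE, "cf:label2")
print([c.row for c in matches], examined)
```

Under these assumptions the scan reads all four stored entries to find the single cf:label2 cell, but it never sees rowkey3, because nothing was stored for that row in this family.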

I would recommend reading through the Bigtable paper to understand the data
model. (Caveat: HBase does deviate slightly from the Bigtable data model -
no access groups)

Naveen

On Tue, Jun 9, 2009 at 11:22 PM, Ric Wang <wq...@gmail.com> wrote:

> Billy,
>
> Thank you, it's clearer to me now. But WITHIN the one family where the
> column-label that needs to be scanned over lives (since I only have one
> family for the entire table), it will still have to scan EVERY row in that
> family no matter if each cell on that column-label has value or not?
>
> -Ric
>
>
> On Wed, Jun 10, 2009 at 1:03 AM, Billy Pearson
> <sa...@pearsonwholesale.com>wrote:
>
> > It will not scan every row if there is more then one column family only
> the
> > rows that have data for that column.
> >
> > You do have parallelism when scanning large tables the mr job should be
> > splitting the job in to one mapper per region
> > if coded setup correctly. New patches in dev set for 0.20 will allow more
> > mappers per region speeding up this in some cases.
> >
> > Row-based database can have index but they do not scale well index
> require
> > more memory
> > Hbase is designed to be Distributed parallel fault tolerant that scales
> > easy from 1 to hundreds to thousands of servers
> >
> > Billy
> >
> >
> >
> > "Ric Wang" <wq...@gmail.com> wrote in message
> > news:21224f560906092144o703e9292o1587a74cceae2a3@mail.gmail.com...
> >
> >  Hi,
> >>
> >> Thanks. But if it is still scanning EVERY row in the entire table, how
> >> does
> >> HBase achieve better scan performance, compared to a row-based database?
> >>
> >> Thanks,
> >> Ric
> >>
> >>
> >>
> >> On Tue, Jun 9, 2009 at 9:35 PM, Ryan Rawson <ry...@gmail.com> wrote:
> >>
> >>  Without the use of indexes, there is no easy way to get the info
> without
> >>> touching every row.
> >>>
> >>> So yes you'll be scanning every row.  But hbase has good bulk scan
> perf.
> >>>
> >>> On Jun 9, 2009 7:24 PM, "Ric Wang" <wq...@gmail.com> wrote:
> >>>
> >>> How does the scanner know how to get ONLY the "relevant" rows, without
> a
> >>> whole table scan?
> >>>
> >>> Thanks!
> >>> Ric
> >>>
> >>> On Tue, Jun 9, 2009 at 4:31 PM, Naveen Koorakula <na...@gmail.com>
> >>> wrote:
> >>> > The scanner only s...
> >>> --
> >>>
> >>> Ric Wang wqt.work@gmail.com
> >>>
> >>>
> >>
> >>
> >> --
> >> Ric Wang
> >> wqt.work@gmail.com
> >>
> >>
> >
> >
>
>
> --
> Ric Wang
> wqt.work@gmail.com
>

Re: scanner on a given column: whole table scan or just the rows that have values

Posted by Ric Wang <wq...@gmail.com>.
Billy,

Thank you, it's clearer to me now. But WITHIN the one family where the
column-label that needs to be scanned over lives (since I only have one
family for the entire table), it will still have to scan EVERY row in that
family no matter if each cell on that column-label has value or not?

-Ric


On Wed, Jun 10, 2009 at 1:03 AM, Billy Pearson
<sa...@pearsonwholesale.com>wrote:

> It will not scan every row if there is more then one column family only the
> rows that have data for that column.
>
> You do have parallelism when scanning large tables the mr job should be
> splitting the job in to one mapper per region
> if coded setup correctly. New patches in dev set for 0.20 will allow more
> mappers per region speeding up this in some cases.
>
> Row-based database can have index but they do not scale well index require
> more memory
> Hbase is designed to be Distributed parallel fault tolerant that scales
> easy from 1 to hundreds to thousands of servers
>
> Billy
>
>
>
> "Ric Wang" <wq...@gmail.com> wrote in message
> news:21224f560906092144o703e9292o1587a74cceae2a3@mail.gmail.com...
>
>  Hi,
>>
>> Thanks. But if it is still scanning EVERY row in the entire table, how
>> does
>> HBase achieve better scan performance, compared to a row-based database?
>>
>> Thanks,
>> Ric
>>
>>
>>
>> On Tue, Jun 9, 2009 at 9:35 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>
>>  Without the use of indexes, there is no easy way to get the info without
>>> touching every row.
>>>
>>> So yes you'll be scanning every row.  But hbase has good bulk scan perf.
>>>
>>> On Jun 9, 2009 7:24 PM, "Ric Wang" <wq...@gmail.com> wrote:
>>>
>>> How does the scanner know how to get ONLY the "relevant" rows, without a
>>> whole table scan?
>>>
>>> Thanks!
>>> Ric
>>>
>>> On Tue, Jun 9, 2009 at 4:31 PM, Naveen Koorakula <na...@gmail.com>
>>> wrote:
>>> > The scanner only s...
>>> --
>>>
>>> Ric Wang wqt.work@gmail.com
>>>
>>>
>>
>>
>> --
>> Ric Wang
>> wqt.work@gmail.com
>>
>>
>
>


-- 
Ric Wang
wqt.work@gmail.com

Re: scanner on a given column: whole table scan or just the rows that have values

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
It will not scan every row if there is more than one column family, only the
rows that have data for that column.

You do have parallelism when scanning large tables: the MR job should be
splitting the work into one mapper per region if it is set up correctly.
New patches in dev, slated for 0.20, will allow more mappers per region,
speeding this up in some cases.

Row-based databases can have indexes, but they do not scale well, and
indexes require more memory.
HBase is designed to be distributed, parallel, and fault tolerant, and it
scales easily from one to hundreds to thousands of servers.
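
As a rough illustration of the one-mapper-per-region idea, here is a sketch in plain Python. It is not a MapReduce job and uses no HBase or Hadoop API; the `regions` list, `scan_region`, and `parallel_scan` are hypothetical, and threads stand in for mappers running on separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical regions: each holds a contiguous, sorted range of row keys
# together with the value of the column being checked.
regions = [
    [("row01", "a"), ("row02", "match")],
    [("row03", "b"), ("row04", "c")],
    [("row05", "match"), ("row06", "d")],
]

def scan_region(region):
    """Stand-in for one mapper: scan a single region, keep matching row keys."""
    return [row_key for row_key, value in region if value == "match"]

def parallel_scan(all_regions):
    """Scan every region concurrently and merge the results in region order."""
    with ThreadPoolExecutor(max_workers=len(all_regions)) as pool:
        per_region = list(pool.map(scan_region, all_regions))
    return [row_key for hits in per_region for row_key in hits]

result = parallel_scan(regions)
print(result)
```

Every entry is still examined, but each worker only walks its own region, so the wall-clock time depends on the largest region rather than on the whole table.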

Billy



"Ric Wang" <wq...@gmail.com> wrote in 
message news:21224f560906092144o703e9292o1587a74cceae2a3@mail.gmail.com...
> Hi,
>
> Thanks. But if it is still scanning EVERY row in the entire table, how 
> does
> HBase achieve better scan performance, compared to a row-based database?
>
> Thanks,
> Ric
>
>
>
> On Tue, Jun 9, 2009 at 9:35 PM, Ryan Rawson 
> <ry...@gmail.com> wrote:
>
>> Without the use of indexes, there is no easy way to get the info without
>> touching every row.
>>
>> So yes you'll be scanning every row.  But hbase has good bulk scan perf.
>>
>> On Jun 9, 2009 7:24 PM, "Ric Wang" 
>> <wq...@gmail.com> wrote:
>>
>> How does the scanner know how to get ONLY the "relevant" rows, without a
>> whole table scan?
>>
>> Thanks!
>> Ric
>>
>> On Tue, Jun 9, 2009 at 4:31 PM, Naveen Koorakula 
>> <na...@gmail.com>
>> wrote:
>> > The scanner only s...
>> --
>>
>> Ric Wang wqt.work@gmail.com
>>
>
>
>
> -- 
> Ric Wang
> wqt.work@gmail.com
> 



Re: scanner on a given column: whole table scan or just the rows that have values

Posted by Ryan Rawson <ry...@gmail.com>.
Via parallelism? Just add machines. Also, a simpler on-disk format with
immutable files allows for rapid scanning without concurrency issues during
writes.

On Jun 9, 2009 9:44 PM, "Ric Wang" <wq...@gmail.com> wrote:

Hi,

Thanks. But if it is still scanning EVERY row in the entire table, how does
HBase achieve better scan performance, compared to a row-based database?

Thanks,
Ric

On Tue, Jun 9, 2009 at 9:35 PM, Ryan Rawson <ry...@gmail.com> wrote: >
Without the use of ind...
--

Ric Wang wqt.work@gmail.com

Re: scanner on a given column: whole table scan or just the rows that have values

Posted by Ric Wang <wq...@gmail.com>.
Hi,

Thanks. But if it is still scanning EVERY row in the entire table, how does
HBase achieve better scan performance, compared to a row-based database?

Thanks,
Ric



On Tue, Jun 9, 2009 at 9:35 PM, Ryan Rawson <ry...@gmail.com> wrote:

> Without the use of indexes, there is no easy way to get the info without
> touching every row.
>
> So yes you'll be scanning every row.  But hbase has good bulk scan perf.
>
> On Jun 9, 2009 7:24 PM, "Ric Wang" <wq...@gmail.com> wrote:
>
> How does the scanner know how to get ONLY the "relevant" rows, without a
> whole table scan?
>
> Thanks!
> Ric
>
> On Tue, Jun 9, 2009 at 4:31 PM, Naveen Koorakula <na...@gmail.com>
> wrote:
> > The scanner only s...
> --
>
> Ric Wang wqt.work@gmail.com
>



-- 
Ric Wang
wqt.work@gmail.com

Re: scanner on a given column: whole table scan or just the rows that have values

Posted by Ryan Rawson <ry...@gmail.com>.
Without the use of indexes, there is no easy way to get the info without
touching every row.

So yes you'll be scanning every row.  But hbase has good bulk scan perf.

On Jun 9, 2009 7:24 PM, "Ric Wang" <wq...@gmail.com> wrote:

How does the scanner know how to get ONLY the "relevant" rows, without a
whole table scan?

Thanks!
Ric

On Tue, Jun 9, 2009 at 4:31 PM, Naveen Koorakula <na...@gmail.com> wrote:
> The scanner only s...
--

Ric Wang wqt.work@gmail.com

Re: scanner on a given column: whole table scan or just the rows that have values

Posted by Ric Wang <wq...@gmail.com>.
How does the scanner know how to get ONLY the "relevant" rows, without a
whole table scan?

Thanks!
Ric



On Tue, Jun 9, 2009 at 4:31 PM, Naveen Koorakula <na...@gmail.com> wrote:

> The scanner only scans the relevant rows.
>
> On Tue, Jun 9, 2009 at 2:10 PM, Ric Wang <wq...@gmail.com> wrote:
>
> > Hi,
> >
> > My HBase table has millions of rows; and on given column (ex.
> > famliyA:labelB), only a couple of thousand rows really have values
> > (sparse).
> > Now my task is to find out the set of row keys whose column value of
> > "familyA:labelB" satisfy some kind of condition.
> >
> > For that task, I am getting a scanner on the column "familyA:labelB";
> > looping over the values of that column (I guess I'd better off using some
> > kind of filter instead, but regardless...); if the value matches my
> > condition, I get the corresponding row key and add it into the result
> set.
> >
> > My questions are:
> >
> > 1. When the scanner loops over the column, is it scanning the whole table
> > of
> > millions of rows, or mostly just the ones that really have values for
> that
> > particular column? My guess is that it's NOT scanning the whole table per
> > my
> > very limited understanding of how column-based database works; seems
> that'd
> > be awfully inefficient. Can someone please let me know?
> >
> > 2. If in the unfortunate case, that whole table scan does have to happen,
> > any suggestions on how I could change my table design (adding index..?)
> to
> > avoid the performance hit?
> >
> > Thanks very much for your help!
> > Ric
> >
>



-- 
Ric Wang
wqt.work@gmail.com

Re: scanner on a given column: whole table scan or just the rows that have values

Posted by Naveen Koorakula <na...@gmail.com>.
The scanner only scans the relevant rows.

On Tue, Jun 9, 2009 at 2:10 PM, Ric Wang <wq...@gmail.com> wrote:

> Hi,
>
> My HBase table has millions of rows; and on given column (ex.
> famliyA:labelB), only a couple of thousand rows really have values
> (sparse).
> Now my task is to find out the set of row keys whose column value of
> "familyA:labelB" satisfy some kind of condition.
>
> For that task, I am getting a scanner on the column "familyA:labelB";
> looping over the values of that column (I guess I'd better off using some
> kind of filter instead, but regardless...); if the value matches my
> condition, I get the corresponding row key and add it into the result set.
>
> My questions are:
>
> 1. When the scanner loops over the column, is it scanning the whole table
> of
> millions of rows, or mostly just the ones that really have values for that
> particular column? My guess is that it's NOT scanning the whole table per
> my
> very limited understanding of how column-based database works; seems that'd
> be awfully inefficient. Can someone please let me know?
>
> 2. If in the unfortunate case, that whole table scan does have to happen,
> any suggestions on how I could change my table design (adding index..?) to
> avoid the performance hit?
>
> Thanks very much for your help!
> Ric
>

RE: scanner on a given column: whole table scan or just the rows that have values

Posted by Br...@nokia.com.
My guess is that the scanner actually does examine every row. As you suggest, adding a RowFilter would be the way to go here.  This way, you're certain to get back only those rows that match the criteria expressed in the RowFilter.
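
A small sketch of what such a filter buys you, in plain Python rather than the HBase filter API. The `cells` data, `filtered_scan`, and the predicate are hypothetical; the point is only that a filter narrows what is returned to the client, not what the scan has to examine.

```python
# Hypothetical stored entries: (row key, column, value).
cells = [
    ("row1", "familyA:labelA", "10"),
    ("row2", "familyA:labelB", "7"),
    ("row3", "familyA:labelA", "3"),
    ("row4", "familyA:labelB", "42"),
    ("row5", "familyA:labelB", "15"),
]

def filtered_scan(entries, column, predicate):
    """Return (row keys passing the filter, number of entries examined)."""
    examined = 0
    returned = []
    for row_key, col, value in entries:
        examined += 1  # the scan still visits every stored entry
        if col == column and predicate(value):
            returned.append(row_key)  # only matches go back to the client
    return returned, examined

# Example condition: the value, read as an integer, is greater than 10.
returned, examined = filtered_scan(
    cells, "familyA:labelB", lambda v: int(v) > 10
)
print(returned, examined)
```

In this model all five entries are examined and two row keys come back, which matches the guess above: the filter saves client-side work and network transfer, while the scan cost itself stays the same.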

-brian

________________________________________
From: ext Ric Wang [wqt.work@gmail.com]
Sent: Tuesday, June 09, 2009 5:10 PM
To: hbase-user@hadoop.apache.org
Subject: scanner on a given column: whole table scan or just the rows that have values

Hi,

My HBase table has millions of rows; and on given column (ex.
famliyA:labelB), only a couple of thousand rows really have values (sparse).
Now my task is to find out the set of row keys whose column value of
"familyA:labelB" satisfy some kind of condition.

For that task, I am getting a scanner on the column "familyA:labelB";
looping over the values of that column (I guess I'd better off using some
kind of filter instead, but regardless...); if the value matches my
condition, I get the corresponding row key and add it into the result set.

My questions are:

1. When the scanner loops over the column, is it scanning the whole table of
millions of rows, or mostly just the ones that really have values for that
particular column? My guess is that it's NOT scanning the whole table per my
very limited understanding of how column-based database works; seems that'd
be awfully inefficient. Can someone please let me know?

2. If in the unfortunate case, that whole table scan does have to happen,
any suggestions on how I could change my table design (adding index..?) to
avoid the performance hit?

Thanks very much for your help!
Ric