Posted to user@hbase.apache.org by SiMaYunRui <my...@hotmail.com> on 2014/07/19 17:23:50 UTC

How to limit columns returned by a single row in HBase

Hi experts,



I have a wide, flat table. During a scan, how can I limit the number of columns returned per row, rather than across all rows (which is what ColumnCountGetFilter does)? I need to scan multiple rows at the same time and do the aggregation on the client side.

For more background: I am designing an auditing tool that records the pattern “(who) operates against (what) at (when)”. A typical search is: given a time range from "2014/6/14 13:45" to "2014/6/24 7:15", list all files (the (what) part, matched by a starts-with search) that were operated on, in descending order of (when).

I have tens of millions of records per day and keep them for 30-90 days. So I am considering two designs: a) rowkey as (file name)_(reverse of when). The problem is that people want a starts-with search that matches multiple files; the scan then has to go through all matching files, which could be a huge set, and the client has to re-order them just to display the top 500 records. That could be very slow.

b) a wide, flat table with rowkey (file_name)_(reverse of when, truncated to a day for partitioning), and qualifier (reverse of when). In my opinion this design can exploit the fact that qualifiers are stored in order, so it needs fewer searches than #a. But I cannot put all operations on a single file into one row, because the total count might exceed several million.

So I am thinking of grouping data into the following shape using #b. Back to my original question: since I only need 500 records, if the row (file A)_(2014/06/14) contains more than that number, can I stop it and continue scanning the next row? And if I already have enough from (file A)_(2014/06/14), can I skip (file A)_(2014/06/13) and continue with (file B)_(2014/06/14), which is a different file?

Row: (file A)_(2014/06/14) 

   d:1341069600 value 

   d:1341069500 value 

   d:1341069400 value

Row: (file A)_(2014/06/13) 

   d:1341059600 value 

   d:1341059500 value 

   d:1341059400 value

Row: (file B)_(2014/06/14) 

   d:1341069700 value 

   d:1341069580 value 

   d:1341069401 value
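The reversed-(when) encoding in design #b can be sketched in plain Java. This is only an illustration of the ordering trick, not code from the thread; the fixed-width formatting and helper names are assumptions (the exact qualifier values above are left as-is):

```java
import java.util.TreeSet;

class ReverseKeySketch {
    // Hypothetical helper: encode a timestamp so that lexicographic string
    // order equals descending time order. Fixed 19-digit width keeps the
    // lexicographic comparison consistent with numeric comparison.
    static String reverseTs(long epochSeconds) {
        return String.format("%019d", Long.MAX_VALUE - epochSeconds);
    }

    // Row key for design #b: (file name)_(reversed day bucket).
    static String rowKey(String fileName, long dayEpochSeconds) {
        return fileName + "_" + reverseTs(dayEpochSeconds);
    }

    public static void main(String[] args) {
        // TreeSet sorts lexicographically, like HBase sorts qualifiers.
        TreeSet<String> qualifiers = new TreeSet<>();
        qualifiers.add(reverseTs(1341069400L));
        qualifiers.add(reverseTs(1341069500L));
        qualifiers.add(reverseTs(1341069600L));
        // The first element corresponds to the newest timestamp.
        System.out.println(qualifiers.first().equals(reverseTs(1341069600L))); // prints "true"
    }
}
```

With this encoding, a scan over one day-bucket row naturally yields the newest operations first, which matches the DESC-by-(when) requirement.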






Sent from Windows Mail

Re: How to limit columns returned by a single row in HBase

Posted by Ted Yu <yu...@gmail.com>.
You can write your own filter, based on ColumnCountGetFilter, but without
overriding the filterAllRemaining() method.

In the filterKeyValue() method, when the count exceeds the limit, return
NEXT_ROW.

Your filter can remember the file prefix of the previous row. If the file
prefix of the current row is the same as that of the previous row, return
NEXT_ROW from filterRowKey().
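One way to model that decision logic in plain Java (a real implementation would extend HBase's FilterBase and return its ReturnCode values; all names here are illustrative, and the "row satisfied" flag is one interpretation of when same-prefix rows should be skipped):

```java
import java.util.Objects;

// Plain-Java model of the per-row limit plus same-file-prefix skipping.
class PerRowLimitModel {
    enum Decision { INCLUDE, NEXT_ROW, SKIP_ROW }

    private final int limit;          // max columns to keep per row
    private int count = 0;            // columns seen in the current row
    private boolean rowSatisfied = false;
    private String prevPrefix = null; // file prefix of the previous row

    PerRowLimitModel(int limit) { this.limit = limit; }

    // Assumed rowkey layout from the thread: (file name)_(reversed day).
    static String filePrefix(String rowKey) {
        int i = rowKey.lastIndexOf('_');
        return i < 0 ? rowKey : rowKey.substring(0, i);
    }

    // Analogue of filterRowKey(): skip a row whose file prefix matches a
    // previous row that already hit the limit.
    Decision onRow(String rowKey) {
        String prefix = filePrefix(rowKey);
        if (rowSatisfied && Objects.equals(prefix, prevPrefix)) {
            return Decision.SKIP_ROW;
        }
        prevPrefix = prefix;
        count = 0;
        rowSatisfied = false;
        return Decision.INCLUDE;
    }

    // Analogue of filterKeyValue(): once the count exceeds the limit, jump
    // to the next row -- but never end the whole scan.
    Decision onColumn() {
        count++;
        if (count > limit) {
            rowSatisfied = true;
            return Decision.NEXT_ROW;
        }
        return Decision.INCLUDE;
    }
}
```

Keeping the skip conditional on the row actually hitting the limit lets a file with fewer records than the limit still accumulate results from its other day-bucket rows.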

Cheers


On Sat, Jul 19, 2014 at 8:23 AM, SiMaYunRui <my...@hotmail.com> wrote:

> [quoted text of the original message trimmed]

Re: How to limit columns returned by a single row in HBase

Posted by Ted Yu <yu...@gmail.com>.
If I understand SiMa's use case correctly, after the top records for (file
A)_ are returned, (file B)_ would be next. Therefore some kind of
server-side filter is needed to skip the remaining records for (file A)_.

Another (corner) case: for a given file prefix, there may not be as many
records as the preset (per-file) limit.

Cheers


On Sat, Jul 19, 2014 at 1:41 PM, Arun Allamsetty <ar...@gmail.com>
wrote:

> [quoted text of Arun's reply and the original message trimmed]

Re: How to limit columns returned by a single row in HBase

Posted by Arun Allamsetty <ar...@gmail.com>.
Hi,

I have an idea which might just be baloney, but people learn from mistakes
and this is my attempt to learn. If I understand the use case correctly,
you want to get the first 500 records pertaining to a file, based on its
file name. Since you want to limit the number of records returned, I
wouldn't recommend writing each record as a column. Instead, we could
create a composite key consisting of the file name and the timestamp
(epoch), similar to what is described in Flurry - The Delicate Art
of Organizing Data in HBase <http://www.flurry.com/2012/06/12/137492485>.
If you want the latest timestamp first, use Long.MAX_VALUE - timestamp in
the constructor for the composite key. To get the top 500 records for,
say, *fileA* and *2014/06/14*, convert the date to epoch time and build an
instance of the composite key class you created. Then create a *Scan*
object with it as the start row and set *Scan#setMaxResultSize* to 500.
That should give you only the top 500 records, and I believe performance
won't be bad, provided you have the hardware to manage your data volume.
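The composite key above could be sketched like this (a hypothetical class; the actual byte layout of a real rowkey, and how it is serialized for the Scan start row, are out of scope here):

```java
// Hypothetical composite key mirroring the suggestion: file name plus
// (Long.MAX_VALUE - timestamp), so newer events sort first per file.
class AuditKey implements Comparable<AuditKey> {
    final String fileName;
    final long reversedTs;

    AuditKey(String fileName, long epochSeconds) {
        // Reversing the timestamp makes ascending key order equal
        // descending time order within one file.
        this.fileName = fileName;
        this.reversedTs = Long.MAX_VALUE - epochSeconds;
    }

    @Override
    public int compareTo(AuditKey other) {
        int c = fileName.compareTo(other.fileName);
        return c != 0 ? c : Long.compare(reversedTs, other.reversedTs);
    }
}
```

On the client, one could then iterate the scanner and stop after collecting 500 rows, regardless of how the server-side limit is enforced.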

Experts, please correct me wherever I am wrong.

Thanks,
Arun


On Sat, Jul 19, 2014 at 9:23 AM, SiMaYunRui <my...@hotmail.com> wrote:

> [quoted text of the original message trimmed]