Posted to user@hbase.apache.org by "Dai, Kevin" <yu...@ebay.com> on 2014/08/26 09:13:55 UTC

ResultScanner performance

Hi, everyone

My application will hold tens of thousands of ResultScanners open at once to fetch data. Will that hurt performance and network resources?
If so, is there any way to solve it?
Thanks,
Kevin.

Re: ResultScanner performance

Posted by Jianshi Huang <ji...@gmail.com>.

Ah, sure. That's a good idea. I know how to do it now. :)

Thanks for the help.

Jianshi


On Thu, Aug 28, 2014 at 12:29 PM, Ted Yu <yu...@gmail.com> wrote:

> You can enhance ColumnRangeFilter to return the first column in the range.
>
> In its filterKeyValue(Cell kv) method:
>
>     int cmpMax = Bytes.compareTo(buffer, qualifierOffset, qualifierLength,
>         this.maxColumn, 0, this.maxColumn.length);
>     if (this.maxColumnInclusive && cmpMax <= 0 ||
>         !this.maxColumnInclusive && cmpMax < 0) {
>       return ReturnCode.INCLUDE;
>     }
>
> ReturnCode.NEXT_ROW should be returned (for subsequent columns) once
> ReturnCode.INCLUDE is returned for the first column in range.
>
> Cheers



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/

Re: ResultScanner performance

Posted by Ted Yu <yu...@gmail.com>.

You can enhance ColumnRangeFilter to return the first column in the range.

In its filterKeyValue(Cell kv) method:

    int cmpMax = Bytes.compareTo(buffer, qualifierOffset, qualifierLength,
        this.maxColumn, 0, this.maxColumn.length);
    if (this.maxColumnInclusive && cmpMax <= 0 ||
        !this.maxColumnInclusive && cmpMax < 0) {
      return ReturnCode.INCLUDE;
    }

ReturnCode.NEXT_ROW should be returned (for subsequent columns) once
ReturnCode.INCLUDE is returned for the first column in range.
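
A minimal standalone sketch of that per-row decision, for illustration only. It is not the HBase filter itself: the class FirstColumnInRangeSketch, its Code enum, and the helper firstIncluded are hypothetical names, and the sketch assumes that qualifiers within a row arrive in ascending byte order and that reset() is called at the start of every row.

```java
import java.nio.charset.StandardCharsets;

/** Standalone model of "first column in range"; hypothetical, not an HBase class. */
class FirstColumnInRangeSketch {
    /** Stand-in for the filter return codes mentioned in the thread. */
    enum Code { SKIP_COLUMN, INCLUDE, NEXT_ROW }

    private final byte[] minColumn;
    private final boolean minInclusive;
    private final byte[] maxColumn;
    private final boolean maxInclusive;
    private boolean foundInRow = false; // set once a column of the current row was included

    FirstColumnInRangeSketch(byte[] min, boolean minInc, byte[] max, boolean maxInc) {
        this.minColumn = min;
        this.minInclusive = minInc;
        this.maxColumn = max;
        this.maxInclusive = maxInc;
    }

    static FirstColumnInRangeSketch ofStrings(String min, boolean minInc,
                                              String max, boolean maxInc) {
        return new FirstColumnInRangeSketch(min.getBytes(StandardCharsets.UTF_8), minInc,
                max.getBytes(StandardCharsets.UTF_8), maxInc);
    }

    /** Assumed to be called at the start of every row. */
    void reset() {
        foundInRow = false;
    }

    /** Decides what to do with one column qualifier of the current row. */
    Code filterQualifier(byte[] qualifier) {
        if (foundInRow) {
            return Code.NEXT_ROW; // the first column in range was already returned
        }
        int cmpMin = compareBytes(qualifier, minColumn);
        if (cmpMin < 0 || (!minInclusive && cmpMin == 0)) {
            return Code.SKIP_COLUMN; // still below the range
        }
        int cmpMax = compareBytes(qualifier, maxColumn);
        if ((maxInclusive && cmpMax <= 0) || (!maxInclusive && cmpMax < 0)) {
            foundInRow = true;
            return Code.INCLUDE;
        }
        return Code.NEXT_ROW; // past the range, nothing more to find in this row
    }

    /** Unsigned lexicographic comparison of two byte arrays. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    /** Feeds one row's qualifiers through the filter; returns the included one or null. */
    static String firstIncluded(FirstColumnInRangeSketch f, String... qualifiers) {
        f.reset();
        for (String q : qualifiers) {
            Code c = f.filterQualifier(q.getBytes(StandardCharsets.UTF_8));
            if (c == Code.INCLUDE) {
                return q;
            }
            if (c == Code.NEXT_ROW) {
                return null;
            }
        }
        return null;
    }
}
```

In this model, a row with columns c1, c3, c4, c5 and the inclusive range [c2, c4] yields c3, and every later column of that row gets NEXT_ROW.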

Cheers


On Wed, Aug 27, 2014 at 9:05 PM, Jianshi Huang <ji...@gmail.com>
wrote:

> Very similar. We set up a column range (we're using ColumnRangeFilter right
> now), and we want the first column in the range.
>
> The problem is that we have a lot of rows.
>
> If there's no such capability, then we need to control the parallelism
> ourselves.
>
> Shall I sort the rows first before scanning? Will a random order be more
> efficient if we have many servers?
>
> Jianshi

Re: ResultScanner performance

Posted by Jianshi Huang <ji...@gmail.com>.

Very similar. We set up a column range (we're using ColumnRangeFilter right
now), and we want the first column in the range.

The problem is that we have a lot of rows.

If there's no such capability, then we need to control the parallelism
ourselves.

Shall I sort the rows first before scanning? Will a random order be more
efficient if we have many servers?
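
One way "controlling the parallelism ourselves" could look, as a rough sketch only: a fixed-size thread pool caps how many per-row lookups run at once. The class BoundedLookupSketch and the lookup function are hypothetical; in a real client the function would open a scan for one row, take the first column, and close the scanner before returning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper: runs one lookup per row key with a cap on concurrency. */
class BoundedLookupSketch {
    static <K, V> List<V> lookupAll(List<K> keys, int maxConcurrent, Function<K, V> lookup) {
        // At most maxConcurrent lookups are in flight at any time.
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        try {
            List<Future<V>> futures = new ArrayList<>();
            for (K key : keys) {
                Callable<V> task = () -> lookup.apply(key);
                futures.add(pool.submit(task));
            }
            // Collect results in the same order as the input keys.
            List<V> results = new ArrayList<>();
            for (Future<V> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for lookups", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("lookup failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The pool size is the only tuning knob in this sketch; whether sorted or random key order works better is left open, as in the question above.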

Jianshi

On Thu, Aug 28, 2014 at 1:44 AM, Ted Yu <yu...@gmail.com> wrote:

> So you want to specify several columns, e.g. c2, c3, and c4, and the GET is
> supposed to return the first one that is present (it doesn't have to be c2;
> it can be c3 if c2 is absent)?
>
> To my knowledge there is no such capability now.
>
> Cheers


Re: ResultScanner performance

Posted by Ted Yu <yu...@gmail.com>.

So you want to specify several columns, e.g. c2, c3, and c4, and the GET is
supposed to return the first one that is present (it doesn't have to be c2;
it can be c3 if c2 is absent)?

To my knowledge there is no such capability now.

Cheers

On Wed, Aug 27, 2014 at 10:28 AM, Jianshi Huang <ji...@gmail.com>
wrote:

> On Thu, Aug 28, 2014 at 1:20 AM, Jianshi Huang <ji...@gmail.com>
> wrote:
>
> >
> > There's a special but common case that for each row we only need the
> > first column. Is there a better way to do this than multiple scans + take(1)?
> >
>
> We still need to set a column range, is there a way to get the first column
> value of a range using GET?

Re: ResultScanner performance

Posted by Jianshi Huang <ji...@gmail.com>.

On Thu, Aug 28, 2014 at 1:20 AM, Jianshi Huang <ji...@gmail.com>
wrote:

>
> There's a special but common case that for each row we only need the first
> column. Is there a better way to do this than multiple scans + take(1)?
>

We still need to set a column range. Is there a way to get the first column
value of a range using GET?


Re: ResultScanner performance

Posted by Jianshi Huang <ji...@gmail.com>.

Hi,

The reason we cannot close the ResultScanner (or issue a multi-get) is
that we have wide rows with many columns, and we want to iterate over them
rather than get all the columns at once.

There's a special but common case that for each row we only need the first
column. Is there a better way to do this than multiple scans + take(1)?

Jianshi



On Wed, Aug 27, 2014 at 12:44 PM, Dai, Kevin <yu...@ebay.com> wrote:

> Hi, Ted
>
> I think you are right. But we must hold the ResultScanner for a while. So
> is there any way to reduce the performance loss? Or is there any way to
> share the connection?
>
> Best regards,
> Kevin.




RE: ResultScanner performance

Posted by "Dai, Kevin" <yu...@ebay.com>.

Hi, Ted

I think you are right. But we must hold the ResultScanner for a while. So is there any way to reduce the performance loss? Or is there any way to share the connection?

Best regards,
Kevin.

-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com] 
Sent: August 27, 2014 11:36
To: user@hbase.apache.org
Subject: Re: ResultScanner performance

Keeping many ResultScanners open at the same time is not good for performance.

Please see:
http://hbase.apache.org/book.html#perf.hbase.client.scannerclose

After fetching results from ResultScanner, you should close it ASAP.

Cheers



Re: ResultScanner performance

Posted by Ted Yu <yu...@gmail.com>.

Keeping many ResultScanners open at the same time is not good for
performance.

Please see:
http://hbase.apache.org/book.html#perf.hbase.client.scannerclose

After fetching results from ResultScanner, you should close it ASAP.
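
As an illustration of "close it ASAP", here is a small sketch using try-with-resources, which closes the scanner even when the loop exits early or throws. FakeScanner is a hypothetical stand-in so the example runs without a cluster; the real ResultScanner is Closeable and Iterable as well, so the same shape should carry over, but that is left as an assumption here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class ScannerCloseSketch {
    /** Hypothetical stand-in for a scanner that holds server-side resources. */
    static final class FakeScanner implements AutoCloseable, Iterable<String> {
        static int openCount = 0; // number of scanners currently open
        private final Iterator<String> rows;

        FakeScanner(List<String> rows) {
            this.rows = rows.iterator();
            openCount++;
        }

        @Override
        public Iterator<String> iterator() {
            return rows;
        }

        @Override
        public void close() {
            openCount--;
        }
    }

    /** Reads at most limit rows, then releases the scanner immediately. */
    static List<String> readFirst(List<String> rows, int limit) {
        List<String> out = new ArrayList<>();
        try (FakeScanner scanner = new FakeScanner(rows)) {
            for (String row : scanner) {
                if (out.size() == limit) {
                    break; // close() still runs when the try block is left
                }
                out.add(row);
            }
        }
        return out;
    }
}
```

After readFirst returns, no scanner is left open, regardless of how many rows were actually consumed.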

Cheers


On Tue, Aug 26, 2014 at 8:18 PM, Dai, Kevin <yu...@ebay.com> wrote:

> Hi, Ted
>
> We have a cluster of 48 machines and at least 100 TB of data (which is still
> increasing).
> The problem is that we have a lot of row keys (tens of thousands) to query
> at the same time, and we don't fetch all the data at once; instead we fetch
> it when needed,
> so we may hold tens of thousands of ResultScanners open at the same time.
> I want to know whether this will hurt performance and network resources
> and, if so, whether there is any way to solve it.
>
> Best regards,
> Kevin.

RE: ResultScanner performance

Posted by "Dai, Kevin" <yu...@ebay.com>.

Hi, Ted

We have a cluster of 48 machines and at least 100 TB of data (which is still increasing).
The problem is that we have a lot of row keys (tens of thousands) to query at the same time, and we don't fetch all the data at once; instead we fetch it when needed,
so we may hold tens of thousands of ResultScanners open at the same time.
I want to know whether this will hurt performance and network resources and, if so, whether there is any way to solve it.

Best regards,
Kevin.
-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com] 
Sent: August 26, 2014 16:49
To: user@hbase.apache.org
Cc: user@hbase.apache.org; Huang, Jianshi
Subject: Re: ResultScanner performance

Can you give a bit more detail?
What size is the cluster / dataset?
What problem are you solving?
Would using a coprocessor help reduce the usage of ResultScanner?

Cheers


Re: ResultScanner performance

Posted by Ted Yu <yu...@gmail.com>.

Can you give a bit more detail?
What size is the cluster / dataset?
What problem are you solving?
Would using a coprocessor help reduce the usage of ResultScanner?

Cheers

On Aug 26, 2014, at 12:13 AM, "Dai, Kevin" <yu...@ebay.com> wrote:

> Hi, everyone
> 
> My application will hold tens of thousands of ResultScanners open at once to fetch data. Will that hurt performance and network resources?
> If so, is there any way to solve it?
> Thanks,
> Kevin.