Posted to user@hbase.apache.org by "ming.liu" <mi...@esgyn.cn> on 2018/12/29 15:01:56 UTC

Will Scan use blockcache?

Hi, all,

 

I recently found that a short Scan is slower than a Get operation in HBase.
That is acceptable, but I really want to understand the reason.

My test table has only one row in it, so both Scan and Get return just one
row. Scan is still about 2x slower than Get.

So I want to understand the difference between get(rowkey) and Scan(rowkey,
rowkey).

I think Get will first look in the blockcache and, if the row is found there,
return without accessing the HFile/memstore.

Will Scan search the blockcache as well? Or does it go directly to the
memstore/HFile?

 

thanks,

Ming

Re: Will Scan use blockcache?

Posted by "张铎 (Duo Zhang)" <pa...@gmail.com>.

Which version are you using?

A Get is always a single RPC call, but a Scan may lead to multiple RPC calls,
depending on the HBase version and on some flags in the Scan object.

Re: Will Scan use blockcache?

Posted by "张铎 (Duo Zhang)" <pa...@gmail.com>.

Oh, I see, 1.2.0. Try scan.setSmall(true)?

RE: Will Scan use blockcache?

Posted by "ming.liu" <mi...@esgyn.cn>.

Thank you Stack,
I will do more testing. The code you pointed out makes it very clear that get() uses a scan internally.
I believe the performance difference comes from the RPCs. The get test uses table.get(), which is one call; the scan test calls two APIs, getScanner() and then next() on the ResultScanner, which I believe is another RPC.

I will test more. If necessary, I will file a JIRA.

thanks,
Ming
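
The RPC hypothesis in this message can be checked with simple arithmetic. The sketch below is not HBase code: the class name, the method, and every number in it are made up for illustration, and it assumes client-side latency is dominated by network round trips. As noted earlier in the thread, the real number of RPCs per scan depends on the HBase version and on flags in the Scan object.

```java
import java.util.Locale;

/** Toy latency model: client time = round trips * per-trip cost + server work. */
final class RoundTripSketch {
    static double clientLatencyMicros(int roundTrips, double perTripMicros, double serverWorkMicros) {
        return roundTrips * perTripMicros + serverWorkMicros;
    }

    public static void main(String[] args) {
        // Hypothetical values: 500 us per round trip, 20 us of server-side work.
        double get = clientLatencyMicros(1, 500.0, 20.0);   // one call
        double scan = clientLatencyMicros(2, 500.0, 20.0);  // e.g. open scanner + next
        System.out.println(String.format(Locale.ROOT, "ratio: %.2f", scan / get)); // prints ratio: 1.96
    }
}
```

Under these assumptions, a second round trip alone is enough to produce a roughly 2x gap for a one-row table, without any difference in how the block cache is used.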

-----Original Message-----
From: Stack <st...@duboce.net> 
Sent: Tuesday, January 01, 2019 6:29 AM
To: Hbase-User <us...@hbase.apache.org>
Subject: Re: Will Scan use blockcache?

On Sat, Dec 29, 2018 at 8:06 AM ming.liu <mi...@esgyn.cn> wrote:

> Thanks Stack,
>
> I have an impression that Get makes a Scan under the cover. But that
> cannot explain my observation of the performance difference between Get a
> single row vs. Scan a single row.
>
>
Here is how the Get gets converted into a Scan:
https://github.com/apache/hbase/blob/branch-1.2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6920
Maybe try doing the same in your experiment and, if there is still a difference,
file an issue and upload your test code. Explain how you ran your test (copy/paste
from here). branch-1.2 is old. I'd be interested in trying your test
against branch-2 to see if it has the issue you see.


> I assume the difference comes from the blockcache: Get() will first check
> the block cache and, if it matches, the call finishes and returns. But Scan
> will not check the block cache; it will go to the memstore and then to the
> HFile if the row is not in the memstore.
>
>
We first go to the memstore and, if that has not satisfied the query, then go
to the HFiles. HFile reads fetch blocks from the blockcache if present, else
go to HDFS (and then populate the cache). It should work this way whether Get
or Scan.

Thanks,
S


Re: Will Scan use blockcache?

Posted by Stack <st...@duboce.net>.

On Sat, Dec 29, 2018 at 8:06 AM ming.liu <mi...@esgyn.cn> wrote:

> Thanks Stack,
>
> I have an impression that Get makes a Scan under the cover. But that
> cannot explain my observation of the performance difference between Get a
> single row vs. Scan a single row.
>
>
Here is how the Get gets converted into a Scan:
https://github.com/apache/hbase/blob/branch-1.2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6920
Maybe try doing the same in your experiment and, if there is still a difference,
file an issue and upload your test code. Explain how you ran your test (copy/paste
from here). branch-1.2 is old. I'd be interested in trying your test
against branch-2 to see if it has the issue you see.


> I assume the difference comes from the blockcache: Get() will first check
> the block cache and, if it matches, the call finishes and returns. But Scan
> will not check the block cache; it will go to the memstore and then to the
> HFile if the row is not in the memstore.
>
>
We first go to the memstore and, if that has not satisfied the query, then go
to the HFiles. HFile reads fetch blocks from the blockcache if present, else
go to HDFS (and then populate the cache). It should work this way whether Get
or Scan.

Thanks,
S
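
The lookup order described in this reply (memstore first, then HFile blocks served from the blockcache if present, else from HDFS with the cache populated afterwards) can be pictured with a toy model. This is a sketch, not HBase code: the class and method names are hypothetical, and it caches whole rows, whereas the real block cache holds HFile blocks.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the read path; the same read() serves both Get and Scan here. */
final class ReadPathSketch {
    // Hypothetical stand-ins for the three storage layers: row key -> value.
    private final Map<String, String> memstore = new HashMap<>();
    private final Map<String, String> blockCache = new HashMap<>();
    private final Map<String, String> hdfs = new HashMap<>();
    private int hdfsReads = 0;

    void putInMemstore(String row, String value) { memstore.put(row, value); }
    void putOnHdfs(String row, String value) { hdfs.put(row, value); } // already-flushed data
    int hdfsReads() { return hdfsReads; }

    String read(String row) {
        String v = memstore.get(row);          // 1. newest data lives in the memstore
        if (v != null) return v;
        v = blockCache.get(row);               // 2. cached HFile data, if present
        if (v != null) return v;
        hdfsReads++;                           // 3. cache miss: go to HDFS
        v = hdfs.get(row);
        if (v != null) blockCache.put(row, v); //    and populate the cache
        return v;
    }

    public static void main(String[] args) {
        ReadPathSketch region = new ReadPathSketch();
        region.putOnHdfs("row1", "v1");
        region.read("row1"); // goes to HDFS, fills the cache
        region.read("row1"); // served from the cache
        System.out.println("HDFS reads: " + region.hdfsReads()); // prints HDFS reads: 1
    }
}
```

In this model a repeated read of a flushed row reaches HDFS only once, whichever operation triggers it, which is the point made above: cache usage does not distinguish Get from Scan.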


RE: Will Scan use blockcache?

Posted by "ming.liu" <mi...@esgyn.cn>.

Thanks Stack,

I had the impression that Get makes a Scan under the covers. But that cannot explain the performance difference I observe between a Get of a single row and a Scan of a single row.

I assume the difference comes from the blockcache: Get() first checks the block cache and, if the row is there, the call finishes and returns. But Scan does not check the block cache; it goes to the memstore and then to the HFile if the row is not in the memstore.

My test program does Get in a loop, for example 1000 Gets. Before the loop I save the start time, and after 1000 Gets I save the end time, so (endtime - starttime) / loop-count is the time spent in each Get operation.
I have the same loop with get() replaced by scan(). The scan() has startRowKey = endRowKey, so it covers just one row.

I ran the test program many times, using HBase 1.2.0. It shows that Scan is 2x slower than Get, so I want to understand the root cause. I assume get() finds the row in the blockcache, so it does not go to the memstore or HFile. But scan() must go to the HFile, because my test has no put operations, just pure reads. The row was inserted a long time ago, so it should have been flushed to an HFile and no longer be in the memstore, though I cannot confirm/verify this. So scan() has to send a request to HDFS to read from the HFile, and that is slower than the get() operation.

I can paste the test program if the description is still not clear.

I may need to replace Scan with Get wherever possible if there really is a performance difference. But if there is not, I won't bother to modify this.

thanks,
Ming
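
The timing loop described in this message can be sketched as follows. This is a minimal harness, not the actual test program (which was never posted): the class and method names are hypothetical, and the operation under test is passed in as a Runnable so the sketch has no HBase dependency. The warm-up pass is an addition of this sketch, not something the message mentions.

```java
/** Minimal timing harness: average wall-clock time per call over a fixed loop. */
final class TimingSketch {
    /** Runs op `loops` times and returns (end - start) / loops in nanoseconds. */
    static long averageNanos(int loops, Runnable op) {
        if (loops <= 0) throw new IllegalArgumentException("loops must be positive");
        long start = System.nanoTime();
        for (int i = 0; i < loops; i++) {
            op.run();
        }
        long end = System.nanoTime();
        return (end - start) / loops;
    }

    public static void main(String[] args) {
        int[] sink = new int[1];
        Runnable fakeGet = () -> sink[0]++;      // placeholder for a table.get(...) call
        averageNanos(1000, fakeGet);             // warm-up pass, result discarded
        long avg = averageNanos(1000, fakeGet);  // measured pass
        System.out.println("avg ns per call: " + avg);
    }
}
```

To compare the two operations, the same harness would be called once with a Runnable wrapping the get and once with a Runnable wrapping the whole scan (opening the scanner, calling next(), and closing it), so that both measurements cover the complete client-side cost.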

-----Original Message-----
From: Stack <st...@duboce.net> 
Sent: Saturday, December 29, 2018 11:50 PM
To: Hbase-User <us...@hbase.apache.org>
Subject: Re: Will Scan use blockcache?

A Get is a one-row Scan. Under the covers the Get makes a Scan. Scan/Get
both have to go to the memstore since it will have the latest versions of Cells.

Say more about how you are doing the comparison, please.

S

Re: Will Scan use blockcache?

Posted by Stack <st...@duboce.net>.

A Get is a one-row Scan. Under the covers the Get makes a Scan. Scan/Get
both have to go to the memstore since it will have the latest versions of Cells.

Say more about how you are doing the comparison, please.

S
