Posted to user@hbase.apache.org by Guillermo Ortiz <ko...@gmail.com> on 2014/09/10 10:40:30 UTC

Scan vs Parallel scan.

Hi,

I developed a distributed scan that creates one thread per region. I then
compared timings for Scan vs DistributedScan.
I have disabled the block cache for my table. My cluster has 3 region servers
with 2 regions each; the table has 100,000 rows in total and each test runs a
complete scan.

My partitions are
-016666 -> request 16665
016666-033332 -> request 16666
033332-049998 -> request 16666
049998-066664 -> request 16666
066664-083330 -> request 16666
083330- -> request 16671
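
A quick way to sanity-check these partitions is to count how many row keys
fall into each half-open [start, stop) range. The sketch below is only an
illustration, not code from the test: it assumes the row keys are the
six-digit, zero-padded strings 000001 to 100000 (as in the sample data shown
later in the thread), treats an empty bound as open-ended, and uses a
hypothetical class name PartitionCheck.

```java
import java.util.Arrays;

/** Illustration only: counts zero-padded row keys per [start, stop) partition. */
final class PartitionCheck {

    /** True when key is in [start, stop); an empty bound means open-ended. */
    static boolean inRange(String key, String start, String stop) {
        boolean atOrAfterStart = start.isEmpty() || key.compareTo(start) >= 0;
        boolean beforeStop = stop.isEmpty() || key.compareTo(stop) < 0;
        return atOrAfterStart && beforeStop;
    }

    /**
     * Counts keys 000001..totalRows per partition, where partition i spans
     * boundaries[i] (inclusive) to boundaries[i + 1] (exclusive).
     * A key is counted in every partition it matches, so overlapping
     * ranges would show up as a total larger than totalRows.
     */
    static int[] countPerPartition(String[] boundaries, int totalRows) {
        int[] counts = new int[boundaries.length - 1];
        for (int row = 1; row <= totalRows; row++) {
            String key = String.format("%06d", row); // assumed key format
            for (int p = 0; p < counts.length; p++) {
                if (inRange(key, boundaries[p], boundaries[p + 1])) {
                    counts[p]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] boundaries = {"", "016666", "033332", "049998", "066664", "083330", ""};
        int[] counts = countPerPartition(boundaries, 100_000);
        System.out.println(Arrays.toString(counts) + " total=" + Arrays.stream(counts).sum());
    }
}
```

Under those assumptions the six counts match the request numbers listed
above and add up to 100,000, so the ranges neither overlap nor leave gaps.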


14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 ->
Caching 10

14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 ->
Caching 100

14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 ->
Caching 1000

14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 ->
Caching 1

14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 ->
Caching 100

14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 ->
Caching 1000

The parallel scan performs much worse than the simple scan, and I don't know
why the simple scan is so fast: it is much faster than running "count" from
the HBase shell, which doesn't look normal. The only case where the parallel
scan wins is against a normal scan with caching 1.

Any clue about it?
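
One thing the caching numbers above hint at is the number of client/server
round trips. As a rough sketch (plain arithmetic, not HBase code; the class
and method names are made up for illustration), the minimum number of fetches
needed to return N rows at C rows per fetch is ceil(N / C). A real scanner
also needs calls to open and close, and typically re-opens at each region
boundary, so treat the figure as a lower bound.

```java
/** Illustration only: lower bound on fetches for a scan with a given caching value. */
final class ScannerRoundTrips {

    /** Returns ceil(rows / caching), the fewest fetches that can return all rows. */
    static long roundTrips(long rows, int caching) {
        if (caching <= 0) {
            throw new IllegalArgumentException("caching must be positive");
        }
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 100_000; // table size from the post
        for (int caching : new int[] {1, 10, 100, 1000}) {
            System.out.println("caching=" + caching + " -> at least "
                    + roundTrips(rows, caching) + " fetches");
        }
    }
}
```

For 100,000 rows this gives 100,000 fetches at caching 1 but only 1,000 at
caching 100, which is in line with the normal scan dropping from 68288 ms to
2646 ms. It does not explain why the parallel runs stay around 16 to 22
seconds.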

Re: Scan vs Parallel scan.

Posted by Esteban Gutierrez <es...@cloudera.com>.

Hi Guillermo,

Thanks for the additional information. How large is the difference between
the shell count command and the single-threaded scan you use, e.g. on the
order of 1% or 200%? Can you tell us which filter you are using for the
scan? Have you fully verified that you are in fact not using the block
cache at all and that all your reads bypass the cache and go directly to HDFS?

thanks,
esteban.


--
Cloudera, Inc.


On Wed, Sep 10, 2014 at 1:41 PM, Guillermo Ortiz <ko...@gmail.com>
wrote:

> What I mean is that I don't understand why a count takes more time than
> a complete scan without the cache. I thought a table scan should take more
> time than a count.
> My other question is why a distributed scan is slower than a sequential scan.
> Tomorrow I'll check how many disks we have.

Re: Scan vs Parallel scan.

Posted by Guillermo Ortiz <ko...@gmail.com>.

What I mean is that I don't understand why a count takes more time than
a complete scan without the cache. I thought a table scan should take more
time than a count.
My other question is why a distributed scan is slower than a sequential scan.
Tomorrow I'll check how many disks we have.

On Wednesday, September 10, 2014, Esteban Gutierrez <
esteban@cloudera.com> wrote:

> Hello Guillermo,
>
> Sounds like there is some potential contention going on. How many disks per
> node do you have?
>
> Can you explain further what you mean by "and I don't know why it's so
> fast,, it's really much faster than execute an "count" from hbase shell"?
> The count command from the shell uses the FirstKeyOnlyFilter and a caching
> of 10, which should be close to the behavior of your testing tool if it's
> using the same filter and the same cache settings.
>
> cheers,
> esteban.

Re: Scan vs Parallel scan.

Posted by Esteban Gutierrez <es...@cloudera.com>.

Hello Guillermo,

Sounds like there is some potential contention going on. How many disks per
node do you have?

Can you explain further what you mean by "and I don't know why it's so
fast,, it's really much faster than execute an "count" from hbase shell"?
The count command from the shell uses the FirstKeyOnlyFilter and a caching
of 10, which should be close to the behavior of your testing tool if it's
using the same filter and the same cache settings.

cheers,
esteban.




--
Cloudera, Inc.



Re: Scan vs Parallel scan.

Posted by Guillermo Ortiz <ko...@gmail.com>.

I'm attaching the code that I'm executing. I don't have access to the data
generator for HBase.
In the last benchmark, the simple scan takes about 4 times less time than this
version.

This version can only do complete scans.
I have been trying a complete scan of an HTable with 100,000 rows and it
takes less than one second. Isn't that too fast?




2014-09-14 20:21 GMT+02:00 Guillermo Ortiz <ko...@gmail.com>:

> I don't have the code here, but I created a class RegionScanner that does
> a complete scan of one region, so I have to set the start and stop keys.
> The start and stop keys are the boundaries of that region.
>
> On Sunday, September 14, 2014, Anoop John <an...@gmail.com>
> wrote:
>
>> Again, the full code snippet would speak better.
>>
>> But I don't get what you are doing with the code below:
>>
>> private List<RegionScanner> generatePartitions() {
>>         List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>>         byte[] startKey;
>>         byte[] stopKey;
>>         HConnection connection = null;
>>         HBaseAdmin hbaseAdmin = null;
>>         try {
>>             connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>             hbaseAdmin = new HBaseAdmin(connection);
>>             List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>>             RegionScanner regionScanner = null;
>>             for (HRegionInfo region : regions) {
>>
>>                 startKey = region.getStartKey();
>>                 stopKey = region.getEndKey();
>>
>>                 regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>>                 // regionScanner = createRegionScanner(startKey, stopKey);
>>                 if (regionScanner != null) {
>>                     regionScanners.add(regionScanner);
>>                 }
>>             }
>>
>> And I execute the RegionScanner with this:
>> public List<Result> call() throws Exception {
>>         HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>         HTableInterface table = connection.getTable(configuration.getTable());
>>
>>         Scan scan = new Scan(startKey, stopKey);
>>         scan.setBatch(configuration.getBatch());
>>         scan.setCaching(configuration.getCaching());
>>         ResultScanner resultScanner = table.getScanner(scan);
>>
>>
>> What is this part?
>> new RegionScanner(startKey, stopKey, scanConfiguration);
>>
>>
>> >> Scan scan = new Scan(startKey, stopKey);
>>         scan.setBatch(configuration.getBatch());
>>         scan.setCaching(configuration.getCaching());
>>         ResultScanner resultScanner = table.getScanner(scan);
>>
>>
>> And you are not setting start and stop rows on this Scan object?!
>>
>>
>> Sorry if I missed some parts of your code.
>>
>> -Anoop-
>>
>>
>> On Sun, Sep 14, 2014 at 2:54 PM, Guillermo Ortiz <ko...@gmail.com>
>> wrote:
>>
>> > I don't have the code here, but I'll post it in a couple of days. I
>> > have to check the ExecutorService again! I don't remember exactly how I
>> > did it.
>> >
>> > I'm using HBase 0.98.
>> >
>> > On Sunday, September 14, 2014, lars hofhansl <la...@apache.org>
>> > wrote:
>> >
>> > > What specific version of 0.94 are you using?
>> > >
>> > > In general, if you have multiple spindles (disks) and/or multiple CPU
>> > > cores at the region server, you should benefit from keeping multiple
>> > > region server handler threads busy. I have experimented with this before
>> > > and saw a close to linear speedup (up to the point where all disks/cores
>> > > were busy). Obviously this also assumes that this is the only load you
>> > > throw at the servers at this point.
>> > >
>> > > Can you post your complete code to pastebin? Maybe even with some code to
>> > > seed the data?
>> > > How do you run your callables? Did you configure the ExecutorService
>> > > correctly (assuming you use one to run your callables)?
>> > >
>> > > Then we can run it and have a look.
>> > >
>> > > Thanks.
>> > >
>> > > -- Lars
>> > >
>> > >
>> > > ----- Original Message -----
>> > > From: Guillermo Ortiz <konstt2000@gmail.com>
>> > > To: "user@hbase.apache.org" <user@hbase.apache.org>
>> > > Cc:
>> > > Sent: Saturday, September 13, 2014 4:49 PM
>> > > Subject: Re: Scan vs Parallel scan.
>> > >
>> > > What am I missing??
>> > >
>> > >
>> > >
>> > >
>> > > 2014-09-12 16:05 GMT+02:00 Guillermo Ortiz <konstt2000@gmail.com>:
>> > >
>> > > > For a partial scan, I guess that when I call the RS to get data, it
>> > > > starts looking in the store files and collecting the data. (It doesn't
>> > > > write to the block cache in either case.) Once it has the data ready,
>> > > > it returns it to the client step by step, depending on the caching and
>> > > > batching parameters.
>> > > >
>> > > > The big differences that I see:
>> > > > I'm opening more connections to the table, one per region.
>> > > >
>> > > > I should check the single table scan; it looks like it does partial
>> > > > scans sequentially, since you can see on the HBase Master how the
>> > > > requests increase one after another, not all at the same time.
>> > > >
>> > > > 2014-09-12 15:23 GMT+02:00 Michael Segel <michael_segel@hotmail.com>:
>> > > >
>> > > >> It doesn’t matter which RS, but that you have 1 thread for each region.
>> > > >>
>> > > >> So for each thread, what’s happening?
>> > > >> Step by step, what is the code doing?
>> > > >>
>> > > >> Now you’re comparing this against a single table scan, right?
>> > > >> What’s happening in the table scan…?
>> > > >>
>> > > >>
>> > > >> On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz <konstt2000@gmail.com>
>> > > >> wrote:
>> > > >>
>> > > >> > Right. My table, for example, has keys between 0-9 in three regions:
>> > > >> > 0-2, 3-7, 7-9.
>> > > >> > I launch three partial scans in parallel. The scans that I'm executing
>> > > >> > are: scan(0,2), scan(3,7), scan(7,9).
>> > > >> > Each region is on a different RS, so each thread goes to a different
>> > > >> > RS. It's not exactly like that, but in the benchmark case that is how
>> > > >> > it works.
>> > > >> >
>> > > >> > Really, the code executes a thread for each region, not for each
>> > > >> > RegionServer. But in the test I only have two regions per
>> > > >> > RegionServer. I don't think that's an important point; there are two
>> > > >> > threads per RS.
>> > > >> >
>> > > >> > 2014-09-12 14:48 GMT+02:00 Michael Segel <michael_segel@hotmail.com>:
>> > > >> >
>> > > >> >> Ok, let's again take a step back…
>> > > >> >>
>> > > >> >> So you are comparing your partial scan(s) against a full table scan?
>> > > >> >>
>> > > >> >> If I understood your question, you launch 3 partial scans where you
>> > > >> >> set the start row and the end row of each scan, right?
>> > > >> >>
>> > > >> >> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz <konstt2000@gmail.com>
>> > > >> >> wrote:
>> > > >> >>
>> > > >> >>> Okay, then the partial scan doesn't work as I thought.
>> > > >> >>> How could it exceed the limits of a single region if I calculate the
>> > > >> >>> limits?
>> > > >> >>>
>> > > >> >>>
>> > > >> >>> The only bad point that I see is that if a region server has three
>> > > >> >>> regions of the same table, I'm executing three partial scans against
>> > > >> >>> this RS and they could compete for resources (network, etc.) on this
>> > > >> >>> node. It'd be better to have one thread per RS. But that doesn't
>> > > >> >>> answer your questions.
>> > > >> >>>
>> > > >> >>> I keep thinking...
>> > > >> >>>
>> > > >> >>> 2014-09-12 9:40 GMT+02:00 Michael Segel <michael_segel@hotmail.com>:
>> > > >> >>>
>> > > >> >>>> Hi,
>> > > >> >>>>
>> > > >> >>>> I wanted to take a step back from the actual code and to stop and
>> > > >> >>>> think about what you are doing and what HBase is doing under the
>> > > >> >>>> covers.
>> > > >> >>>>
>> > > >> >>>> So in your code, you are asking HBase to do 3 separate scans and then
>> > > >> >>>> you take the result set back and join it.
>> > > >> >>>>
>> > > >> >>>> What does HBase do when it does a range scan?
>> > > >> >>>> What happens when that range scan exceeds a single region?
>> > > >> >>>>
>> > > >> >>>> If you answer those questions… you’ll have your answer.
>> > > >> >>>>
>> > > >> >>>> HTH
>> > > >> >>>>
>> > > >> >>>> -Mike
>> > > >> >>>>
>> > > >> >>>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz <konstt2000@gmail.com>
>> > > >> >>>> wrote:
>> > > >> >>>>
>> > > >> >>>>> It's not all the code; I also set things like these:
>> > > >> >>>>> scan.setMaxVersions();
>> > > >> >>>>> scan.setCacheBlocks(false);
>> > > >> >>>>> ...
>> > > >> >>>>>
>> > > >> >>>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <konstt2000@gmail.com>:
>> > > >> >>>>>
>> > > >> >>>>>> Yes, that is it. I have changed the HBase version to 0.98.
>> > > >> >>>>>>
>> > > >> >>>>>> I got the start and stop keys with this method:
>> > > >> >>>>>> private List<RegionScanner> generatePartitions() {
>> > > >> >>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>> > > >> >>>>>>     byte[] startKey;
>> > > >> >>>>>>     byte[] stopKey;
>> > > >> >>>>>>     HConnection connection = null;
>> > > >> >>>>>>     HBaseAdmin hbaseAdmin = null;
>> > > >> >>>>>>     try {
>> > > >> >>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>> > > >> >>>>>>         hbaseAdmin = new HBaseAdmin(connection);
>> > > >> >>>>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>> > > >> >>>>>>         RegionScanner regionScanner = null;
>> > > >> >>>>>>         for (HRegionInfo region : regions) {
>> > > >> >>>>>>
>> > > >> >>>>>>             startKey = region.getStartKey();
>> > > >> >>>>>>             stopKey = region.getEndKey();
>> > > >> >>>>>>
>> > > >> >>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>> > > >> >>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
>> > > >> >>>>>>             if (regionScanner != null) {
>> > > >> >>>>>>                 regionScanners.add(regionScanner);
>> > > >> >>>>>>             }
>> > > >> >>>>>>         }
>> > > >> >>>>>>
>> > > >> >>>>>> And I execute the RegionScanner with this:
>> > > >> >>>>>> public List<Result> call() throws Exception {
>> > > >> >>>>>>     HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>> > > >> >>>>>>     HTableInterface table = connection.getTable(configuration.getTable());
>> > > >> >>>>>>
>> > > >> >>>>>>     Scan scan = new Scan(startKey, stopKey);
>> > > >> >>>>>>     scan.setBatch(configuration.getBatch());
>> > > >> >>>>>>     scan.setCaching(configuration.getCaching());
>> > > >> >>>>>>     ResultScanner resultScanner = table.getScanner(scan);
>> > > >> >>>>>>
>> > > >> >>>>>>     List<Result> results = new ArrayList<Result>();
>> > > >> >>>>>>     for (Result result : resultScanner) {
>> > > >> >>>>>>         results.add(result);
>> > > >> >>>>>>     }
>> > > >> >>>>>>
>> > > >> >>>>>>     connection.close();
>> > > >> >>>>>>     table.close();
>> > > >> >>>>>>
>> > > >> >>>>>>     return results;
>> > > >> >>>>>> }
>> > > >> >>>>>>
>> > > >> >>>>>> They implement Callable.
>> > > >> >>>>>>
>> > > >> >>>>>>
>> > > >> >>>>>> 2014-09-12 9:26 GMT+02:00 Michael Segel <michael_segel@hotmail.com>:
>> > > >> >>>>>>
>> > > >> >>>>>>> Let's take a step back….
>> > > >> >>>>>>>
>> > > >> >>>>>>> Your parallel scan is having the client create N threads where in
>> > > >> >>>>>>> each thread, you’re doing a partial scan of the table where each
>> > > >> >>>>>>> partial scan takes the first and last row of each region?
>> > > >> >>>>>>>
>> > > >> >>>>>>> Is that correct?
>> > > >> >>>>>>>
>> > > >> >>>>>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <konstt2000@gmail.com>
>> > > >> >>>>>>> wrote:
>> > > >> >>>>>>>
>> > > >> >>>>>>>> I was checking a little bit more. I checked the cluster and the
>> > > >> >>>>>>>> data is stored in three different region servers, each one on a
>> > > >> >>>>>>>> different node.
>> > > >> >>>>>>>> So, I guess the threads go to different hard disks.
>> > > >> >>>>>>>>
>> > > >> >>>>>>>> Does anyone have an idea or suggestion as to why a single scan is
>> > > >> >>>>>>>> faster than this implementation? I based it on this implementation:
>> > > >> >>>>>>>> https://github.com/zygm0nt/hbase-distributed-search
>> > > >> >>>>>>>>
>> > > >> >>>>>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <konstt2000@gmail.com>:
>> > > >> >>>>>>>>
>> > > >> >>>>>>>>> I'm working with HBase 0.94 for this case. I'll try with 0.98,
>> > > >> >>>>>>>>> although it should make no difference.
>> > > >> >>>>>>>>> I disabled the table, disabled the block cache for that family,
>> > > >> >>>>>>>>> and also set scan.setCacheBlocks(false) in both cases.
>> > > >> >>>>>>>>>
>> > > >> >>>>>>>>> I don't think it's possible that I'm executing a complete scan in
>> > > >> >>>>>>>>> each thread, since my data looks like this:
>> > > >> >>>>>>>>> 000001 f:q value=1
>> > > >> >>>>>>>>> 000002 f:q value=2
>> > > >> >>>>>>>>> 000003 f:q value=3
>> > > >> >>>>>>>>> ...
>> > > >> >>>>>>>>>
>> > > >> >>>>>>>>> I add up all the values and get the same result with a single
>> > > >> >>>>>>>>> scan as with a distributed one, so I guess that DistributedScan
>> > > >> >>>>>>>>> works correctly.
>> > > >> >>>>>>>>> The count from the hbase shell takes about 10-15 seconds, I don't
>> > > >> >>>>>>>>> remember exactly, but about 4x the scan time.
>> > > >> >>>>>>>>> I'm not using any filter for the scans.
>> > > >> >>>>>>>>>
>> > > >> >>>>>>>>> This is the way I calculate the number of regions/scans:
>> > > >> >>>>>>>>> private List<RegionScanner> generatePartitions() {
>> > > >> >>>>>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>> > > >> >>>>>>>>>     byte[] startKey;
>> > > >> >>>>>>>>>     byte[] stopKey;
>> > > >> >>>>>>>>>     HConnection connection = null;
>> > > >> >>>>>>>>>     HBaseAdmin hbaseAdmin = null;
>> > > >> >>>>>>>>>     try {
>> > > >> >>>>>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>> > > >> >>>>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
>> > > >> >>>>>>>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>> > > >> >>>>>>>>>         RegionScanner regionScanner = null;
>> > > >> >>>>>>>>>         for (HRegionInfo region : regions) {
>> > > >> >>>>>>>>>
>> > > >> >>>>>>>>>             startKey = region.getStartKey();
>> > > >> >>>>>>>>>             stopKey = region.getEndKey();
>> > > >> >>>>>>>>>
>> > > >> >>>>>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>> > > >> >>>>>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
>> > > >> >>>>>>>>>             if (regionScanner != null) {
>> > > >> >>>>>>>>>                 regionScanners.add(regionScanner);
>> > > >> >>>>>>>>>             }
>> > > >> >>>>>>>>>         }
>> > > >> >>>>>>>>>
>> > > >> >>>>>>>>> I did some tests on a tiny table and I think that the range for
>> > > >> >>>>>>>>> each scan works fine. Although, I thought it was interesting that
>> > > >> >>>>>>>>> the time when I execute the distributed scan is about 6x.
>> > > >> >>>>>>>>>
>> > > >> >>>>>>>>> I'm going to check on the hard disks, but I think that it's right.
>> > > >> >>>>>>>>>
>> > > >> >>>>>>>>>
>> > > >> >>>>>>>>>
>> > > >> >>>>>>>>>
>> > > >> >>>>>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <larsh@apache.org>:
>> > > >> >>>>>>>>>
>> > > >> >>>>>>>>>> Which version of HBase?
>> > > >> >>>>>>>>>> Can you show us the code?
>> > > >> >>>>>>>>>>
>> > > >> >>>>>>>>>>
>> > > >> >>>>>>>>>> Your parallel scan with caching 100 takes about 6x as long as the
>> > > >> >>>>>>>>>> single scan, which is suspicious because you say you have 6
>> > > >> >>>>>>>>>> regions.
>> > > >> >>>>>>>>>> Are you sure you're not accidentally scanning all the data in
>> > > >> >>>>>>>>>> each of your parallel scans?
>> > > >> >>>>>>>>>>
>> > > >> >>>>>>>>>> -- Lars

Re: Scan vs Parallel scan.

Posted by Guillermo Ortiz <ko...@gmail.com>.

I don't have the code here, but I created a class called RegionScanner that
does a complete scan of a single region, so I have to set the start and stop
keys. The start and stop keys are the boundaries of that region.
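
To make the idea concrete without the real code, here is a minimal stand-alone sketch of "one task per region, each bounded by that region's start and stop keys". It does not use the HBase client at all: a sorted in-memory map stands in for the table, and RangeTask, parallelSum and buildTable are hypothetical names made up for this illustration. It assumes the usual convention that the start key is inclusive, the stop key is exclusive, and an empty key means "no bound on that side".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRangeScanSketch {

    /** Hypothetical stand-in for RegionScanner: reads the rows in [startKey, stopKey). */
    static class RangeTask implements Callable<List<Integer>> {
        private final NavigableMap<String, Integer> table;
        private final String startKey; // "" means "from the first row"
        private final String stopKey;  // "" means "up to the last row"

        RangeTask(NavigableMap<String, Integer> table, String startKey, String stopKey) {
            this.table = table;
            this.startKey = startKey;
            this.stopKey = stopKey;
        }

        @Override
        public List<Integer> call() {
            NavigableMap<String, Integer> view = table;
            if (!startKey.isEmpty()) {
                view = view.tailMap(startKey, true);  // start key is inclusive
            }
            if (!stopKey.isEmpty()) {
                view = view.headMap(stopKey, false);  // stop key is exclusive
            }
            return new ArrayList<>(view.values());
        }
    }

    /** Runs one task per range (splits.length + 1 ranges) and adds up every value read. */
    static long parallelSum(NavigableMap<String, Integer> table, String[] splits) {
        List<RangeTask> tasks = new ArrayList<>();
        String start = "";
        for (String split : splits) {
            tasks.add(new RangeTask(table, start, split));
            start = split;
        }
        tasks.add(new RangeTask(table, start, ""));

        // One thread per range, so no task has to wait for another to finish.
        ExecutorService pool = Executors.newFixedThreadPool(tasks.size());
        try {
            long sum = 0;
            for (Future<List<Integer>> future : pool.invokeAll(tasks)) {
                for (int value : future.get()) {
                    sum += value;
                }
            }
            return sum;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    /** Builds rows "000001".."NNNNNN" whose value equals the row number. */
    static NavigableMap<String, Integer> buildTable(int rows) {
        NavigableMap<String, Integer> table = new TreeMap<>();
        for (int i = 1; i <= rows; i++) {
            table.put(String.format("%06d", i), i);
        }
        return table;
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> table = buildTable(1000);
        long sum = parallelSum(table, new String[] {"000250", "000500", "000750"});
        System.out.println("sum over all ranges = " + sum);
    }
}
```

If the sum over all ranges matches the sum from a single full pass, every row was read exactly once, which is the same sanity check I did on the real table. This only models the client-side structure; it says nothing about where the time goes on the region servers.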

On Sunday, September 14, 2014, Anoop John <an...@gmail.com> wrote:

> Again full code snippet can better speak.
>
> But not getting what u r doing with below code
>
> private List<RegionScanner> generatePartitions() {
>         List<RegionScanner> regionScanners = new
> ArrayList<RegionScanner>();
>         byte[] startKey;
>         byte[] stopKey;
>         HConnection connection = null;
>         HBaseAdmin hbaseAdmin = null;
>         try {
>             connection = HConnectionManager.
> createConnection(HBaseConfiguration.create());
>             hbaseAdmin = new HBaseAdmin(connection);
>             List<HRegionInfo> regions =
> hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>             RegionScanner regionScanner = null;
>             for (HRegionInfo region : regions) {
>
>                 startKey = region.getStartKey();
>                 stopKey = region.getEndKey();
>
>                 regionScanner = new RegionScanner(startKey, stopKey,
> scanConfiguration);
>                 // regionScanner = createRegionScanner(startKey, stopKey);
>                 if (regionScanner != null) {
>                     regionScanners.add(regionScanner);
>                 }
>             }
>
> And I execute the RegionScanner with this:
> public List<Result> call() throws Exception {
>         HConnection connection =
> HConnectionManager.
> createConnection(HBaseConfiguration.create());
>         HTableInterface table =
> connection.getTable(configuration.getTable());
>
>     Scan scan = new Scan(startKey, stopKey);
>         scan.setBatch(configuration.getBatch());
>         scan.setCaching(configuration.getCaching());
>         ResultScanner resultScanner = table.getScanner(scan);
>
>
> What is this part?
> new RegionScanner(startKey, stopKey,
> scanConfiguration);
>
>
> >>Scan scan = new Scan(startKey, stopKey);
>         scan.setBatch(configuration.
> getBatch());
>         scan.setCaching(configuration.getCaching());
>         ResultScanner resultScanner = table.getScanner(scan);
>
>
> And not setting start and stop rows to this Scan object? !!
>
>
> Sorry If I missed some parts from ur code.
>
> -Anoop-
>
>
> On Sun, Sep 14, 2014 at 2:54 PM, Guillermo Ortiz <konstt2000@gmail.com> wrote:
>
> > I don't have the code here, but I'll post it in a couple of days. I
> > have to check the ExecutorService again! I don't remember exactly how I
> > did it.
> >
> > I'm using Hbase 0.98.
> >
> > On Sunday, September 14, 2014, lars hofhansl <larsh@apache.org> wrote:
> >
> > > What specific version of 0.94 are you using?
> > >
> > > In general, if you have multiple spindles (disks) and/or multiple CPU
> > > cores at the region server, you should benefit from keeping multiple
> > > region server handler threads busy. I have experimented with this before
> > > and saw a close to linear speedup (up to the point where all disks/cores
> > > were busy). Obviously this also assumes this is the only load you throw
> > > at the servers at this point.
> > >
> > > Can you post your complete code to pastebin? Maybe even with some code to
> > > seed the data?
> > > How do you run your callables? Did you configure the ExecutorService
> > > correctly (assuming you use one to run your callables)?
> > >
> > > Then we can run it and have a look.
> > >
> > > Thanks.
> > >
> > > -- Lars
> > >
> > >
> > > ----- Original Message -----
> > > From: Guillermo Ortiz <konstt2000@gmail.com>
> > > To: "user@hbase.apache.org" <user@hbase.apache.org>
> > > Cc:
> > > Sent: Saturday, September 13, 2014 4:49 PM
> > > Subject: Re: Scan vs Parallel scan.
> > >
> > > What am I missing??
> > >
> > >
> > >
> > >
> > > 2014-09-12 16:05 GMT+02:00 Guillermo Ortiz <konstt2000@gmail.com>:
> > >
> > > > For an partial scan, I guess that I call to the RS to get data, it
> > starts
> > > > looking in the store files and recollecting the data. (It doesn't
> write
> > > to
> > > > the blockcache in both cases). It has ready the data and it gives to
> > the
> > > > client the data step by step, I mean,,, it depends the caching and
> > > batching
> > > > parameters.
> > > >
> > > > Big differences that I see...
> > > > I'm opening more connections to the Table, one for Region.
> > > >
> > > > I should check the single table scan, it looks like it does partial
> > scans
> > > > sequentially. Since you can see on the HBase Master how the request
> > > > increase one after another, not all in the same time.
> > > >
> > > > 2014-09-12 15:23 GMT+02:00 Michael Segel <michael_segel@hotmail.com>:
> > > >
> > > >> It doesn’t matter which RS, but that you have 1 thread for each
> > region.
> > > >>
> > > >> So for each thread, what’s happening.
> > > >> Step by step, what is the code doing.
> > > >>
> > > >> Now you’re comparing this against a single table scan, right?
> > > >> What’s happening in the table scan…?
> > > >>
> > > >>
> > > >> On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz <konstt2000@gmail.com> wrote:
> > > >>
> > > >> > Right, My table for example has keys between 0-9. in three regions
> > > >> > 0-2,3-7,7-9
> > > >> > I lauch three partial scans in parallel. The scans that I'm
> > executing
> > > >> are:
> > > >> > scan(0,2), scan(3,7), scan(7,9).
> > > >> > Each region is if a different RS, so each thread goes to different
> > RS.
> > > >> It's
> > > >> > not exactly like that, but on the benchmark case it's like it's
> > > working.
> > > >> >
> > > >> > Really the code will execute a thread for each Region not for each
> > > >> > RegionServer. But in the test I only have two regions for
> > > regionServer.
> > > >> I
> > > >> > dont' think that's an important point, there're two threads for
> RS.
> > > >> >
> > > >> > 2014-09-12 14:48 GMT+02:00 Michael Segel <michael_segel@hotmail.com>:
> > > >> >
> > > >> >> Ok, lets again take a step back…
> > > >> >>
> > > >> >> So you are comparing your partial scan(s) against a full table
> > scan?
> > > >> >>
> > > >> >> If I understood your question, you launch 3 partial scans where
> you
> > > set
> > > >> >> the start row and then end row of each scan, right?
> > > >> >>
> > > >> >> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz <konstt2000@gmail.com> wrote:
> > > >> >>
> > > >> >>> Okay, then, the partial scan doesn't work as I think.
> > > >> >>> How could it exceed the limit of a single region if I calculate
> > the
> > > >> >> limits?
> > > >> >>>
> > > >> >>>
> > > >> >>> The only bad point that I see it's that If a region server has
> > three
> > > >> >>> regions of the same table,  I'm executing three partial scans
> > about
> > > >> this
> > > >> >> RS
> > > >> >>> and they could compete for resources (network, etc..) on this
> > node.
> > > >> It'd
> > > >> >> be
> > > >> >>> better to have one thread for RS. But, that doesn't answer your
> > > >> >> questions.
> > > >> >>>
> > > >> >>> I keep thinking...
> > > >> >>>
> > > >> >>> 2014-09-12 9:40 GMT+02:00 Michael Segel <michael_segel@hotmail.com>:
> > > >> >>>
> > > >> >>>> Hi,
> > > >> >>>>
> > > >> >>>> I wanted to take a step back from the actual code and to stop
> and
> > > >> think
> > > >> >>>> about what you are doing and what HBase is doing under the
> > covers.
> > > >> >>>>
> > > >> >>>> So in your code, you are asking HBase to do 3 separate scans
> and
> > > then
> > > >> >> you
> > > >> >>>> take the result set back and join it.
> > > >> >>>>
> > > >> >>>> What does HBase do when it does a range scan?
> > > >> >>>> What happens when that range scan exceeds a single region?
> > > >> >>>>
> > > >> >>>> If you answer those questions… you’ll have your answer.
> > > >> >>>>
> > > >> >>>> HTH
> > > >> >>>>
> > > >> >>>> -Mike
> > > >> >>>>
> > > >> >>>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz <konstt2000@gmail.com> wrote:
> > > >> >>>>
> > > >> >>>>> It's not all the code, I set things like these as well:
> > > >> >>>>> scan.setMaxVersions();
> > > >> >>>>> scan.setCacheBlocks(false);
> > > >> >>>>> ...
> > > >> >>>>>
> > > >> >>>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <konstt2000@gmail.com>:
> > > >> >>>>>
> > > >> >>>>>> yes, that is. I have changed the HBase version to 0.98
> > > >> >>>>>>
> > > >> >>>>>> I got the start and stop keys with this method:
> > > >> >>>>>> private List<RegionScanner> generatePartitions() {
> > > >> >>>>>>      List<RegionScanner> regionScanners = new
> > > >> >>>>>> ArrayList<RegionScanner>();
> > > >> >>>>>>      byte[] startKey;
> > > >> >>>>>>      byte[] stopKey;
> > > >> >>>>>>      HConnection connection = null;
> > > >> >>>>>>      HBaseAdmin hbaseAdmin = null;
> > > >> >>>>>>      try {
> > > >> >>>>>>          connection = HConnectionManager.
> > > >> >>>>>> createConnection(HBaseConfiguration.create());
> > > >> >>>>>>          hbaseAdmin = new HBaseAdmin(connection);
> > > >> >>>>>>          List<HRegionInfo> regions =
> > > >> >>>>>> hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> > > >> >>>>>>          RegionScanner regionScanner = null;
> > > >> >>>>>>          for (HRegionInfo region : regions) {
> > > >> >>>>>>
> > > >> >>>>>>              startKey = region.getStartKey();
> > > >> >>>>>>              stopKey = region.getEndKey();
> > > >> >>>>>>
> > > >> >>>>>>              regionScanner = new RegionScanner(startKey,
> > stopKey,
> > > >> >>>>>> scanConfiguration);
> > > >> >>>>>>              // regionScanner = createRegionScanner(startKey,
> > > >> >>>> stopKey);
> > > >> >>>>>>              if (regionScanner != null) {
> > > >> >>>>>>                  regionScanners.add(regionScanner);
> > > >> >>>>>>              }
> > > >> >>>>>>          }
> > > >> >>>>>>
> > > >> >>>>>> And I execute the RegionScanner with this:
> > > >> >>>>>> public List<Result> call() throws Exception {
> > > >> >>>>>>      HConnection connection =
> > > >> >>>>>>
> > HConnectionManager.createConnection(HBaseConfiguration.create());
> > > >> >>>>>>      HTableInterface table =
> > > >> >>>>>> connection.getTable(configuration.getTable());
> > > >> >>>>>>
> > > >> >>>>>>  Scan scan = new Scan(startKey, stopKey);
> > > >> >>>>>>      scan.setBatch(configuration.getBatch());
> > > >> >>>>>>      scan.setCaching(configuration.getCaching());
> > > >> >>>>>>      ResultScanner resultScanner = table.getScanner(scan);
> > > >> >>>>>>
> > > >> >>>>>>      List<Result> results = new ArrayList<Result>();
> > > >> >>>>>>      for (Result result : resultScanner) {
> > > >> >>>>>>          results.add(result);
> > > >> >>>>>>      }
> > > >> >>>>>>
> > > >> >>>>>>      connection.close();
> > > >> >>>>>>      table.close();
> > > >> >>>>>>
> > > >> >>>>>>      return results;
> > > >> >>>>>>  }
> > > >> >>>>>>
> > > >> >>>>>> They implement Callable.
> > > >> >>>>>>
> > > >> >>>>>>
> > > >> >>>>>> 2014-09-12 9:26 GMT+02:00 Michael Segel <michael_segel@hotmail.com>:
> > > >> >>>>>>
> > > >> >>>>>>> Lets take a step back….
> > > >> >>>>>>>
> > > >> >>>>>>> Your parallel scan is having the client create N threads
> where
> > > in
> > > >> >> each
> > > >> >>>>>>> thread, you’re doing a partial scan of the table where each
> > > >> partial
> > > >> >>>> scan
> > > >> >>>>>>> takes the first and last row of each region?
> > > >> >>>>>>>
> > > >> >>>>>>> Is that correct?
> > > >> >>>>>>>
> > > >> >>>>>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <konstt2000@gmail.com> wrote:
> > > >> >>>>>>>
> > > >> >>>>>>>> I was checking a little bit more about,, I checked the
> > cluster
> > > >> and
> > > >> >>>> data
> > > >> >>>>>>> is
> > > >> >>>>>>>> store in three different regions servers, each one in a
> > > >> differente
> > > >> >>>> node.
> > > >> >>>>>>>> So, I guess the threads go to different hard-disks.
> > > >> >>>>>>>>
> > > >> >>>>>>>> If someone has an idea or suggestion.. why it's faster a
> > single
> > > >> scan
> > > >> >>>>>>> than
> > > >> >>>>>>>> this implementation. I based on this implementation
> > > >> >>>>>>>> https://github.com/zygm0nt/hbase-distributed-search
> > > >> >>>>>>>>
> > > >> >>>>>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <konstt2000@gmail.com>:
> > > >> >>>>>>>>
> > > >> >>>>>>>>> I'm working with HBase 0.94 for this case,, I'll try with
> > > 0.98,
> > > >> >>>>>>> although
> > > >> >>>>>>>>> there is not difference.
> > > >> >>>>>>>>> I disabled the table and disabled the blockcache for that
> > > family
> > > >> >> and
> > > >> >>>> I
> > > >> >>>>>>> put
> > > >> >>>>>>>>> scan.setBlockcache(false) as well for both cases.
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> I think that it's not possible that I executing an
> complete
> > > scan
> > > >> >> for
> > > >> >>>>>>> each
> > > >> >>>>>>>>> thread since my data are the type:
> > > >> >>>>>>>>> 000001 f:q value=1
> > > >> >>>>>>>>> 000002 f:q value=2
> > > >> >>>>>>>>> 000003 f:q value=3
> > > >> >>>>>>>>> ...
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> I add all the values and get the same result on a single
> > scan
> > > >> than
> > > >> >> a
> > > >> >>>>>>>>> distributed, so, I guess that DistributedScan did well.
> > > >> >>>>>>>>> The count from the hbase shell takes about 10-15seconds, I
> > > don't
> > > >> >>>>>>> remember,
> > > >> >>>>>>>>> but like 4x  of the scan time.
> > > >> >>>>>>>>> I'm not using any filter for the scans.
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> This is the way I calculate number of regions/scans
> > > >> >>>>>>>>> private List<RegionScanner> generatePartitions() {
> > > >> >>>>>>>>>     List<RegionScanner> regionScanners = new
> > > >> >>>>>>>>> ArrayList<RegionScanner>();
> > > >> >>>>>>>>>     byte[] startKey;
> > > >> >>>>>>>>>     byte[] stopKey;
> > > >> >>>>>>>>>     HConnection connection = null;
> > > >> >>>>>>>>>     HBaseAdmin hbaseAdmin = null;
> > > >> >>>>>>>>>     try {
> > > >> >>>>>>>>>         connection =
> > > >> >>>>>>>>>
> > > >> HConnectionManager.createConnection(HBaseConfiguration.create());
> > > >> >>>>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
> > > >> >>>>>>>>>         List<HRegionInfo> regions =
> > > >> >>>>>>>>> hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> > > >> >>>>>>>>>         RegionScanner regionScanner = null;
> > > >> >>>>>>>>>         for (HRegionInfo region : regions) {
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>             startKey = region.getStartKey();
> > > >> >>>>>>>>>             stopKey = region.getEndKey();
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>             regionScanner = new RegionScanner(startKey,
> > > stopKey,
> > > >> >>>>>>>>> scanConfiguration);
> > > >> >>>>>>>>>             // regionScanner =
> createRegionScanner(startKey,
> > > >> >>>>>>> stopKey);
> > > >> >>>>>>>>>             if (regionScanner != null) {
> > > >> >>>>>>>>>                 regionScanners.add(regionScanner);
> > > >> >>>>>>>>>             }
> > > >> >>>>>>>>>         }
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> I did some test for a tiny table and I think that the
> range
> > > for
> > > >> >> each
> > > >> >>>>>>> scan
> > > >> >>>>>>>>> works fine. Although, I though that it was interesting
> that
> > > the
> > > >> >> time
> > > >> >>>>>>> when I
> > > >> >>>>>>>>> execute distributed scan is about 6x.
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> I'm going to check about the hard disks, but I think that
> > ti's
> > > >> >> right.
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <larsh@apache.org>:
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>> Which version of HBase?
> > > >> >>>>>>>>>> Can you show us the code?
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Your parallel scan with caching 100 takes about 6x as
> long
> > as
> > > >> the
> > > >> >>>>>>> single
> > > >> >>>>>>>>>> scan, which is suspicious because you say you have 6
> > regions.
> > > >> >>>>>>>>>> Are you sure you're not accidentally scanning all the
> data
> > in
> > > >> each
> > > >> >>>> of
> > > >> >>>>>>>>>> your parallel scans?
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> -- Lars
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> ________________________________
> > > >> >>>>>>>>>> From: Guillermo Ortiz <konstt2000@gmail.com>
> > > >> >>>>>>>>>> To: "user@hbase.apache.org" <user@hbase.apache.org>
> > > >> >>>>>>>>>> Sent: Wednesday, September 10, 2014 1:40 AM
> > > >> >>>>>>>>>> Subject: Scan vs Parallel scan.
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Hi,
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> I developed an distributed scan, I create an thread for
> > each
> > > >> >> region.
> > > >> >>>>>>> After
> > > >> >>>>>>>>>> that, I've tried to get some times Scan vs
> DistributedScan.
> > > >> >>>>>>>>>> I have disabled blockcache in my table. My cluster has 3
> > > region
> > > >> >>>>>>> servers
> > > >> >>>>>>>>>> with 2 regions each one, in total there are 100.000 rows
> > and
> > > >> >>>> execute a
> > > >> >>>>>>>>>> complete scan.
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> My partitions are
> > > >> >>>>>>>>>> -01666 -> request 16665
> > > >> >>>>>>>>>> 016666-033332 -> request 16666
> > > >> >>>>>>>>>> 033332-049998 -> request 16666
> > > >> >>>>>>>>>> 049998-066664 -> request 16666
> > > >> >>>>>>>>>> 066664-083330 -> request 16666
> > > >> >>>>>>>>>> 083330- -> request 16671
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > > >> >>>>>>>>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > > >> >>>>>>>>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > > >> >>>>>>>>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > > >> >>>>>>>>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > > >> >>>>>>>>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > > >> >>>>>>>>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> The parallel scan performs much worse than the simple scan, and I
> > > >> >>>>>>>>>> don't know why the simple scan is so fast. It's much faster than
> > > >> >>>>>>>>>> running a "count" from the hbase shell, which doesn't look normal.
> > > >> >>>>>>>>>> The only case where the parallel scan does better is against a
> > > >> >>>>>>>>>> normal scan with caching 1.
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Any clue about it?
> > > >> >>>>>>>>>>

Re: Scan vs Parallel scan.

Posted by Anoop John <an...@gmail.com>.

Again, a full code snippet would speak better.

But I don't understand what you are doing with the code below:

private List<RegionScanner> generatePartitions() {
    List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
    byte[] startKey;
    byte[] stopKey;
    HConnection connection = null;
    HBaseAdmin hbaseAdmin = null;
    try {
        connection = HConnectionManager.createConnection(HBaseConfiguration.create());
        hbaseAdmin = new HBaseAdmin(connection);
        List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
        RegionScanner regionScanner = null;
        for (HRegionInfo region : regions) {

            startKey = region.getStartKey();
            stopKey = region.getEndKey();

            regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
            // regionScanner = createRegionScanner(startKey, stopKey);
            if (regionScanner != null) {
                regionScanners.add(regionScanner);
            }
        }

And I execute the RegionScanner with this:
public List<Result> call() throws Exception {
    HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create());
    HTableInterface table = connection.getTable(configuration.getTable());

    Scan scan = new Scan(startKey, stopKey);
    scan.setBatch(configuration.getBatch());
    scan.setCaching(configuration.getCaching());
    ResultScanner resultScanner = table.getScanner(scan);


What is this part?
new RegionScanner(startKey, stopKey, scanConfiguration);


>>Scan scan = new Scan(startKey, stopKey);
        scan.setBatch(configuration.getBatch());
        scan.setCaching(configuration.getCaching());
        ResultScanner resultScanner = table.getScanner(scan);


And you are not setting the start and stop rows on this Scan object?!
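
For reference, here is a small stand-alone check of what region boundaries imply when they are used as scan limits. It is only a sketch: it assumes the start row is inclusive, the stop row is exclusive, and an empty key means "unbounded", and it uses plain Java strings instead of the HBase client (for ASCII digit keys, String ordering agrees with byte ordering). RegionRangeCheck, inRange and countPerRange are hypothetical names; the split points and the "%06d" row-key format are the ones quoted from the first post.

```java
import java.util.Arrays;

public class RegionRangeCheck {

    /** True if row lies in [start, stop); an empty key means no bound on that side. */
    static boolean inRange(String row, String start, String stop) {
        boolean afterStart = start.isEmpty() || row.compareTo(start) >= 0;
        boolean beforeStop = stop.isEmpty() || row.compareTo(stop) < 0;
        return afterStart && beforeStop;
    }

    /** Counts rows "000001".."NNNNNN" per range; splits are the interior boundaries. */
    static int[] countPerRange(int rows, String[] splits) {
        int[] counts = new int[splits.length + 1];
        for (int i = 1; i <= rows; i++) {
            String row = String.format("%06d", i);
            for (int r = 0; r <= splits.length; r++) {
                String start = (r == 0) ? "" : splits[r - 1];
                String stop = (r == splits.length) ? "" : splits[r];
                if (inRange(row, start, stop)) {
                    counts[r]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] splits = {"016666", "033332", "049998", "066664", "083330"};
        int[] counts = countPerRange(100000, splits);
        System.out.println(Arrays.toString(counts));
        System.out.println("total = " + Arrays.stream(counts).sum());
    }
}
```

Under those assumptions each row falls in exactly one range and the per-range counts add up to the table size, so scans bounded this way should not overlap. Whether the real RegionScanner passes the keys through unchanged is something only the full code can show.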


Sorry if I missed some parts of your code.

-Anoop-


On Sun, Sep 14, 2014 at 2:54 PM, Guillermo Ortiz <ko...@gmail.com>
wrote:

> I don't have the code here, but I'll post it in a couple of days. I
> have to check the ExecutorService again! I don't remember exactly how I did it.
>
> I'm using Hbase 0.98.
>
> On Sunday, September 14, 2014, lars hofhansl <la...@apache.org> wrote:
>
> > What specific version of 0.94 are you using?
> >
> > In general, if you have multiple spindles (disks) and/or multiple CPU
> > cores at the region server, you should benefit from keeping multiple
> > region server handler threads busy. I have experimented with this before
> > and saw a close to linear speedup (up to the point where all disks/cores
> > were busy). Obviously this also assumes this is the only load you throw
> > at the servers at this point.
> >
> > Can you post your complete code to pastebin? Maybe even with some code to
> > seed the data?
> > How do you run your callables? Did you configure the ExecutorService
> > correctly (assuming you use one to run your callables)?
> >
> > Then we can run it and have a look.
> >
> > Thanks.
> >
> > -- Lars
> >
> >
> > ----- Original Message -----
> > From: Guillermo Ortiz <konstt2000@gmail.com>
> > To: "user@hbase.apache.org" <user@hbase.apache.org>
> > Cc:
> > Sent: Saturday, September 13, 2014 4:49 PM
> > Subject: Re: Scan vs Parallel scan.
> >
> > What am I missing??
> >
> >
> >
> >
> > 2014-09-12 16:05 GMT+02:00 Guillermo Ortiz <konstt2000@gmail.com>:
> >
> > > For an partial scan, I guess that I call to the RS to get data, it
> starts
> > > looking in the store files and recollecting the data. (It doesn't write
> > to
> > > the blockcache in both cases). It has ready the data and it gives to
> the
> > > client the data step by step, I mean,,, it depends the caching and
> > batching
> > > parameters.
> > >
> > > Big differences that I see...
> > > I'm opening more connections to the Table, one for Region.
> > >
> > > I should check the single table scan, it looks like it does partial
> scans
> > > sequentially. Since you can see on the HBase Master how the request
> > > increase one after another, not all in the same time.
> > >
> > > 2014-09-12 15:23 GMT+02:00 Michael Segel <michael_segel@hotmail.com>:
> > >
> > >> It doesn’t matter which RS, but that you have 1 thread for each
> region.
> > >>
> > >> So for each thread, what’s happening.
> > >> Step by step, what is the code doing.
> > >>
> > >> Now you’re comparing this against a single table scan, right?
> > >> What’s happening in the table scan…?
> > >>
> > >>
> > >> On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz <konstt2000@gmail.com> wrote:
> > >>
> > >> > Right, My table for example has keys between 0-9. in three regions
> > >> > 0-2,3-7,7-9
> > >> > I lauch three partial scans in parallel. The scans that I'm
> executing
> > >> are:
> > >> > scan(0,2), scan(3,7), scan(7,9).
> > >> > Each region is if a different RS, so each thread goes to different
> RS.
> > >> It's
> > >> > not exactly like that, but on the benchmark case it's like it's
> > working.
> > >> >
> > >> > Really the code will execute a thread for each Region not for each
> > >> > RegionServer. But in the test I only have two regions for
> > regionServer.
> > >> I
> > >> > dont' think that's an important point, there're two threads for RS.
> > >> >
> > >> > 2014-09-12 14:48 GMT+02:00 Michael Segel <michael_segel@hotmail.com>:
> > >> >
> > >> >> Ok, lets again take a step back…
> > >> >>
> > >> >> So you are comparing your partial scan(s) against a full table
> scan?
> > >> >>
> > >> >> If I understood your question, you launch 3 partial scans where you
> > set
> > >> >> the start row and then end row of each scan, right?
> > >> >>
> > >> >> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz <konstt2000@gmail.com> wrote:
> > >> >>
> > >> >>> Okay, then, the partial scan doesn't work as I think.
> > >> >>> How could it exceed the limit of a single region if I calculate
> the
> > >> >> limits?
> > >> >>>
> > >> >>>
> > >> >>> The only bad point that I see it's that If a region server has
> three
> > >> >>> regions of the same table,  I'm executing three partial scans
> about
> > >> this
> > >> >> RS
> > >> >>> and they could compete for resources (network, etc..) on this
> node.
> > >> It'd
> > >> >> be
> > >> >>> better to have one thread for RS. But, that doesn't answer your
> > >> >> questions.
> > >> >>>
> > >> >>> I keep thinking...
> > >> >>>
> > >> >>> 2014-09-12 9:40 GMT+02:00 Michael Segel <michael_segel@hotmail.com>:
> > >> >>>
> > >> >>>> Hi,
> > >> >>>>
> > >> >>>> I wanted to take a step back from the actual code and to stop and
> > >> think
> > >> >>>> about what you are doing and what HBase is doing under the
> covers.
> > >> >>>>
> > >> >>>> So in your code, you are asking HBase to do 3 separate scans and
> > then
> > >> >> you
> > >> >>>> take the result set back and join it.
> > >> >>>>
> > >> >>>> What does HBase do when it does a range scan?
> > >> >>>> What happens when that range scan exceeds a single region?
> > >> >>>>
> > >> >>>> If you answer those questions… you’ll have your answer.
> > >> >>>>
> > >> >>>> HTH
> > >> >>>>
> > >> >>>> -Mike
> > >> >>>>
> > >> >>>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz <konstt2000@gmail.com> wrote:
> > >> >>>>
> > >> >>>>> It's not all the code, I set things like these as well:
> > >> >>>>> scan.setMaxVersions();
> > >> >>>>> scan.setCacheBlocks(false);
> > >> >>>>> ...
> > >> >>>>>
> > >> >>>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <konstt2000@gmail.com>:
> > >> >>>>>
> > >> >>>>>> yes, that is. I have changed the HBase version to 0.98
> > >> >>>>>>
> > >> >>>>>> I got the start and stop keys with this method:
> > >> >>>>>> private List<RegionScanner> generatePartitions() {
> > >> >>>>>>      List<RegionScanner> regionScanners = new
> > >> >>>>>> ArrayList<RegionScanner>();
> > >> >>>>>>      byte[] startKey;
> > >> >>>>>>      byte[] stopKey;
> > >> >>>>>>      HConnection connection = null;
> > >> >>>>>>      HBaseAdmin hbaseAdmin = null;
> > >> >>>>>>      try {
> > >> >>>>>>          connection = HConnectionManager.
> > >> >>>>>> createConnection(HBaseConfiguration.create());
> > >> >>>>>>          hbaseAdmin = new HBaseAdmin(connection);
> > >> >>>>>>          List<HRegionInfo> regions =
> > >> >>>>>> hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> > >> >>>>>>          RegionScanner regionScanner = null;
> > >> >>>>>>          for (HRegionInfo region : regions) {
> > >> >>>>>>
> > >> >>>>>>              startKey = region.getStartKey();
> > >> >>>>>>              stopKey = region.getEndKey();
> > >> >>>>>>
> > >> >>>>>>              regionScanner = new RegionScanner(startKey,
> stopKey,
> > >> >>>>>> scanConfiguration);
> > >> >>>>>>              // regionScanner = createRegionScanner(startKey,
> > >> >>>> stopKey);
> > >> >>>>>>              if (regionScanner != null) {
> > >> >>>>>>                  regionScanners.add(regionScanner);
> > >> >>>>>>              }
> > >> >>>>>>          }
> > >> >>>>>>
> > >> >>>>>> And I execute the RegionScanner with this:
> > >> >>>>>> public List<Result> call() throws Exception {
> > >> >>>>>>      HConnection connection =
> > >> >>>>>>
> HConnectionManager.createConnection(HBaseConfiguration.create());
> > >> >>>>>>      HTableInterface table =
> > >> >>>>>> connection.getTable(configuration.getTable());
> > >> >>>>>>
> > >> >>>>>>  Scan scan = new Scan(startKey, stopKey);
> > >> >>>>>>      scan.setBatch(configuration.getBatch());
> > >> >>>>>>      scan.setCaching(configuration.getCaching());
> > >> >>>>>>      ResultScanner resultScanner = table.getScanner(scan);
> > >> >>>>>>
> > >> >>>>>>      List<Result> results = new ArrayList<Result>();
> > >> >>>>>>      for (Result result : resultScanner) {
> > >> >>>>>>          results.add(result);
> > >> >>>>>>      }
> > >> >>>>>>
> > >> >>>>>>      connection.close();
> > >> >>>>>>      table.close();
> > >> >>>>>>
> > >> >>>>>>      return results;
> > >> >>>>>>  }
> > >> >>>>>>
> > >> >>>>>> They implement Callable.
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>> 2014-09-12 9:26 GMT+02:00 Michael Segel <michael_segel@hotmail.com>:
> > >> >>>>>>
> > >> >>>>>>> Lets take a step back….
> > >> >>>>>>>
> > >> >>>>>>> Your parallel scan is having the client create N threads where
> > in
> > >> >> each
> > >> >>>>>>> thread, you’re doing a partial scan of the table where each
> > >> partial
> > >> >>>> scan
> > >> >>>>>>> takes the first and last row of each region?
> > >> >>>>>>>
> > >> >>>>>>> Is that correct?
> > >> >>>>>>>
> > >> >>>>>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <konstt2000@gmail.com> wrote:
> > >> >>>>>>>
> > >> >>>>>>>> I was checking a little bit more about,, I checked the
> cluster
> > >> and
> > >> >>>> data
> > >> >>>>>>> is
> > >> >>>>>>>> store in three different regions servers, each one in a
> > >> differente
> > >> >>>> node.
> > >> >>>>>>>> So, I guess the threads go to different hard-disks.
> > >> >>>>>>>>
> > >> >>>>>>>> If someone has an idea or suggestion.. why it's faster a
> single
> > >> scan
> > >> >>>>>>> than
> > >> >>>>>>>> this implementation. I based on this implementation
> > >> >>>>>>>> https://github.com/zygm0nt/hbase-distributed-search
> > >> >>>>>>>>
> > >> >>>>>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <
> > konstt2000@gmail.com <javascript:;>
> > >> >:
> > >> >>>>>>>>
> > >> >>>>>>>>> I'm working with HBase 0.94 for this case,, I'll try with
> > 0.98,
> > >> >>>>>>> although
> > >> >>>>>>>>> there is not difference.
> > >> >>>>>>>>> I disabled the table and disabled the blockcache for that
> > family
> > >> >> and
> > >> >>>> I
> > >> >>>>>>> put
> > >> >>>>>>>>> scan.setBlockcache(false) as well for both cases.
> > >> >>>>>>>>>
> > >> >>>>>>>>> I think that it's not possible that I executing an complete
> > scan
> > >> >> for
> > >> >>>>>>> each
> > >> >>>>>>>>> thread since my data are the type:
> > >> >>>>>>>>> 000001 f:q value=1
> > >> >>>>>>>>> 000002 f:q value=2
> > >> >>>>>>>>> 000003 f:q value=3
> > >> >>>>>>>>> ...
> > >> >>>>>>>>>
> > >> >>>>>>>>> I add all the values and get the same result on a single
> scan
> > >> than
> > >> >> a
> > >> >>>>>>>>> distributed, so, I guess that DistributedScan did well.
> > >> >>>>>>>>> The count from the hbase shell takes about 10-15seconds, I
> > don't
> > >> >>>>>>> remember,
> > >> >>>>>>>>> but like 4x  of the scan time.
> > >> >>>>>>>>> I'm not using any filter for the scans.
> > >> >>>>>>>>>
> > >> >>>>>>>>> This is the way I calculate number of regions/scans
> > >> >>>>>>>>> private List<RegionScanner> generatePartitions() {
> > >> >>>>>>>>>     List<RegionScanner> regionScanners = new
> > >> >>>>>>>>> ArrayList<RegionScanner>();
> > >> >>>>>>>>>     byte[] startKey;
> > >> >>>>>>>>>     byte[] stopKey;
> > >> >>>>>>>>>     HConnection connection = null;
> > >> >>>>>>>>>     HBaseAdmin hbaseAdmin = null;
> > >> >>>>>>>>>     try {
> > >> >>>>>>>>>         connection =
> > >> >>>>>>>>>
> > >> HConnectionManager.createConnection(HBaseConfiguration.create());
> > >> >>>>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
> > >> >>>>>>>>>         List<HRegionInfo> regions =
> > >> >>>>>>>>> hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> > >> >>>>>>>>>         RegionScanner regionScanner = null;
> > >> >>>>>>>>>         for (HRegionInfo region : regions) {
> > >> >>>>>>>>>
> > >> >>>>>>>>>             startKey = region.getStartKey();
> > >> >>>>>>>>>             stopKey = region.getEndKey();
> > >> >>>>>>>>>
> > >> >>>>>>>>>             regionScanner = new RegionScanner(startKey,
> > stopKey,
> > >> >>>>>>>>> scanConfiguration);
> > >> >>>>>>>>>             // regionScanner = createRegionScanner(startKey,
> > >> >>>>>>> stopKey);
> > >> >>>>>>>>>             if (regionScanner != null) {
> > >> >>>>>>>>>                 regionScanners.add(regionScanner);
> > >> >>>>>>>>>             }
> > >> >>>>>>>>>         }
> > >> >>>>>>>>>
> > >> >>>>>>>>> I did some test for a tiny table and I think that the range
> > for
> > >> >> each
> > >> >>>>>>> scan
> > >> >>>>>>>>> works fine. Although, I though that it was interesting that
> > the
> > >> >> time
> > >> >>>>>>> when I
> > >> >>>>>>>>> execute distributed scan is about 6x.
> > >> >>>>>>>>>
> > >> >>>>>>>>> I'm going to check about the hard disks, but I think that
> ti's
> > >> >> right.
> > >> >>>>>>>>>
> > >> >>>>>>>>>
> > >> >>>>>>>>>
> > >> >>>>>>>>>
> > >> >>>>>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <larsh@apache.org
> > <javascript:;>>:
> > >> >>>>>>>>>
> > >> >>>>>>>>>> Which version of HBase?
> > >> >>>>>>>>>> Can you show us the code?
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Your parallel scan with caching 100 takes about 6x as long
> as
> > >> the
> > >> >>>>>>> single
> > >> >>>>>>>>>> scan, which is suspicious because you say you have 6
> regions.
> > >> >>>>>>>>>> Are you sure you're not accidentally scanning all the data
> in
> > >> each
> > >> >>>> of
> > >> >>>>>>>>>> your parallel scans?
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> -- Lars
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> ________________________________
> > >> >>>>>>>>>> From: Guillermo Ortiz <konstt2000@gmail.com
> <javascript:;>>
> > >> >>>>>>>>>> To: "user@hbase.apache.org <javascript:;>" <
> > user@hbase.apache.org <javascript:;>>
> > >> >>>>>>>>>> Sent: Wednesday, September 10, 2014 1:40 AM
> > >> >>>>>>>>>> Subject: Scan vs Parallel scan.
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Hi,
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> I developed an distributed scan, I create an thread for
> each
> > >> >> region.
> > >> >>>>>>> After
> > >> >>>>>>>>>> that, I've tried to get some times Scan vs DistributedScan.
> > >> >>>>>>>>>> I have disabled blockcache in my table. My cluster has 3
> > region
> > >> >>>>>>> servers
> > >> >>>>>>>>>> with 2 regions each one, in total there are 100.000 rows
> and
> > >> >>>> execute a
> > >> >>>>>>>>>> complete scan.
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> My partitions are
> > >> >>>>>>>>>> -01666 -> request 16665
> > >> >>>>>>>>>> 016666-033332 -> request 16666
> > >> >>>>>>>>>> 033332-049998 -> request 16666
> > >> >>>>>>>>>> 049998-066664 -> request 16666
> > >> >>>>>>>>>> 066664-083330 -> request 16666
> > >> >>>>>>>>>> 083330- -> request 16671
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > >> >>>>>>>>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN
> > >> >>>>>>> PARALLEL:22089ms,Counter:2 ->
> > >> >>>>>>>>>> Caching 10
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > >> >>>>>>>>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN
> > >> >>>>>>> PARALJEL:16598ms,Counter:2 ->
> > >> >>>>>>>>>> Caching 100
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > >> >>>>>>>>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN
> > >> >>>>>>> PARALLEL:16497ms,Counter:2 ->
> > >> >>>>>>>>>> Caching 1000
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > >> >>>>>>>>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN
> > >> >> NORMAL:68288ms,Counter:2
> > >> >>>>>>> ->
> > >> >>>>>>>>>> Caching 1
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > >> >>>>>>>>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN
> > >> >> NORMAL:2646ms,Counter:2
> > >> >>>> ->
> > >> >>>>>>>>>> Caching 100
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
> > >> >>>>>>>>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN
> > >> >> NORMAL:3903ms,Counter:2
> > >> >>>> ->
> > >> >>>>>>>>>> Caching 1000
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Parallel scan works much worse than simple scan,, and I
> don't
> > >> know
> > >> >>>> why
> > >> >>>>>>>>>> it's
> > >> >>>>>>>>>> so fast,, it's really much faster than execute an "count"
> > from
> > >> >> hbase
> > >> >>>>>>>>>> shell,
> > >> >>>>>>>>>> what it doesn't look pretty notmal. The only time that it
> > works
> > >> >>>> better
> > >> >>>>>>>>>> parallel is when I execute a normal scan with caching 1.
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Any clue about it?
> > >> >>>>>>>>>>
> > >> >>>>>>>>>
> > >> >>>>>>>>>
> > >> >>>>>>>
> > >> >>>>>>>
> > >> >>>>>>
> > >> >>>>
> > >> >>>>
> > >> >>
> > >> >>
> > >>
> > >>
> > >
> >
>

Re: Scan vs Parallel scan.

Posted by Guillermo Ortiz <ko...@gmail.com>.

I don't have the code here, but I'll post it in a couple of days. I
have to check the ExecutorService again! I don't remember exactly how I set
it up.

I'm using HBase 0.98.

On Sunday, September 14, 2014, lars hofhansl <la...@apache.org>
wrote:

> What specific version of 0.94 are you using?
>
> In general, if you have multiple spindles (disks) and/or multiple CPU
> cores at the region server, you should benefit from keeping multiple region
> server handler threads busy. I have experimented with this before and saw a
> close to linear speedup (up to the point where all disks/cores were busy).
> Obviously this also assumes that this is the only load you throw at the
> servers at this point.
>
> Can you post your complete code to pastebin? Maybe even with some code to
> seed the data?
> How do you run your callables? Did you configure the ExecutorService
> correctly (assuming you use one to run your callables)?
>
> Then we can run it and have a look.
>
> Thanks.
>
> -- Lars

Re: Scan vs Parallel scan.

Posted by lars hofhansl <la...@apache.org>.

What specific version of 0.94 are you using?

In general, if you have multiple spindles (disks) and/or multiple CPU cores at the region server, you should benefit from keeping multiple region server handler threads busy. I have experimented with this before and saw a close to linear speedup (up to the point where all disks/cores were busy). Obviously this also assumes that this is the only load you throw at the servers at this point.

Can you post your complete code to pastebin? Maybe even with some code to seed the data?
How do you run your callables? Did you configure the ExecutorService correctly (assuming you use one to run your callables)?

Then we can run it and have a look.

Thanks.

-- Lars
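
For readers following the question about how the callables are run, here is a minimal sketch of one way to drive per-range tasks with an ExecutorService. It is not the code from this thread and it does not use the HBase client: a TreeMap stands in for the table, and the names ParallelRangeScanSketch, seed, rangeTask and scanInParallel are hypothetical. It assumes each range is half-open (start key included, stop key excluded) and that the pool has one thread per range, so no task waits behind another.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: an in-memory sorted map stands in for the table.
class ParallelRangeScanSketch {

    // Builds a table with zero-padded keys ("000000", "000001", ...),
    // where each value is the row index.
    static NavigableMap<String, Integer> seed(int rows) {
        NavigableMap<String, Integer> table = new TreeMap<>();
        for (int i = 0; i < rows; i++) {
            table.put(String.format("%06d", i), i);
        }
        return table;
    }

    // Hypothetical stand-in for one partial scan: collects the values whose
    // keys fall in [startKey, stopKey).
    static Callable<List<Integer>> rangeTask(NavigableMap<String, Integer> table,
                                             String startKey, String stopKey) {
        return () -> new ArrayList<Integer>(
                table.subMap(startKey, true, stopKey, false).values());
    }

    // Runs one task per range on a fixed pool and merges results in range order.
    static List<Integer> scanInParallel(NavigableMap<String, Integer> table,
                                        String[] boundaries) {
        int ranges = boundaries.length - 1;
        // One thread per range; a smaller pool would run some ranges one after another.
        ExecutorService pool = Executors.newFixedThreadPool(ranges);
        try {
            List<Callable<List<Integer>>> tasks = new ArrayList<>();
            for (int i = 0; i < ranges; i++) {
                tasks.add(rangeTask(table, boundaries[i], boundaries[i + 1]));
            }
            List<Integer> merged = new ArrayList<>();
            // invokeAll blocks until all tasks finish and returns futures in task order.
            for (Future<List<Integer>> done : pool.invokeAll(tasks)) {
                merged.addAll(done.get());
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> table = seed(100);
        String[] boundaries = {"000000", "000034", "000067", "000100"};
        List<Integer> rows = scanInParallel(table, boundaries);
        long sum = 0;
        for (int value : rows) {
            sum += value;
        }
        System.out.println("rows=" + rows.size() + " sum=" + sum);
    }
}
```

If the pool were created with fewer threads than ranges (for example a single-thread executor), the tasks would still all complete, but largely sequentially, which is one of the things worth checking when a "parallel" scan is slower than expected.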


----- Original Message -----
From: Guillermo Ortiz <ko...@gmail.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>
Cc: 
Sent: Saturday, September 13, 2014 4:49 PM
Subject: Re: Scan vs Parallel scan.

What am I missing??




2014-09-12 16:05 GMT+02:00 Guillermo Ortiz <ko...@gmail.com>:

> For a partial scan, I guess the client calls the RS to get data, and the RS
> starts looking in the store files and collecting the data. (It doesn't write
> to the block cache in either case.) Once the data is ready, the RS returns it
> to the client step by step, depending on the caching and batching
> parameters.
>
> The big difference that I see:
> I'm opening more connections to the table, one per region.
>
> I should check the single table scan; it looks like it does partial scans
> sequentially, since you can see on the HBase Master how the request counts
> increase one after another, not all at the same time.
>
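
As a rough illustration of why the caching parameter mentioned above matters: each scanner round trip returns at most `caching` rows, so a full scan needs on the order of rows / caching round trips. The sketch below only does that arithmetic. The class and method names are hypothetical, and it ignores the scanner open/close calls, region boundaries, and any server-side result size limit.

```java
// Sketch only: back-of-the-envelope estimate, not a measurement.
class ScanCachingEstimate {

    // Approximate number of scanner round trips for a scan over `rows` rows
    // when each round trip returns at most `caching` rows: ceil(rows / caching).
    static long roundTrips(long rows, int caching) {
        if (caching < 1) {
            throw new IllegalArgumentException("caching must be >= 1");
        }
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 100_000L; // table size reported in the original post
        for (int caching : new int[] {1, 10, 100, 1000}) {
            System.out.println("caching=" + caching
                    + " -> about " + roundTrips(rows, caching) + " round trips");
        }
    }
}
```

This is consistent in direction with the timings in the original post, where the normal scan took 68288 ms with caching 1 and 2646 ms with caching 100. It does not by itself explain why the parallel scan is slower.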
> 2014-09-12 15:23 GMT+02:00 Michael Segel <mi...@hotmail.com>:
>
>> It doesn’t matter which RS, but that you have 1 thread for each region.
>>
>> So for each thread, what’s happening.
>> Step by step, what is the code doing.
>>
>> Now you’re comparing this against a single table scan, right?
>> What’s happening in the table scan…?
>>
>>
>> On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz <ko...@gmail.com>
>> wrote:
>>
>> > Right. My table, for example, has keys between 0 and 9 in three regions:
>> > 0-2,3-7,7-9
>> > I launch three partial scans in parallel. The scans that I'm executing
>> > are:
>> > scan(0,2), scan(3,7), scan(7,9).
>> > Each region is on a different RS, so each thread goes to a different RS.
>> > It's not exactly like that, but in the benchmark case that is how it
>> > works.
>> >
>> > Really, the code executes one thread per region, not per RegionServer.
>> > But in the test I only have two regions per RegionServer. I don't think
>> > that's an important point; there are two threads per RS.
>> >
>> > 2014-09-12 14:48 GMT+02:00 Michael Segel <mi...@hotmail.com>:
>> >
>> >> OK, let's again take a step back…
>> >>
>> >> So you are comparing your partial scan(s) against a full table scan?
>> >>
>> >> If I understood your question, you launch 3 partial scans where you set
>> >> the start row and then end row of each scan, right?
>> >>
>> >> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz <ko...@gmail.com>
>> wrote:
>> >>
>> >>> Okay, then the partial scan doesn't work as I thought.
>> >>> How could it exceed the limit of a single region if I calculate the
>> >>> limits?
>> >>>
>> >>>
>> >>> The only drawback I see is that if a region server has three regions of
>> >>> the same table, I'm executing three partial scans against this RS and
>> >>> they could compete for resources (network, etc.) on this node. It'd be
>> >>> better to have one thread per RS. But that doesn't answer your
>> >>> questions.
>> >>>
>> >>> I keep thinking...
>> >>>
>> >>> 2014-09-12 9:40 GMT+02:00 Michael Segel <mi...@hotmail.com>:
>> >>>
>> >>>> Hi,
>> >>>>
>> >>>> I wanted to take a step back from the actual code and to stop and think
>> >>>> about what you are doing and what HBase is doing under the covers.
>> >>>>
>> >>>> So in your code, you are asking HBase to do 3 separate scans and then
>> >>>> you take the result set back and join it.
>> >>>>
>> >>>> What does HBase do when it does a range scan?
>> >>>> What happens when that range scan exceeds a single region?
>> >>>>
>> >>>> If you answer those questions… you’ll have your answer.
>> >>>>
>> >>>> HTH
>> >>>>
>> >>>> -Mike
>> >>>>
>> >>>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz <ko...@gmail.com>
>> >> wrote:
>> >>>>
>> >>>>> It's not all the code, I set things like these as well:
>> >>>>> scan.setMaxVersions();
>> >>>>> scan.setCacheBlocks(false);
>> >>>>> ...
>> >>>>>
>> >>>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <ko...@gmail.com>:
>> >>>>>
>> >>>>>> Yes, that's right. I have changed the HBase version to 0.98.
>> >>>>>>
>> >>>>>> I got the start and stop keys with this method:
>> >>>>>> private List<RegionScanner> generatePartitions() {
>> >>>>>>      List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>> >>>>>>      byte[] startKey;
>> >>>>>>      byte[] stopKey;
>> >>>>>>      HConnection connection = null;
>> >>>>>>      HBaseAdmin hbaseAdmin = null;
>> >>>>>>      try {
>> >>>>>>          connection =
>> >>>>>>              HConnectionManager.createConnection(HBaseConfiguration.create());
>> >>>>>>          hbaseAdmin = new HBaseAdmin(connection);
>> >>>>>>          List<HRegionInfo> regions =
>> >>>>>>              hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>> >>>>>>          RegionScanner regionScanner = null;
>> >>>>>>          for (HRegionInfo region : regions) {
>> >>>>>>
>> >>>>>>              startKey = region.getStartKey();
>> >>>>>>              stopKey = region.getEndKey();
>> >>>>>>
>> >>>>>>              regionScanner = new RegionScanner(startKey, stopKey,
>> >>>>>>                  scanConfiguration);
>> >>>>>>              // regionScanner = createRegionScanner(startKey, stopKey);
>> >>>>>>              if (regionScanner != null) {
>> >>>>>>                  regionScanners.add(regionScanner);
>> >>>>>>              }
>> >>>>>>          }
>> >>>>>>
>> >>>>>> And I execute the RegionScanner with this:
>> >>>>>> public List<Result> call() throws Exception {
>> >>>>>>      HConnection connection =
>> >>>>>> HConnectionManager.createConnection(HBaseConfiguration.create());
>> >>>>>>      HTableInterface table =
>> >>>>>> connection.getTable(configuration.getTable());
>> >>>>>>
>> >>>>>>  Scan scan = new Scan(startKey, stopKey);
>> >>>>>>      scan.setBatch(configuration.getBatch());
>> >>>>>>      scan.setCaching(configuration.getCaching());
>> >>>>>>      ResultScanner resultScanner = table.getScanner(scan);
>> >>>>>>
>> >>>>>>      List<Result> results = new ArrayList<Result>();
>> >>>>>>      for (Result result : resultScanner) {
>> >>>>>>          results.add(result);
>> >>>>>>      }
>> >>>>>>
>> >>>>>>      connection.close();
>> >>>>>>      table.close();
>> >>>>>>
>> >>>>>>      return results;
>> >>>>>>  }
>> >>>>>>
>> >>>>>> They implement Callable.
>> >>>>>>
>> >>>>>>
>> >>>>>> 2014-09-12 9:26 GMT+02:00 Michael Segel <michael_segel@hotmail.com>:
>> >>>>>>
>> >>>>>>> Let's take a step back…
>> >>>>>>>
>> >>>>>>> Your parallel scan is having the client create N threads where in each
>> >>>>>>> thread, you’re doing a partial scan of the table where each partial
>> >>>>>>> scan takes the first and last row of each region?
>> >>>>>>>
>> >>>>>>> Is that correct?
>> >>>>>>>
>> >>>>>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <konstt2000@gmail.com>
>> >>>>>>> wrote:
>> >>>>>>>
>> >>>>>>>> I looked into this a little more. I checked the cluster, and the data
>> >>>>>>>> is stored on three different region servers, each one on a different
>> >>>>>>>> node. So I guess the threads go to different hard disks.
>> >>>>>>>>
>> >>>>>>>> Does anyone have an idea or suggestion as to why a single scan is
>> >>>>>>>> faster than this implementation? I based mine on this implementation:
>> >>>>>>>> https://github.com/zygm0nt/hbase-distributed-search
>> >>>>>>>>
>> >>>>>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <konstt2000@gmail.com>:
>> >>>>>>>>
>> >>>>>>>>> I'm working with HBase 0.94 for this case; I'll try with 0.98,
>> >>>>>>>>> although I don't expect a difference.
>> >>>>>>>>> I disabled the table, disabled the block cache for that family, and
>> >>>>>>>>> set scan.setCacheBlocks(false) as well in both cases.
>> >>>>>>>>>
>> >>>>>>>>> I don't think it's possible that I'm executing a complete scan in
>> >>>>>>>>> each thread, since my data looks like this:
>> >>>>>>>>> 000001 f:q value=1
>> >>>>>>>>> 000002 f:q value=2
>> >>>>>>>>> 000003 f:q value=3
>> >>>>>>>>> ...
>> >>>>>>>>>
>> >>>>>>>>> I add up all the values and get the same result from a single scan as
>> >>>>>>>>> from the distributed one, so I guess that DistributedScan worked
>> >>>>>>>>> correctly.
>> >>>>>>>>> The count from the hbase shell takes about 10-15 seconds, I don't
>> >>>>>>>>> remember exactly, but something like 4x the scan time.
>> >>>>>>>>> I'm not using any filter for the scans.
>> >>>>>>>>>
>> >>>>>>>>> This is how I calculate the number of regions/scans:
>> >>>>>>>>> private List<RegionScanner> generatePartitions() {
>> >>>>>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>> >>>>>>>>>     byte[] startKey;
>> >>>>>>>>>     byte[] stopKey;
>> >>>>>>>>>     HConnection connection = null;
>> >>>>>>>>>     HBaseAdmin hbaseAdmin = null;
>> >>>>>>>>>     try {
>> >>>>>>>>>         connection =
>> >>>>>>>>>             HConnectionManager.createConnection(HBaseConfiguration.create());
>> >>>>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
>> >>>>>>>>>         List<HRegionInfo> regions =
>> >>>>>>>>>             hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>> >>>>>>>>>         RegionScanner regionScanner = null;
>> >>>>>>>>>         for (HRegionInfo region : regions) {
>> >>>>>>>>>
>> >>>>>>>>>             startKey = region.getStartKey();
>> >>>>>>>>>             stopKey = region.getEndKey();
>> >>>>>>>>>
>> >>>>>>>>>             regionScanner = new RegionScanner(startKey, stopKey,
>> >>>>>>>>>                 scanConfiguration);
>> >>>>>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
>> >>>>>>>>>             if (regionScanner != null) {
>> >>>>>>>>>                 regionScanners.add(regionScanner);
>> >>>>>>>>>             }
>> >>>>>>>>>         }
>> >>>>>>>>>
>> >>>>>>>>> I did some tests on a tiny table and I think that the range for each
>> >>>>>>>>> scan works fine. Still, I thought it was interesting that the
>> >>>>>>>>> distributed scan takes about 6x as long.
>> >>>>>>>>>
>> >>>>>>>>> I'm going to check the hard disks, but I think that it's right.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <la...@apache.org>:
>> >>>>>>>>>
>> >>>>>>>>>> Which version of HBase?
>> >>>>>>>>>> Can you show us the code?
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Your parallel scan with caching 100 takes about 6x as long as
>> the
>> >>>>>>> single
>> >>>>>>>>>> scan, which is suspicious because you say you have 6 regions.
>> >>>>>>>>>> Are you sure you're not accidentally scanning all the data in
>> each
>> >>>> of
>> >>>>>>>>>> your parallel scans?
>> >>>>>>>>>>
>> >>>>>>>>>> -- Lars
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> ________________________________
>> >>>>>>>>>> From: Guillermo Ortiz <ko...@gmail.com>
>> >>>>>>>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
>> >>>>>>>>>> Sent: Wednesday, September 10, 2014 1:40 AM
>> >>>>>>>>>> Subject: Scan vs Parallel scan.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Hi,
>> >>>>>>>>>>
>> >>>>>>>>>> I developed an distributed scan, I create an thread for each
>> >> region.
>> >>>>>>> After
>> >>>>>>>>>> that, I've tried to get some times Scan vs DistributedScan.
>> >>>>>>>>>> I have disabled blockcache in my table. My cluster has 3 region
>> >>>>>>> servers
>> >>>>>>>>>> with 2 regions each one, in total there are 100.000 rows and
>> >>>> execute a
>> >>>>>>>>>> complete scan.
>> >>>>>>>>>>
>> >>>>>>>>>> My partitions are
>> >>>>>>>>>> -01666 -> request 16665
>> >>>>>>>>>> 016666-033332 -> request 16666
>> >>>>>>>>>> 033332-049998 -> request 16666
>> >>>>>>>>>> 049998-066664 -> request 16666
>> >>>>>>>>>> 066664-083330 -> request 16666
>> >>>>>>>>>> 083330- -> request 16671
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>>>>>>>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN
>> >>>>>>> PARALLEL:22089ms,Counter:2 ->
>> >>>>>>>>>> Caching 10
>> >>>>>>>>>>
>> >>>>>>>>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>>>>>>>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN
>> >>>>>>> PARALLEL:16598ms,Counter:2 ->
>> >>>>>>>>>> Caching 100
>> >>>>>>>>>>
>> >>>>>>>>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>>>>>>>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN
>> >>>>>>> PARALLEL:16497ms,Counter:2 ->
>> >>>>>>>>>> Caching 1000
>> >>>>>>>>>>
>> >>>>>>>>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>>>>>>>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN
>> >> NORMAL:68288ms,Counter:2
>> >>>>>>> ->
>> >>>>>>>>>> Caching 1
>> >>>>>>>>>>
>> >>>>>>>>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>>>>>>>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN
>> >> NORMAL:2646ms,Counter:2
>> >>>> ->
>> >>>>>>>>>> Caching 100
>> >>>>>>>>>>
>> >>>>>>>>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> >>>>>>>>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN
>> >> NORMAL:3903ms,Counter:2
>> >>>> ->
>> >>>>>>>>>> Caching 1000
>> >>>>>>>>>>
>> >>>>>>>>>> Parallel scan works much worse than simple scan,, and I don't
>> know
>> >>>> why
>> >>>>>>>>>> it's
>> >>>>>>>>>> so fast,, it's really much faster than execute an "count" from
>> >> hbase
>> >>>>>>>>>> shell,
>> >>>>>>>>>> what it doesn't look pretty notmal. The only time that it works
>> >>>> better
>> >>>>>>>>>> parallel is when I execute a normal scan with caching 1.
>> >>>>>>>>>>
>> >>>>>>>>>> Any clue about it?
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>
>> >>>>
>> >>
>> >>
>>
>>
>

Re: Scan vs Parallel scan.

Posted by Guillermo Ortiz <ko...@gmail.com>.

What am I missing??

2014-09-12 16:05 GMT+02:00 Guillermo Ortiz <ko...@gmail.com>:

> For a partial scan, I guess that the client calls the RS to get the data,
> and the RS starts looking in the store files and collecting the data. (It
> doesn't write to the blockcache in either case.) Once the data is ready, the
> RS hands it to the client step by step; how much per step depends on the
> caching and batching parameters.
>
> The big difference that I see:
> I'm opening more connections to the table, one per region.
>
> I should check the single table scan; it looks like it does the partial
> scans sequentially, since you can see on the HBase Master how the requests
> increase one after another, not all at the same time.

Re: Scan vs Parallel scan.

Posted by Guillermo Ortiz <ko...@gmail.com>.

For a partial scan, I guess that the client calls the RS to get the data,
and the RS starts looking in the store files and collecting the data. (It
doesn't write to the blockcache in either case.) Once the data is ready, the
RS hands it to the client step by step; how much per step depends on the
caching and batching parameters.
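
To put rough numbers on the caching part, here is a small sketch in plain Java (no HBase client involved; the class and method names are made up for illustration). It assumes each call to the server returns at most "caching" rows and it ignores the extra calls that open and close a scanner, so the results are only estimates:

```java
class ScanRoundTrips {
    // Rough count of "next" calls needed to pull `rows` rows when each call
    // returns at most `caching` rows (ceiling division). This is an assumed
    // simplification: opening and closing the scanner is not counted.
    static long roundTrips(long rows, int caching) {
        if (caching <= 0) {
            throw new IllegalArgumentException("caching must be > 0");
        }
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long totalRows = 100_000L; // table size used in this thread
        int[] cachingValues = {1, 10, 100, 1000};
        for (int caching : cachingValues) {
            System.out.println("caching=" + caching + " -> ~"
                    + roundTrips(totalRows, caching) + " round trips");
        }
    }
}
```

Under these assumptions, 100,000 rows need about 100,000 round trips at caching 1 but only about 1,000 at caching 100, which would be consistent with the large gap between the caching 1 and caching 100 timings of the normal scan.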

The big difference that I see:
I'm opening more connections to the table, one per region.

I should check the single table scan; it looks like it does the partial
scans sequentially, since you can see on the HBase Master how the requests
increase one after another, not all at the same time.
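
As a way to reason about the two access patterns, here is a plain-Java sketch with no HBase client involved: a TreeMap stands in for the table, and all class and method names are made up for illustration. It assumes ranges are start-inclusive and stop-exclusive, with an empty key meaning "unbounded", the way the first and last region have empty start and end keys. It only checks that one task per region returns the same rows as a single full scan; it says nothing about timing:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class RangeScanSketch {
    // Stand-in for the table: zero-padded row key -> value, sorted by key.
    static NavigableMap<String, Integer> buildTable(int rows) {
        NavigableMap<String, Integer> table = new TreeMap<>();
        for (int i = 1; i <= rows; i++) {
            table.put(String.format("%06d", i), i);
        }
        return table;
    }

    // Partial scan over [startKey, stopKey); an empty key means unbounded.
    static List<Integer> scan(NavigableMap<String, Integer> table,
                              String startKey, String stopKey) {
        NavigableMap<String, Integer> view = table;
        if (!startKey.isEmpty()) {
            view = view.tailMap(startKey, true);   // start key is inclusive
        }
        if (!stopKey.isEmpty()) {
            view = view.headMap(stopKey, false);   // stop key is exclusive
        }
        return new ArrayList<>(view.values());
    }

    // One task per range between consecutive split keys; results are
    // concatenated in range order. The map is only read, never modified.
    static List<Integer> parallelScan(NavigableMap<String, Integer> table,
                                      String[] splitKeys) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(splitKeys.length + 1);
        try {
            List<Future<List<Integer>>> futures = new ArrayList<>();
            for (int i = 0; i <= splitKeys.length; i++) {
                final String start = (i == 0) ? "" : splitKeys[i - 1];
                final String stop = (i == splitKeys.length) ? "" : splitKeys[i];
                Callable<List<Integer>> task = () -> scan(table, start, stop);
                futures.add(pool.submit(task));
            }
            List<Integer> merged = new ArrayList<>();
            for (Future<List<Integer>> future : futures) {
                merged.addAll(future.get());
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        NavigableMap<String, Integer> table = buildTable(100_000);
        // Split keys taken from the partitions listed in the original post.
        String[] splitKeys = {"016666", "033332", "049998", "066664", "083330"};
        List<Integer> full = scan(table, "", "");
        List<Integer> parallel = parallelScan(table, splitKeys);
        System.out.println("full=" + full.size() + " parallel=" + parallel.size()
                + " equal=" + full.equals(parallel));
    }
}
```

With keys 000001 to 100000, the first range holds 16665 rows and the last holds 16671, the same counts as the partitions in the original post. Because each stop key is exclusive and is reused as the next start key, no row is read twice or skipped.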

2014-09-12 15:23 GMT+02:00 Michael Segel <mi...@hotmail.com>:

> It doesn’t matter which RS; what matters is that you have 1 thread for each region.
>
> So for each thread, what’s happening?
> Step by step, what is the code doing?
>
> Now you’re comparing this against a single table scan, right?
> What’s happening in the table scan…?
>
>
> On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz <ko...@gmail.com> wrote:
>
> > Right, My table for example has keys between 0-9. in three regions
> > 0-2,3-7,7-9
> > I lauch three partial scans in parallel. The scans that I'm executing
> are:
> > scan(0,2), scan(3,7), scan(7,9).
> > Each region is if a different RS, so each thread goes to different RS.
> It's
> > not exactly like that, but on the benchmark case it's like it's working.
> >
> > Really the code will execute a thread for each Region not for each
> > RegionServer. But in the test I only have two regions for regionServer. I
> > dont' think that's an important point, there're two threads for RS.
> >
> > 2014-09-12 14:48 GMT+02:00 Michael Segel <mi...@hotmail.com>:
> >
> >> Ok, lets again take a step back…
> >>
> >> So you are comparing your partial scan(s) against a full table scan?
> >>
> >> If I understood your question, you launch 3 partial scans where you set
> >> the start row and then end row of each scan, right?
> >>
> >> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz <ko...@gmail.com>
> wrote:
> >>
> >>> Okay, then, the partial scan doesn't work as I think.
> >>> How could it exceed the limit of a single region if I calculate the
> >> limits?
> >>>
> >>>
> >>> The only bad point that I see it's that If a region server has three
> >>> regions of the same table,  I'm executing three partial scans about
> this
> >> RS
> >>> and they could compete for resources (network, etc..) on this node.
> It'd
> >> be
> >>> better to have one thread for RS. But, that doesn't answer your
> >> questions.
> >>>
> >>> I keep thinking...
> >>>
> >>> 2014-09-12 9:40 GMT+02:00 Michael Segel <mi...@hotmail.com>:
> >>>
> >>>> Hi,
> >>>>
> >>>> I wanted to take a step back from the actual code and to stop and
> think
> >>>> about what you are doing and what HBase is doing under the covers.
> >>>>
> >>>> So in your code, you are asking HBase to do 3 separate scans and then
> >> you
> >>>> take the result set back and join it.
> >>>>
> >>>> What does HBase do when it does a range scan?
> >>>> What happens when that range scan exceeds a single region?
> >>>>
> >>>> If you answer those questions… you’ll have your answer.
> >>>>
> >>>> HTH
> >>>>
> >>>> -Mike
> >>>>
> >>>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz <ko...@gmail.com>
> >> wrote:
> >>>>
> >>>>> It's not all the code, I set things like these as well:
> >>>>> scan.setMaxVersions();
> >>>>> scan.setCacheBlocks(false);
> >>>>> ...
> >>>>>
> >>>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <ko...@gmail.com>:
> >>>>>
> >>>>>> yes, that is. I have changed the HBase version to 0.98
> >>>>>>
> >>>>>> I got the start and stop keys with this method:
> >>>>>> private List<RegionScanner> generatePartitions() {
> >>>>>>      List<RegionScanner> regionScanners = new
> >>>>>> ArrayList<RegionScanner>();
> >>>>>>      byte[] startKey;
> >>>>>>      byte[] stopKey;
> >>>>>>      HConnection connection = null;
> >>>>>>      HBaseAdmin hbaseAdmin = null;
> >>>>>>      try {
> >>>>>>          connection = HConnectionManager.
> >>>>>> createConnection(HBaseConfiguration.create());
> >>>>>>          hbaseAdmin = new HBaseAdmin(connection);
> >>>>>>          List<HRegionInfo> regions =
> >>>>>> hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> >>>>>>          RegionScanner regionScanner = null;
> >>>>>>          for (HRegionInfo region : regions) {
> >>>>>>
> >>>>>>              startKey = region.getStartKey();
> >>>>>>              stopKey = region.getEndKey();
> >>>>>>
> >>>>>>              regionScanner = new RegionScanner(startKey, stopKey,
> >>>>>> scanConfiguration);
> >>>>>>              // regionScanner = createRegionScanner(startKey,
> >>>> stopKey);
> >>>>>>              if (regionScanner != null) {
> >>>>>>                  regionScanners.add(regionScanner);
> >>>>>>              }
> >>>>>>          }
> >>>>>>
> >>>>>> And I execute the RegionScanner with this:
> >>>>>> public List<Result> call() throws Exception {
> >>>>>>      HConnection connection =
> >>>>>> HConnectionManager.createConnection(HBaseConfiguration.create());
> >>>>>>      HTableInterface table =
> >>>>>> connection.getTable(configuration.getTable());
> >>>>>>
> >>>>>>  Scan scan = new Scan(startKey, stopKey);
> >>>>>>      scan.setBatch(configuration.getBatch());
> >>>>>>      scan.setCaching(configuration.getCaching());
> >>>>>>      ResultScanner resultScanner = table.getScanner(scan);
> >>>>>>
> >>>>>>      List<Result> results = new ArrayList<Result>();
> >>>>>>      for (Result result : resultScanner) {
> >>>>>>          results.add(result);
> >>>>>>      }
> >>>>>>
> >>>>>>      connection.close();
> >>>>>>      table.close();
> >>>>>>
> >>>>>>      return results;
> >>>>>>  }
> >>>>>>
> >>>>>> They implement Callable.
> >>>>>>
> >>>>>>
> >>>>>> 2014-09-12 9:26 GMT+02:00 Michael Segel <michael_segel@hotmail.com
> >:
> >>>>>>
> >>>>>>> Lets take a step back….
> >>>>>>>
> >>>>>>> Your parallel scan is having the client create N threads where in
> >> each
> >>>>>>> thread, you’re doing a partial scan of the table where each partial
> >>>> scan
> >>>>>>> takes the first and last row of each region?
> >>>>>>>
> >>>>>>> Is that correct?
> >>>>>>>
> >>>>>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <konstt2000@gmail.com
> >
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> I was checking a little bit more: I checked the cluster, and the data
> >>>>>>>> is stored on three different region servers, each one on a different
> >>>>>>>> node. So, I guess the threads go to different hard disks.
> >>>>>>>>
> >>>>>>>> Does anyone have an idea or suggestion why a single scan is faster
> >>>>>>>> than this implementation? I based it on this implementation:
> >>>>>>>> https://github.com/zygm0nt/hbase-distributed-search
> >>>>>>>>
> >>>>>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <konstt2000@gmail.com
> >:
> >>>>>>>>
> >>>>>>>>> I'm working with HBase 0.94 for this case,, I'll try with 0.98,
> >>>>>>> although
> >>>>>>>>> there is not difference.
> >>>>>>>>> I disabled the table and disabled the blockcache for that family
> >> and
> >>>> I
> >>>>>>> put
> >>>>>>>>> scan.setCacheBlocks(false) as well for both cases.
> >>>>>>>>>
> >>>>>>>>> I think that it's not possible that I executing an complete scan
> >> for
> >>>>>>> each
> >>>>>>>>> thread since my data are the type:
> >>>>>>>>> 000001 f:q value=1
> >>>>>>>>> 000002 f:q value=2
> >>>>>>>>> 000003 f:q value=3
> >>>>>>>>> ...
> >>>>>>>>>
> >>>>>>>>> I add all the values and get the same result on a single scan
> than
> >> a
> >>>>>>>>> distributed, so, I guess that DistributedScan did well.
> >>>>>>>>> The count from the hbase shell takes about 10-15seconds, I don't
> >>>>>>> remember,
> >>>>>>>>> but like 4x  of the scan time.
> >>>>>>>>> I'm not using any filter for the scans.
> >>>>>>>>>
> >>>>>>>>> This is the way I calculate number of regions/scans
> >>>>>>>>> private List<RegionScanner> generatePartitions() {
> >>>>>>>>>     List<RegionScanner> regionScanners = new
> >>>>>>>>> ArrayList<RegionScanner>();
> >>>>>>>>>     byte[] startKey;
> >>>>>>>>>     byte[] stopKey;
> >>>>>>>>>     HConnection connection = null;
> >>>>>>>>>     HBaseAdmin hbaseAdmin = null;
> >>>>>>>>>     try {
> >>>>>>>>>         connection =
> >>>>>>>>> HConnectionManager.createConnection(HBaseConfiguration.create());
> >>>>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
> >>>>>>>>>         List<HRegionInfo> regions =
> >>>>>>>>> hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> >>>>>>>>>         RegionScanner regionScanner = null;
> >>>>>>>>>         for (HRegionInfo region : regions) {
> >>>>>>>>>
> >>>>>>>>>             startKey = region.getStartKey();
> >>>>>>>>>             stopKey = region.getEndKey();
> >>>>>>>>>
> >>>>>>>>>             regionScanner = new RegionScanner(startKey, stopKey,
> >>>>>>>>> scanConfiguration);
> >>>>>>>>>             // regionScanner = createRegionScanner(startKey,
> >>>>>>> stopKey);
> >>>>>>>>>             if (regionScanner != null) {
> >>>>>>>>>                 regionScanners.add(regionScanner);
> >>>>>>>>>             }
> >>>>>>>>>         }
> >>>>>>>>>
> >>>>>>>>> I did some test for a tiny table and I think that the range for
> >> each
> >>>>>>> scan
> >>>>>>>>> works fine. Although, I though that it was interesting that the
> >> time
> >>>>>>> when I
> >>>>>>>>> execute distributed scan is about 6x.
> >>>>>>>>>
> >>>>>>>>> I'm going to check about the hard disks, but I think that ti's
> >> right.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <la...@apache.org>:
> >>>>>>>>>
> >>>>>>>>>> Which version of HBase?
> >>>>>>>>>> Can you show us the code?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Your parallel scan with caching 100 takes about 6x as long as
> the
> >>>>>>> single
> >>>>>>>>>> scan, which is suspicious because you say you have 6 regions.
> >>>>>>>>>> Are you sure you're not accidentally scanning all the data in
> each
> >>>> of
> >>>>>>>>>> your parallel scans?
> >>>>>>>>>>
> >>>>>>>>>> -- Lars
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> ________________________________
> >>>>>>>>>> From: Guillermo Ortiz <ko...@gmail.com>
> >>>>>>>>>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> >>>>>>>>>> Sent: Wednesday, September 10, 2014 1:40 AM
> >>>>>>>>>> Subject: Scan vs Parallel scan.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I developed an distributed scan, I create an thread for each
> >> region.
> >>>>>>> After
> >>>>>>>>>> that, I've tried to get some times Scan vs DistributedScan.
> >>>>>>>>>> I have disabled blockcache in my table. My cluster has 3 region
> >>>>>>> servers
> >>>>>>>>>> with 2 regions each one, in total there are 100.000 rows and
> >>>> execute a
> >>>>>>>>>> complete scan.
> >>>>>>>>>>
> >>>>>>>>>> My partitions are
> >>>>>>>>>> -01666 -> request 16665
> >>>>>>>>>> 016666-033332 -> request 16666
> >>>>>>>>>> 033332-049998 -> request 16666
> >>>>>>>>>> 049998-066664 -> request 16666
> >>>>>>>>>> 066664-083330 -> request 16666
> >>>>>>>>>> 083330- -> request 16671
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN
> >>>>>>> PARALLEL:22089ms,Counter:2 ->
> >>>>>>>>>> Caching 10
> >>>>>>>>>>
> >>>>>>>>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN
> >>>>>>> PARALLEL:16598ms,Counter:2 ->
> >>>>>>>>>> Caching 100
> >>>>>>>>>>
> >>>>>>>>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN
> >>>>>>> PARALLEL:16497ms,Counter:2 ->
> >>>>>>>>>> Caching 1000
> >>>>>>>>>>
> >>>>>>>>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN
> >> NORMAL:68288ms,Counter:2
> >>>>>>> ->
> >>>>>>>>>> Caching 1
> >>>>>>>>>>
> >>>>>>>>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN
> >> NORMAL:2646ms,Counter:2
> >>>> ->
> >>>>>>>>>> Caching 100
> >>>>>>>>>>
> >>>>>>>>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN
> >> NORMAL:3903ms,Counter:2
> >>>> ->
> >>>>>>>>>> Caching 1000
> >>>>>>>>>>
> >>>>>>>>>> Parallel scan works much worse than simple scan, and I don't know why
> >>>>>>>>>> the simple scan is so fast; it's really much faster than executing a
> >>>>>>>>>> "count" from the hbase shell, which doesn't look normal. The only time
> >>>>>>>>>> that parallel works better is when I execute the normal scan with
> >>>>>>>>>> caching 1.
> >>>>>>>>>>
> >>>>>>>>>> Any clue about it?
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>
>

Re: Scan vs Parallel scan.

Posted by Michael Segel <mi...@hotmail.com>.

It doesn’t matter which RS; what matters is that you have 1 thread for each region.

So for each thread, what’s happening?
Step by step, what is the code doing?

Now you’re comparing this against a single table scan, right? 
What’s happening in the table scan…?


On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz <ko...@gmail.com> wrote:

> Right. My table, for example, has keys between 0 and 9 in three regions:
> 0-2, 3-7, 7-9.
> I launch three partial scans in parallel. The scans that I'm executing are:
> scan(0,2), scan(3,7), scan(7,9).
> Each region is on a different RS, so each thread goes to a different RS. It's
> not exactly like that, but in the benchmark case that's how it's working.
>
> Really, the code executes a thread for each region, not for each
> RegionServer. But in the test I only have two regions per RegionServer. I
> don't think that's an important point; there are two threads per RS.
> 

Re: Scan vs Parallel scan.

Posted by Guillermo Ortiz <ko...@gmail.com>.

Right. My table, for example, has keys between 0 and 9 in three regions:
0-2, 3-7, 7-9.
I launch three partial scans in parallel. The scans that I'm executing are:
scan(0,2), scan(3,7), scan(7,9).
Each region is on a different RS, so each thread goes to a different RS. It's
not exactly like that, but in the benchmark case that's how it's working.

Really, the code executes a thread for each region, not for each
RegionServer. But in the test I only have two regions per RegionServer. I
don't think that's an important point; there are two threads per RS.
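
To make the setup concrete, here is a minimal, self-contained sketch of "one partial scan per key range, run in parallel, then join the results". It does not use the HBase client at all: the table is a TreeMap, and KeyRange, buildTable, scanRange and scanInParallel are hypothetical names made up for illustration. It assumes contiguous ranges where the start key is inclusive and the end key is exclusive, and a null bound means "open-ended".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRangeScan {

    // Hypothetical stand-in for a region: start inclusive, end exclusive, null = unbounded.
    static final class KeyRange {
        final String start;
        final String end;

        KeyRange(String start, String end) {
            this.start = start;
            this.end = end;
        }
    }

    // Builds an in-memory "table" with row keys "0" .. (rows - 1); this is not an HBase table.
    static NavigableMap<String, String> buildTable(int rows) {
        NavigableMap<String, String> table = new TreeMap<>();
        for (int i = 0; i < rows; i++) {
            table.put(Integer.toString(i), "value=" + i);
        }
        return table;
    }

    // Partial scan: returns the values whose keys fall inside one range.
    static List<String> scanRange(NavigableMap<String, String> table, KeyRange range) {
        NavigableMap<String, String> view = table;
        if (range.start != null) {
            view = view.tailMap(range.start, true);
        }
        if (range.end != null) {
            view = view.headMap(range.end, false);
        }
        return new ArrayList<>(view.values());
    }

    // Runs one partial scan per range on its own thread and joins the results in range order.
    static List<String> scanInParallel(NavigableMap<String, String> table, List<KeyRange> ranges)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, ranges.size()));
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (KeyRange range : ranges) {
                futures.add(pool.submit(() -> scanRange(table, range)));
            }
            List<String> joined = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                joined.addAll(future.get());
            }
            return joined;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        NavigableMap<String, String> table = buildTable(10);
        List<KeyRange> ranges = Arrays.asList(
                new KeyRange(null, "3"), new KeyRange("3", "7"), new KeyRange("7", null));
        List<String> parallel = scanInParallel(table, ranges);
        List<String> single = new ArrayList<>(table.values());
        System.out.println("rows=" + parallel.size()
                + " sameAsSingleScan=" + parallel.equals(single));
    }
}
```

The sketch only checks that the partial scans together return the same rows as a single scan. It says nothing about timing, since an in-memory map has none of the RPC, connection or disk costs of a real cluster.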

2014-09-12 14:48 GMT+02:00 Michael Segel <mi...@hotmail.com>:

> Ok, let's again take a step back…
>
> So you are comparing your partial scan(s) against a full table scan?
>
> If I understood your question, you launch 3 partial scans where you set
> the start row and the end row of each scan, right?
>

Re: Scan vs Parallel scan.

Posted by Michael Segel <mi...@hotmail.com>.

Ok, let's again take a step back…

So you are comparing your partial scan(s) against a full table scan?

If I understood your question, you launch 3 partial scans where you set the start row and the end row of each scan, right?

On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz <ko...@gmail.com> wrote:

> Okay, then the partial scan doesn't work the way I thought.
> How could it exceed the limit of a single region if I calculate the limits?
>
>
> The only bad point that I see is that if a region server has three
> regions of the same table, I'm executing three partial scans against this RS
> and they could compete for resources (network, etc.) on this node. It'd be
> better to have one thread per RS. But that doesn't answer your questions.
>
> I'll keep thinking...
> 

Re: Scan vs Parallel scan.

Posted by Guillermo Ortiz <ko...@gmail.com>.

Okay, so the partial scan doesn't work the way I thought it did.
How could a scan exceed the limits of a single region if I compute its start
and stop keys from the region boundaries?


The only drawback I can see is that if a region server hosts three regions
of the same table, I'm executing three partial scans against that RS, and
they could compete for resources (network, etc.) on that node. It would be
better to have one thread per RS. But that doesn't answer your questions.

I'll keep thinking...
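
The one-thread-per-region-server idea mentioned above could be sketched roughly as follows. This is an HBase-free illustration only: `RegionRange`, `groupByServer`, and the server names `rs1`..`rs3` are hypothetical stand-ins, and the assignment of regions to servers is assumed rather than taken from the cluster. In a real client the region-to-server mapping would have to come from the cluster's region locations.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerServerPartitioner {

    /** Hypothetical stand-in for one region's key range and its hosting server. */
    public static final class RegionRange {
        final String server;
        final String startKey; // "" means "from the beginning of the table"
        final String stopKey;  // "" means "to the end of the table"

        public RegionRange(String server, String startKey, String stopKey) {
            this.server = server;
            this.startKey = startKey;
            this.stopKey = stopKey;
        }
    }

    /** Groups region ranges by hosting server, preserving the input order. */
    public static Map<String, List<RegionRange>> groupByServer(List<RegionRange> regions) {
        Map<String, List<RegionRange>> byServer = new LinkedHashMap<String, List<RegionRange>>();
        for (RegionRange region : regions) {
            List<RegionRange> ranges = byServer.get(region.server);
            if (ranges == null) {
                ranges = new ArrayList<RegionRange>();
                byServer.put(region.server, ranges);
            }
            ranges.add(region);
        }
        return byServer;
    }

    public static void main(String[] args) {
        // Key boundaries from the original post; the server assignment is assumed.
        List<RegionRange> regions = new ArrayList<RegionRange>();
        regions.add(new RegionRange("rs1", "", "016666"));
        regions.add(new RegionRange("rs2", "016666", "033332"));
        regions.add(new RegionRange("rs3", "033332", "049998"));
        regions.add(new RegionRange("rs1", "049998", "066664"));
        regions.add(new RegionRange("rs2", "066664", "083330"));
        regions.add(new RegionRange("rs3", "083330", ""));

        // One worker per server would then scan its ranges one after another.
        for (Map.Entry<String, List<RegionRange>> entry : groupByServer(regions).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().size() + " range(s)");
        }
    }
}
```

With this grouping, the number of concurrent scans hitting any one node is capped at one, at the cost of less parallelism overall.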

Re: Scan vs Parallel scan.

Posted by Michael Segel <mi...@hotmail.com>.

Hi, 

I wanted to take a step back from the actual code and to stop and think about what you are doing and what HBase is doing under the covers. 

So in your code, you are asking HBase to do 3 separate scans and then you take the result set back and join it. 

What does HBase do when it does a range scan? 
What happens when that range scan exceeds a single region? 

If you answer those questions… you’ll have your answer. 

HTH

-Mike
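
To make the two questions above concrete: as far as I know, a client-side range scan in HBase is served region by region in row-key order, so a range that crosses a region boundary simply moves on to the next region. The HBase-free sketch below works out which regions a given [startRow, stopRow) range would touch, using the partition boundaries from the original post. `regionsForRange`, the region names `r1`..`r6`, and the use of strings for row keys are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RangeScanRegions {

    /**
     * Returns the names of the regions that a scan over [startRow, stopRow)
     * has to visit, in visiting order. The map goes from region start key to
     * region name; the first region must use the empty start key. An empty
     * stopRow means "scan to the end of the table".
     */
    public static List<String> regionsForRange(TreeMap<String, String> regionsByStartKey,
                                               String startRow, String stopRow) {
        List<String> visited = new ArrayList<String>();
        // The region holding startRow has the greatest start key <= startRow.
        String first = regionsByStartKey.floorKey(startRow);
        for (Map.Entry<String, String> e : regionsByStartKey.tailMap(first, true).entrySet()) {
            // Stop once a region begins at or after the exclusive stop row.
            if (!stopRow.isEmpty() && e.getKey().compareTo(stopRow) >= 0) {
                break;
            }
            visited.add(e.getValue());
        }
        return visited;
    }

    public static void main(String[] args) {
        TreeMap<String, String> regions = new TreeMap<String, String>();
        regions.put("", "r1");
        regions.put("016666", "r2");
        regions.put("033332", "r3");
        regions.put("049998", "r4");
        regions.put("066664", "r5");
        regions.put("083330", "r6");

        // A range aligned to one region's boundaries stays inside that region.
        System.out.println(regionsForRange(regions, "016666", "033332"));
        // A range that is not aligned spills over into neighbouring regions.
        System.out.println(regionsForRange(regions, "010000", "040000"));
        // An unbounded scan walks every region in order.
        System.out.println(regionsForRange(regions, "", ""));
    }
}
```

If each partial scan is bounded exactly by one region's start and end keys, it should touch only that region, which is why it is worth checking the keys actually passed to each scan.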

Re: Scan vs Parallel scan.

Posted by Guillermo Ortiz <ko...@gmail.com>.

That's not all of the code; I also set things like these:
scan.setMaxVersions();
scan.setCacheBlocks(false);
...

Re: Scan vs Parallel scan.

Posted by Guillermo Ortiz <ko...@gmail.com>.

Yes, that's right. I have switched to HBase 0.98.

I get the start and stop keys with this method:
private List<RegionScanner> generatePartitions() {
        List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
        byte[] startKey;
        byte[] stopKey;
        HConnection connection = null;
        HBaseAdmin hbaseAdmin = null;
        try {
            connection = HConnectionManager.createConnection(HBaseConfiguration.create());
            hbaseAdmin = new HBaseAdmin(connection);
            List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
            RegionScanner regionScanner = null;
            for (HRegionInfo region : regions) {
                // One partial scan per region, bounded by that region's own keys.
                // The first region's start key and the last region's end key are empty.
                startKey = region.getStartKey();
                stopKey = region.getEndKey();

                regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
                if (regionScanner != null) {
                    regionScanners.add(regionScanner);
                }
            }
            // ... rest of the method omitted

And I execute each RegionScanner with this:
public List<Result> call() throws Exception {
        // Each task opens its own connection and table handle.
        HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create());
        HTableInterface table = connection.getTable(configuration.getTable());
        try {
            Scan scan = new Scan(startKey, stopKey);
            scan.setBatch(configuration.getBatch());
            scan.setCaching(configuration.getCaching());
            ResultScanner resultScanner = table.getScanner(scan);
            try {
                List<Result> results = new ArrayList<Result>();
                for (Result result : resultScanner) {
                    results.add(result);
                }
                return results;
            } finally {
                // Release the server-side scanner.
                resultScanner.close();
            }
        } finally {
            // Close the table before the connection that produced it.
            table.close();
            connection.close();
        }
    }

RegionScanner implements Callable<List<Result>>.
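
For reference, Callables like these are commonly driven by a fixed thread pool via `ExecutorService.invokeAll`, with the per-task results concatenated in submission order. The sketch below is HBase-free and is not the code used for the timings in this thread: `PartialScan` is a hypothetical stand-in that generates synthetic row keys instead of calling `table.getScanner`, and `runAll` is an illustrative helper name. The ranges in `main` are chosen to mirror the request counts listed in the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScanDriver {

    /** Stand-in for a per-region scan: returns synthetic row keys in [start, stop). */
    public static final class PartialScan implements Callable<List<String>> {
        private final int start;
        private final int stop;

        public PartialScan(int start, int stop) {
            this.start = start;
            this.stop = stop;
        }

        @Override
        public List<String> call() {
            List<String> rows = new ArrayList<String>();
            for (int i = start; i < stop; i++) {
                rows.add(String.format("%06d", i));
            }
            return rows;
        }
    }

    /** Runs all tasks on a fixed pool and concatenates their results in task order. */
    public static <T> List<T> runAll(List<? extends Callable<List<T>>> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<T> merged = new ArrayList<T>();
            // invokeAll blocks until every task has completed.
            for (Future<List<T>> future : pool.invokeAll(tasks)) {
                merged.addAll(future.get());
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Six ranges mirroring the partitions in the original post.
        int[] bounds = {1, 16666, 33332, 49998, 66664, 83330, 100001};
        List<PartialScan> tasks = new ArrayList<PartialScan>();
        for (int i = 0; i + 1 < bounds.length; i++) {
            tasks.add(new PartialScan(bounds[i], bounds[i + 1]));
        }
        List<String> rows = runAll(tasks, tasks.size());
        System.out.println("NUM ROWS " + rows.size());
    }
}
```

Note that this pattern buffers every row in client memory before returning, just as the `call()` method above does, so the merge step itself adds to the measured time.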


2014-09-12 9:26 GMT+02:00 Michael Segel <mi...@hotmail.com>:

> Let's take a step back….
>
> Your parallel scan is having the client create N threads where in each
> thread, you’re doing a partial scan of the table where each partial scan
> takes the first and last row of each region?
>
> Is that correct?
>
> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <ko...@gmail.com> wrote:
>
> > I looked into this a bit more. I checked the cluster, and the data is
> > stored on three different region servers, each on a different node.
> > So I guess the threads go to different hard disks.
> >
> > If anyone has an idea or suggestion as to why a single scan is faster than
> > this implementation, please let me know. I based it on this implementation:
> > https://github.com/zygm0nt/hbase-distributed-search
> >
> > 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <ko...@gmail.com>:
> >
> >> I'm working with HBase 0.94 for this case. I'll try with 0.98, although
> >> there should be no difference.
> >> I disabled the table, disabled the block cache for that family, and also
> >> called scan.setCacheBlocks(false) in both cases.
> >>
> >> I don't think it's possible that I'm executing a complete scan in each
> >> thread, since my data looks like this:
> >> 000001 f:q value=1
> >> 000002 f:q value=2
> >> 000003 f:q value=3
> >> ...
> >>
> >> I sum all the values and get the same result from a single scan as from
> >> the distributed one, so I guess DistributedScan did its job correctly.
> >> The count from the hbase shell takes about 10-15 seconds (I don't
> >> remember exactly), roughly 4x the scan time.
> >> I'm not using any filter for the scans.
> >>
> >> This is the way I calculate number of regions/scans
> >> private List<RegionScanner> generatePartitions() {
> >>        List<RegionScanner> regionScanners = new
> >> ArrayList<RegionScanner>();
> >>        byte[] startKey;
> >>        byte[] stopKey;
> >>        HConnection connection = null;
> >>        HBaseAdmin hbaseAdmin = null;
> >>        try {
> >>            connection =
> >> HConnectionManager.createConnection(HBaseConfiguration.create());
> >>            hbaseAdmin = new HBaseAdmin(connection);
> >>            List<HRegionInfo> regions =
> >> hbaseAdmin.getTableRegions(scanConfiguration.getTable());
> >>            RegionScanner regionScanner = null;
> >>            for (HRegionInfo region : regions) {
> >>
> >>                startKey = region.getStartKey();
> >>                stopKey = region.getEndKey();
> >>
> >>                regionScanner = new RegionScanner(startKey, stopKey,
> >> scanConfiguration);
> >>                // regionScanner = createRegionScanner(startKey,
> stopKey);
> >>                if (regionScanner != null) {
> >>                    regionScanners.add(regionScanner);
> >>                }
> >>            }
> >>
> >> I ran some tests on a tiny table and I think the range for each scan
> >> works fine. Still, I thought it was interesting that the distributed
> >> scan takes about 6x as long.
> >>
> >> I'm going to check the hard disks, but I think it's right.
> >>
> >>
> >>
> >>
> >> 2014-09-11 7:50 GMT+02:00 lars hofhansl <la...@apache.org>:
> >>
> >>> Which version of HBase?
> >>> Can you show us the code?
> >>>
> >>>
> >>> Your parallel scan with caching 100 takes about 6x as long as the
> single
> >>> scan, which is suspicious because you say you have 6 regions.
> >>> Are you sure you're not accidentally scanning all the data in each of
> >>> your parallel scans?
> >>>
> >>> -- Lars
> >>>
> >>>
> >>>
> >>> ________________________________
> >>> From: Guillermo Ortiz <ko...@gmail.com>
> >>> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> >>> Sent: Wednesday, September 10, 2014 1:40 AM
> >>> Subject: Scan vs Parallel scan.
> >>>
> >>>
> >>> Hi,
> >>>
> >>> I developed an distributed scan, I create an thread for each region.
> After
> >>> that, I've tried to get some times Scan vs DistributedScan.
> >>> I have disabled blockcache in my table. My cluster has 3 region servers
> >>> with 2 regions each one, in total there are 100.000 rows and execute a
> >>> complete scan.
> >>>
> >>> My partitions are
> >>> -01666 -> request 16665
> >>> 016666-033332 -> request 16666
> >>> 033332-049998 -> request 16666
> >>> 049998-066664 -> request 16666
> >>> 066664-083330 -> request 16666
> >>> 083330- -> request 16671
> >>>
> >>>
> >>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2
> ->
> >>> Caching 10
> >>>
> >>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALJEL:16598ms,Counter:2
> ->
> >>> Caching 100
> >>>
> >>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2
> ->
> >>> Caching 1000
> >>>
> >>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 ->
> >>> Caching 1
> >>>
> >>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 ->
> >>> Caching 100
> >>>
> >>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 ->
> >>> Caching 1000
> >>>
> >>> Parallel scan works much worse than simple scan,, and I don't know why
> >>> it's
> >>> so fast,, it's really much faster than execute an "count" from hbase
> >>> shell,
> >>> what it doesn't look pretty notmal. The only time that it works better
> >>> parallel is when I execute a normal scan with caching 1.
> >>>
> >>> Any clue about it?
> >>>
> >>
> >>
>
>

Re: Scan vs Parallel scan.

Posted by Michael Segel <mi...@hotmail.com>.

Let's take a step back.

Your parallel scan has the client create N threads, and each thread does a partial scan of the table, bounded by the first and last row of one region?

Is that correct?

On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <ko...@gmail.com> wrote:

> I looked into this a bit more. I checked the cluster, and the data is
> stored on three different region servers, each one on a different node.
> So I guess the threads go to different hard disks.
> 
> Does anyone have an idea or suggestion as to why a single scan is faster
> than this implementation? I based it on this implementation:
> https://github.com/zygm0nt/hbase-distributed-search

Re: Scan vs Parallel scan.

Posted by Guillermo Ortiz <ko...@gmail.com>.

I looked into this a bit more. I checked the cluster, and the data is
stored on three different region servers, each one on a different node.
So I guess the threads go to different hard disks.

Does anyone have an idea or suggestion as to why a single scan is faster
than this implementation? I based it on this implementation:
https://github.com/zygm0nt/hbase-distributed-search

2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <ko...@gmail.com>:

> I'm working with HBase 0.94 in this case. I'll try with 0.98, although
> I don't think it will make a difference.
> I disabled the table, disabled the block cache for that family, and also
> called scan.setCacheBlocks(false) in both cases.
>
> I don't think it's possible that I'm executing a complete scan in each
> thread, since my data looks like this:
> 000001 f:q value=1
> 000002 f:q value=2
> 000003 f:q value=3
> ...
>
> I add up all the values and get the same result from a single scan as from
> the distributed one, so I guess that DistributedScan works correctly.
> The count from the hbase shell takes about 10-15 seconds, I don't remember
> exactly, but something like 4x the scan time.
> I'm not using any filter for the scans.
>
> This is how I calculate the number of regions/scans:
> private List<RegionScanner> generatePartitions() {
>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>     byte[] startKey;
>     byte[] stopKey;
>     HConnection connection = null;
>     HBaseAdmin hbaseAdmin = null;
>     try {
>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>         hbaseAdmin = new HBaseAdmin(connection);
>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>         RegionScanner regionScanner = null;
>         // One scanner per region, bounded by that region's start and end keys.
>         for (HRegionInfo region : regions) {
>
>             startKey = region.getStartKey();
>             stopKey = region.getEndKey();
>
>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>             // regionScanner = createRegionScanner(startKey, stopKey);
>             if (regionScanner != null) {
>                 regionScanners.add(regionScanner);
>             }
>         }
>
> I did some tests on a tiny table, and I think the range for each scan
> is correct. Still, I thought it was interesting that the distributed scan
> takes about 6x as long.
>
> I'm going to check the hard disks, but I think that's fine.

Re: Scan vs Parallel scan.

Posted by Guillermo Ortiz <ko...@gmail.com>.

I'm working with HBase 0.94 in this case. I'll try with 0.98, although
I don't think it will make a difference.
I disabled the table, disabled the block cache for that family, and also
called scan.setCacheBlocks(false) in both cases.

I don't think it's possible that I'm executing a complete scan in each
thread, since my data looks like this:
000001 f:q value=1
000002 f:q value=2
000003 f:q value=3
...

I add up all the values and get the same result from a single scan as from
the distributed one, so I guess that DistributedScan works correctly.
The count from the hbase shell takes about 10-15 seconds, I don't remember
exactly, but something like 4x the scan time.
I'm not using any filter for the scans.

This is how I calculate the number of regions/scans:
private List<RegionScanner> generatePartitions() {
    List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
    byte[] startKey;
    byte[] stopKey;
    HConnection connection = null;
    HBaseAdmin hbaseAdmin = null;
    try {
        connection = HConnectionManager.createConnection(HBaseConfiguration.create());
        hbaseAdmin = new HBaseAdmin(connection);
        List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
        RegionScanner regionScanner = null;
        // One scanner per region, bounded by that region's start and end keys.
        for (HRegionInfo region : regions) {

            startKey = region.getStartKey();
            stopKey = region.getEndKey();

            regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
            // regionScanner = createRegionScanner(startKey, stopKey);
            if (regionScanner != null) {
                regionScanners.add(regionScanner);
            }
        }

I did some tests on a tiny table, and I think the range for each scan
is correct. Still, I thought it was interesting that the distributed scan
takes about 6x as long.

I'm going to check the hard disks, but I think that's fine.
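
The range check described above can be reproduced without a cluster. The following is only a rough sketch, not the code from this thread: it assumes an in-memory TreeMap as a stand-in for the table, takes the split keys from the partition list in the original post, and treats each range as start-inclusive and stop-exclusive, with an empty key meaning "unbounded". PartitionCheck, buildTable, range and scanPartitions are made-up names for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PartitionCheck {
    // Split points taken from the partition list in the original post (assumed region boundaries).
    static final String[] SPLITS = {"016666", "033332", "049998", "066664", "083330"};

    // In-memory stand-in for the table: keys 000001..100000, value = row number.
    static NavigableMap<String, Long> buildTable() {
        NavigableMap<String, Long> table = new TreeMap<>();
        for (long i = 1; i <= 100_000; i++) {
            table.put(String.format(Locale.ROOT, "%06d", i), i);
        }
        return table;
    }

    // Rows in [start, stop). An empty string means "unbounded", mirroring the
    // empty start/end keys of the first and last region.
    static NavigableMap<String, Long> range(NavigableMap<String, Long> table,
                                            String start, String stop) {
        NavigableMap<String, Long> view = table;
        if (!start.isEmpty()) {
            view = view.tailMap(start, true);
        }
        if (!stop.isEmpty()) {
            view = view.headMap(stop, false);
        }
        return view;
    }

    // One task per partition; each task returns {rowCount, sumOfValues}.
    static List<long[]> scanPartitions(NavigableMap<String, Long> table) {
        ExecutorService pool = Executors.newFixedThreadPool(SPLITS.length + 1);
        try {
            List<Future<long[]>> futures = new ArrayList<>();
            for (int i = 0; i <= SPLITS.length; i++) {
                final String start = (i == 0) ? "" : SPLITS[i - 1];
                final String stop = (i == SPLITS.length) ? "" : SPLITS[i];
                futures.add(pool.submit(() -> {
                    long rows = 0;
                    long sum = 0;
                    for (long value : range(table, start, stop).values()) {
                        rows++;
                        sum += value;
                    }
                    return new long[] {rows, sum};
                }));
            }
            List<long[]> results = new ArrayList<>();
            for (Future<long[]> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long totalRows = 0;
        long totalSum = 0;
        for (long[] part : scanPartitions(buildTable())) {
            System.out.println("rows=" + part[0] + " sum=" + part[1]);
            totalRows += part[0];
            totalSum += part[1];
        }
        System.out.println("total rows=" + totalRows + " total sum=" + totalSum);
    }
}
```

With row keys 000001 through 100000, the six ranges hold 16665, 16666, 16666, 16666, 16666 and 16671 rows, which matches the request counts in the original post, and the per-range sums add up to the single-scan total. A matching sum only shows that every row is read exactly once overall; it does not by itself show how much data each thread touched on the server side.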




2014-09-11 7:50 GMT+02:00 lars hofhansl <la...@apache.org>:

> Which version of HBase?
> Can you show us the code?
>
>
> Your parallel scan with caching 100 takes about 6x as long as the single
> scan, which is suspicious because you say you have 6 regions.
> Are you sure you're not accidentally scanning all the data in each of your
> parallel scans?
>
> -- Lars

Re: Scan vs Parallel scan.

Posted by lars hofhansl <la...@apache.org>.

Which version of HBase?
Can you show us the code?


Your parallel scan with caching 100 takes about 6x as long as the single scan, which is suspicious because you say you have 6 regions.
Are you sure you're not accidentally scanning all the data in each of your parallel scans?
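
The ratio can be checked directly from the timings in the original post. A small sketch (ScanRatio and slowdown are hypothetical names used only here):

```java
class ScanRatio {
    // Ratio of parallel wall time to single-scan wall time for the same caching value.
    static double slowdown(long parallelMs, long singleMs) {
        return (double) parallelMs / singleMs;
    }

    public static void main(String[] args) {
        // Timings copied from the log lines in the original post.
        System.out.println("caching 100:  " + slowdown(16598, 2646));
        System.out.println("caching 1000: " + slowdown(16497, 3903));
    }
}
```

That gives roughly 6.3x for caching 100 and roughly 4.2x for caching 1000, so the match with the region count is close only for the caching 100 run.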

-- Lars



________________________________
 From: Guillermo Ortiz <ko...@gmail.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org> 
Sent: Wednesday, September 10, 2014 1:40 AM
Subject: Scan vs Parallel scan.
 

Hi,

I developed a distributed scan that creates a thread for each region. After
that, I compared timings of Scan vs DistributedScan.
I have disabled the block cache on my table. My cluster has 3 region servers
with 2 regions each; in total there are 100,000 rows, and I execute a
complete scan.

My partitions are
-01666 -> request 16665
016666-033332 -> request 16666
033332-049998 -> request 16666
049998-066664 -> request 16666
066664-083330 -> request 16666
083330- -> request 16671


14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 ->
Caching 10

14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 ->
Caching 100

14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 ->
Caching 1000

14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 ->
Caching 1

14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 ->
Caching 100

14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 ->
Caching 1000

The parallel scan performs much worse than the simple scan, and I don't know
why the simple scan is so fast; it's much faster than executing a "count"
from the hbase shell, which doesn't look normal. The only case where the
parallel scan does better is when I run the normal scan with caching 1.

Any clue about it?
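
One way to read the caching 1 result is in terms of client/server round trips. The sketch below assumes, as a simplification, that a scan needs about one fetch call per batch of "caching" rows and ignores the extra calls for opening and closing scanners and for crossing region boundaries. ScanRoundTrips and roundTrips are hypothetical names.

```java
class ScanRoundTrips {
    // Approximate number of fetch calls: ceil(rows / caching).
    static long roundTrips(long rows, int caching) {
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 100_000;
        for (int caching : new int[] {1, 10, 100, 1000}) {
            System.out.println("caching " + caching + " -> about "
                    + roundTrips(rows, caching) + " round trips");
        }
    }
}
```

Under that assumption, caching 1 means about 100,000 round trips for the full table, against about 1,000 with caching 100, which is consistent with the 68288 ms run being dominated by per-call overhead rather than by the amount of data read.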