Posted to issues@hbase.apache.org by "Lars Hofhansl (JIRA)" <ji...@apache.org> on 2013/08/23 20:51:56 UTC

[jira] [Comment Edited] (HBASE-9272) A simple parallel, unordered scanner

    [ https://issues.apache.org/jira/browse/HBASE-9272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748853#comment-13748853 ] 

Lars Hofhansl edited comment on HBASE-9272 at 8/23/13 6:51 PM:
---------------------------------------------------------------

Some more data:
30 million rows, 2 CFs, 5 columns each, with 100-byte values, split into 128 regions.

When all data is returned to the client, throughput is limited by what the client can consume (network transfer plus actually iterating over the results). All numbers are in seconds:
||ClientScanner||1 thread||2 threads||5 threads||10 threads||50 threads||
|519|529|303|192|189|187|

When all data is filtered out on the server with a ValueFilter (as in an analytics query):
||ClientScanner||1 thread||2 threads||5 threads||10 threads||50 threads||
|53.3|53.3|28.4|11.6|6.42|1.88|
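
To compare the two tables, it can help to express each timing as a speedup over the plain ClientScanner run. The helper below is only a convenience sketch: the class and method names are made up here and are not part of the patch, and it does nothing more than divide the baseline by each measurement taken from the tables above.

```java
// Hypothetical helper, not part of HBASE-9272: turns the timings above into speedups.
final class ScanSpeedups {

    // Speedup of each timing relative to the baseline (baseline / timing).
    static double[] speedups(double baselineSeconds, double[] seconds) {
        double[] out = new double[seconds.length];
        for (int i = 0; i < seconds.length; i++) {
            out[i] = baselineSeconds / seconds[i];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] threads = {1, 2, 5, 10, 50};
        // Timings in seconds, copied from the two tables above.
        double[] allReturned = speedups(519, new double[] {529, 303, 192, 189, 187});
        double[] allFiltered = speedups(53.3, new double[] {53.3, 28.4, 11.6, 6.42, 1.88});
        for (int i = 0; i < threads.length; i++) {
            System.out.printf("%d threads: all returned %.2fx, all filtered %.2fx%n",
                    threads[i], allReturned[i], allFiltered[i]);
        }
    }
}
```

Read this way, the all-returned case levels off at roughly 2.8x from about 5 threads on, which fits the client being the bottleneck, while the filtered case keeps improving up to roughly 28x at 50 threads.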

                
      was (Author: lhofhansl):
    Some more data:
30m rows, 2 CFs, 5 columns each, with 100 bytes values. Split into 128 regions

When all data is returned - this is limited by what the client can consume (via the network and by actually iterating over the result). All numbers in seconds:
||ClientScanner||1 thread||2 threads||5 threads||10 threads||50 threads||
|519|529|303|192|189|187|

When all is filtered with a ValueFilter on the server (as in an analytics query):
||ClientScanner||1 thread||5 threads||10 threads||50 threads||
|53.3|53.3|28.4|11.6|6.42|1.88|

                  
> A simple parallel, unordered scanner
> ------------------------------------
>
>                 Key: HBASE-9272
>                 URL: https://issues.apache.org/jira/browse/HBASE-9272
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Lars Hofhansl
>            Assignee: Lars Hofhansl
>            Priority: Minor
>         Attachments: ParallelClientScanner.java, ParallelClientScanner.java
>
>
> The contract of ClientScanner is to return rows in sort order. That limits the order in which regions can be scanned.
> I propose a simple ParallelScanner that does not have this requirement and queries regions in parallel, returning whatever gets returned first.
> This is generally useful for scans that filter a lot of data on the server, or in cases where the client can very quickly react to the returned data.
> I have a simple prototype. It doesn't do error handling right and might be a bit heavy on the synchronization side: it uses a BlockingQueue to hand data between the client using the scanner and the threads doing the scanning. It could also starve some scanners long enough for them to time out at the server.
> On the plus side, it's only 130 lines of code. :)
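
The hand-off described in the issue (scanning threads feeding a BlockingQueue that the client drains) can be sketched as follows. This is not the attached ParallelClientScanner.java and it does not use any HBase API: the "regions" are plain in-memory lists, and the class name, method name, and queue capacity are all assumptions made for illustration. Like the prototype described above, it skips proper error handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: "regions" are plain in-memory lists, not HBase regions.
final class UnorderedParallelScanSketch {

    // Scans all regions on a fixed-size pool and returns rows in arrival order.
    static List<String> scanAll(List<List<String>> regions, int threads) {
        // Bounded queue (capacity is an arbitrary choice): workers block
        // when the consumer falls behind.
        BlockingQueue<Optional<String>> queue = new LinkedBlockingQueue<>(1024);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (List<String> region : regions) {
            pool.execute(() -> {
                try {
                    for (String row : region) {
                        queue.put(Optional.of(row));
                    }
                    queue.put(Optional.empty()); // end-of-region marker
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // give up on this region
                }
            });
        }
        List<String> rows = new ArrayList<>();
        int openRegions = regions.size();
        try {
            // Drain until every region has sent its end-of-region marker.
            while (openRegions > 0) {
                Optional<String> item = queue.take();
                if (item.isPresent()) {
                    rows.add(item.get());
                } else {
                    openRegions--;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while scanning", e);
        } finally {
            pool.shutdownNow();
        }
        return rows;
    }

    public static void main(String[] args) {
        List<List<String>> regions = new ArrayList<>();
        for (int r = 0; r < 4; r++) {
            List<String> region = new ArrayList<>();
            for (int i = 0; i < 3; i++) {
                region.add("region" + r + "-row" + i);
            }
            regions.add(region);
        }
        List<String> rows = scanAll(regions, 2);
        System.out.println("rows returned: " + rows.size()); // prints "rows returned: 12"
    }
}
```

Rows from a single region still arrive in that region's order, but rows from different regions interleave freely, which is exactly the ordering guarantee being given up. The bounded queue also illustrates the starvation concern: a slow consumer leaves workers blocked in `put`, which in a real scanner could stall a region long enough to hit a server-side timeout.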

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira