Posted to issues@hbase.apache.org by "Lars Hofhansl (JIRA)" <ji...@apache.org> on 2013/09/07 00:32:52 UTC

[jira] [Updated] (HBASE-9272) A simple parallel, unordered scanner

     [ https://issues.apache.org/jira/browse/HBASE-9272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-9272:
---------------------------------

    Attachment: 9272-0.94.txt

So here's a sample patch against 0.94. It does the following:
# An API to parallelize a single Scan.
# Round robin across RegionServers
# Builds its own task queue in order not to rely on a specifically configured thread pool (i.e. the HTable's pool can be used)
# Explores ways of automated scaling. The parallelism is controlled by a scaling factor that takes into account the number of region servers touched by the scan.
# An alternate API where the caller can pass in a set of Splits (in form of Scans) and then those are executed on the pool
# Limits all thread synchronization to a BlockingQueue, which (in theory) allows the reader and the writer to lock independently.
# To avoid other synchronization, marker objects are passed to indicate when a thread is done or has encountered an exception.
# Also hooked this up with HTable (which is the only questionable part of this, IMHO, since it changes HTableInterface and could break client applications that directly implement HTableInterface). This part is not strictly needed; ParallelClientScanner can be used on its own.
# Pushes a bit more common code into AbstractClientScanner.
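
To make the round-robin, BlockingQueue, and marker-object items above concrete, here is a minimal sketch in plain Java (8 or later). It is not the code from the attached patch: the class name ParallelScanSketch, the methods roundRobin and scanAll, and the queue capacity are hypothetical, and each "split" is just a list of strings standing in for the rows of one region's scan. It only assumes that splits are known per region server and that the caller supplies the thread pool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch; not the ParallelClientScanner from the patch.
class ParallelScanSketch {
    // Marker a worker enqueues once its split is exhausted.
    private static final Object DONE = new Object();

    // Marker wrapping an exception thrown inside a worker.
    private static final class Failure {
        final Throwable cause;
        Failure(Throwable cause) { this.cause = cause; }
    }

    // Interleaves the per-server split lists: one split per server per pass.
    static <T> List<T> roundRobin(Map<String, List<T>> splitsByServer) {
        List<Iterator<T>> iterators = new ArrayList<>();
        for (List<T> splits : splitsByServer.values()) {
            iterators.add(splits.iterator());
        }
        List<T> ordered = new ArrayList<>();
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            for (Iterator<T> it : iterators) {
                if (it.hasNext()) {
                    ordered.add(it.next());
                    progressed = true;
                }
            }
        }
        return ordered;
    }

    // Runs every split on the pool and collects rows in arrival order.
    static List<String> scanAll(List<List<String>> splits, ExecutorService pool)
            throws InterruptedException {
        // The queue is the only shared state between workers and consumer.
        final BlockingQueue<Object> queue = new LinkedBlockingQueue<>(1024);
        for (final List<String> split : splits) {
            pool.execute(() -> {
                try {
                    for (String row : split) {
                        queue.put(row);
                    }
                    queue.put(DONE);
                } catch (Throwable t) {
                    try {
                        queue.put(new Failure(t));
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
        List<String> rows = new ArrayList<>();
        int running = splits.size();
        while (running > 0) {
            Object next = queue.take();
            if (next == DONE) {
                running--; // one worker finished its split
            } else if (next instanceof Failure) {
                // Remaining workers are not cancelled here; the caller owns the pool.
                throw new RuntimeException("split failed", ((Failure) next).cause);
            } else {
                rows.add((String) next);
            }
        }
        return rows;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, List<List<String>>> byServer = new LinkedHashMap<>();
        byServer.put("rs1", Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c")));
        byServer.put("rs2", Arrays.asList(Arrays.asList("x", "y")));
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<String> rows = scanAll(roundRobin(byServer), pool);
            Collections.sort(rows); // arrival order is nondeterministic
            System.out.println(rows);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The consumer never touches worker state directly: it counts DONE markers to know when all splits are finished and rethrows on the first Failure marker. A real scanner would also have to cancel or drain the remaining workers after a failure, which this sketch leaves to the caller.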

Please let me know what you think. If the direction is good, I'll add tests and make a trunk patch.
                
> A simple parallel, unordered scanner
> ------------------------------------
>
>                 Key: HBASE-9272
>                 URL: https://issues.apache.org/jira/browse/HBASE-9272
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Lars Hofhansl
>            Assignee: Lars Hofhansl
>            Priority: Minor
>         Attachments: 9272-0.94.txt, ParallelClientScanner.java, ParallelClientScanner.java
>
>
> The contract of ClientScanner is to return rows in sort order. That limits the order in which regions can be scanned.
> I propose a simple ParallelScanner that does not have this requirement and queries regions in parallel, returning whatever gets returned first.
> This is generally useful for scans that filter a lot of data on the server, or in cases where the client can very quickly react to the returned data.
> I have a simple prototype (it doesn't do error handling right, and might be a bit heavy on the synchronization side: it uses a BlockingQueue to hand data between the client using the scanner and the threads doing the scanning. It could also potentially starve some scanners long enough to time out at the server).
> On the plus side, it's only 130 lines of code. :)
