Posted to dev@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2009/10/26 07:02:00 UTC

[jira] Created: (HBASE-1935) Scan in parallel

Scan in parallel
----------------

                 Key: HBASE-1935
                 URL: https://issues.apache.org/jira/browse/HBASE-1935
             Project: Hadoop HBase
          Issue Type: New Feature
            Reporter: stack


A scanner that, rather than scanning in series, scanned multiple regions in parallel would be more involved but could complete much faster, particularly if results are sparse.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-1935) Scan in parallel

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770304#action_12770304 ] 

stack commented on HBASE-1935:
------------------------------

Would be sweet if we could get parallel scanning into stock HTable.  Also, if there is cross-over with multiget and multidelete, let's make a single system.

Agree on how to handle errors during batch puts/gets.

I like the idea of supporting both in-order and out-of-order.

> Scan in parallel
> ----------------
>
>                 Key: HBASE-1935
>                 URL: https://issues.apache.org/jira/browse/HBASE-1935
>             Project: Hadoop HBase
>          Issue Type: New Feature
>            Reporter: stack
>         Attachments: pscanner.patch
>
>
> A scanner that, rather than scanning in series, scanned multiple regions in parallel would be more involved but could complete much faster, particularly if results are sparse.


[jira] Updated: (HBASE-1935) Scan in parallel

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1935:
-------------------------

    Fix Version/s:     (was: 0.20.3)
                   0.21.0

Moving to 0.21.  Let's look at merging this functionality back up into HTable rather than having it in a class of its own.  Also, consider doing this functionality in coprocessors if it makes sense.

> Scan in parallel
> ----------------
>
>                 Key: HBASE-1935
>                 URL: https://issues.apache.org/jira/browse/HBASE-1935
>             Project: Hadoop HBase
>          Issue Type: New Feature
>            Reporter: stack
>             Fix For: 0.21.0
>
>         Attachments: pscanner-v2.patch, pscanner-v3.patch, pscanner-v4.patch, pscanner.patch
>
>
> A scanner that, rather than scanning in series, scanned multiple regions in parallel would be more involved but could complete much faster, particularly if results are sparse.


[jira] Updated: (HBASE-1935) Scan in parallel

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1935:
-------------------------

    Attachment: pscanner-v2.patch

Here's a v2 that adds the ability to parallel scan when either/both the startRow and stopRow are specified.  There are also more tests included...

> Scan in parallel
> ----------------
>
>                 Key: HBASE-1935
>                 URL: https://issues.apache.org/jira/browse/HBASE-1935
>             Project: Hadoop HBase
>          Issue Type: New Feature
>            Reporter: stack
>         Attachments: pscanner-v2.patch, pscanner.patch
>
>
> A scanner that, rather than scanning in series, scanned multiple regions in parallel would be more involved but could complete much faster, particularly if results are sparse.


[jira] Updated: (HBASE-1935) Scan in parallel

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1935:
-------------------------

    Fix Version/s: 0.20.3

I'll commit this soon unless there are objections.

> Scan in parallel
> ----------------
>
>                 Key: HBASE-1935
>                 URL: https://issues.apache.org/jira/browse/HBASE-1935
>             Project: Hadoop HBase
>          Issue Type: New Feature
>            Reporter: stack
>             Fix For: 0.20.3
>
>         Attachments: pscanner-v2.patch, pscanner-v3.patch, pscanner-v4.patch, pscanner.patch
>
>
> A scanner that, rather than scanning in series, scanned multiple regions in parallel would be more involved but could complete much faster, particularly if results are sparse.


[jira] Updated: (HBASE-1935) Scan in parallel

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1935:
-------------------------

    Attachment: pscanner-v4.patch

More fixes and more tests.

> Scan in parallel
> ----------------
>
>                 Key: HBASE-1935
>                 URL: https://issues.apache.org/jira/browse/HBASE-1935
>             Project: Hadoop HBase
>          Issue Type: New Feature
>            Reporter: stack
>         Attachments: pscanner-v2.patch, pscanner-v3.patch, pscanner-v4.patch, pscanner.patch
>
>
> A scanner that, rather than scanning in series, scanned multiple regions in parallel would be more involved but could complete much faster, particularly if results are sparse.


[jira] Commented: (HBASE-1935) Scan in parallel

Posted by "Dan Washusen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771103#action_12771103 ] 

Dan Washusen commented on HBASE-1935:
-------------------------------------

re. out-of-order receipt of results

What do you see as the benefits in parallel scanning with results in order?

The 'RegionCallable' defined at line 3109 of the patch opens a scanner on a specific region server.  The same scanner is then used for all results returned from that region.  If you wanted to receive results in order, the time saved would be:
* The time taken to switch from one region to the next.  For example, while iterating over results from region 1 you could start fetching results from region 2.
* The time spent by the client iterating over the results returned in that batch before asking the server side scanner for the next batch.

re. startRow and endRow restrictions

The ParallelHTable in this patch (line 3608) falls back to a sequential scan if the scan has a startRow or endRow defined.  It should be possible to use the parallel scanner with out-of-order receipt of results if either of these values is specified.  The scanner could list all regions and, for each region, check whether its startKey and endKey fall within the scan's startRow and endRow.  If they do, scan it.
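
A minimal sketch of that region-selection check, not the code from the patch. The class RegionRangeFilter and its KeyRange type are hypothetical stand-ins for HBase's region metadata, and an empty key is assumed to mean "unbounded" on that side. The sketch tests for overlap rather than strict containment, so a region that is only partly covered by the scan range is still selected.

```java
import java.util.ArrayList;
import java.util.List;

final class RegionRangeFilter {
    /** Hypothetical stand-in for a region's key range. An empty key means "unbounded". */
    static final class KeyRange {
        final byte[] startKey; // inclusive; empty for the first region
        final byte[] endKey;   // exclusive; empty for the last region

        KeyRange(byte[] startKey, byte[] endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    /** Unsigned lexicographic byte comparison; a shorter prefix sorts first. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** True if the scan range [startRow, stopRow) overlaps the region's key range. */
    static boolean overlaps(KeyRange region, byte[] startRow, byte[] stopRow) {
        // The scan stops at or before the first key this region holds.
        boolean scanEndsBeforeRegion =
                stopRow.length > 0 && compare(stopRow, region.startKey) <= 0;
        // The scan starts at or after the region's exclusive end key.
        boolean scanStartsAfterRegion =
                region.endKey.length > 0 && compare(startRow, region.endKey) >= 0;
        return !scanEndsBeforeRegion && !scanStartsAfterRegion;
    }

    /** Keeps only the regions a parallel scanner would need to visit. */
    static List<KeyRange> regionsToScan(List<KeyRange> regions, byte[] startRow, byte[] stopRow) {
        List<KeyRange> selected = new ArrayList<>();
        for (KeyRange region : regions) {
            if (overlaps(region, startRow, stopRow)) {
                selected.add(region);
            }
        }
        return selected;
    }
}
```

With this formulation a table with a single region (both keys empty) is selected for any scan range, without a special case.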

I'm probably stating the obvious with both those points but I'm new to HBase so you'll have to forgive me. :)

Cheers,
Dan


> Scan in parallel
> ----------------
>
>                 Key: HBASE-1935
>                 URL: https://issues.apache.org/jira/browse/HBASE-1935
>             Project: Hadoop HBase
>          Issue Type: New Feature
>            Reporter: stack
>         Attachments: pscanner.patch
>
>
> A scanner that, rather than scanning in series, scanned multiple regions in parallel would be more involved but could complete much faster, particularly if results are sparse.


[jira] Commented: (HBASE-1935) Scan in parallel

Posted by "Dan Washusen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783985#action_12783985 ] 

Dan Washusen commented on HBASE-1935:
-------------------------------------

There is a minor bug in v2 of the patch.  The logic in the ParallelScannerManager to determine if a scan is interested in a region doesn't handle the case when there is only one region.  

The following fixes it:
{code}
Set<HRegionInfo> regions = table.getRegionsInfo().keySet();
for (HRegionInfo region : regions) {
  ...
  boolean isScanInterestedInRegion = (scan.getStartRow().length == 0 && scan.getStopRow().length == 0) || regions.size() == 1;
{code}

          

> Scan in parallel
> ----------------
>
>                 Key: HBASE-1935
>                 URL: https://issues.apache.org/jira/browse/HBASE-1935
>             Project: Hadoop HBase
>          Issue Type: New Feature
>            Reporter: stack
>         Attachments: pscanner-v2.patch, pscanner.patch
>
>
> A scanner that, rather than scanning in series, scanned multiple regions in parallel would be more involved but could complete much faster, particularly if results are sparse.


[jira] Updated: (HBASE-1935) Scan in parallel

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1935:
-------------------------

    Attachment: pscanner-v3.patch

v3 of parallel scanner.  Includes Dan's fix suggested above.  It also fixes issues with the logic that determines if a scan is interested in a region.

> Scan in parallel
> ----------------
>
>                 Key: HBASE-1935
>                 URL: https://issues.apache.org/jira/browse/HBASE-1935
>             Project: Hadoop HBase
>          Issue Type: New Feature
>            Reporter: stack
>         Attachments: pscanner-v2.patch, pscanner-v3.patch, pscanner.patch
>
>
> A scanner that, rather than scanning in series, scanned multiple regions in parallel would be more involved but could complete much faster, particularly if results are sparse.


[jira] Commented: (HBASE-1935) Scan in parallel

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770087#action_12770087 ] 

Jonathan Gray commented on HBASE-1935:
--------------------------------------

Went over the patch twice.  Looks pretty good.

There is some cross-over with work done in Multi operations (MultiGet, MultiDelete, etc.).  I think the first thing to decide is whether we want to create some unified threading system or take passed-in ExecutorServices as is done with the patch.  And do we need a special ParallelHTable, or should the normal HTable support threading?  I believe the latter.

At either the HCM or HTable level, I think we should have a local, bounded ExecutorService pool.  You would be able to modify its size through the constructor, but default settings would come from something in the conf like hbase.client.threads.
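
A rough sketch of such a pool, under stated assumptions: ClientThreadPool is a hypothetical class, java.util.Properties stands in for the HBase configuration object, hbase.client.threads is only the key name floated above (not a confirmed setting), and the default of 10 is arbitrary. "Bounded" here means the thread count is capped; the work queue is left unbounded for simplicity.

```java
import java.util.Properties;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class ClientThreadPool {
    // Key name taken from the suggestion above; an assumption, not an existing setting.
    static final String THREADS_KEY = "hbase.client.threads";
    // Arbitrary default chosen for this sketch.
    static final int DEFAULT_THREADS = 10;

    /** Default path: the pool size comes from configuration. */
    static ThreadPoolExecutor fromConf(Properties conf) {
        String configured = conf.getProperty(THREADS_KEY, String.valueOf(DEFAULT_THREADS));
        return withSize(Integer.parseInt(configured));
    }

    /** Override path: the caller passes the size explicitly, e.g. via a constructor argument. */
    static ThreadPoolExecutor withSize(int size) {
        if (size < 1) {
            throw new IllegalArgumentException("pool size must be at least 1: " + size);
        }
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                size, size,                 // never more than 'size' threads
                60L, TimeUnit.SECONDS,      // idle threads are released after a minute
                new LinkedBlockingQueue<Runnable>());
        pool.allowCoreThreadTimeOut(true);  // lets an idle client shrink to zero threads
        return pool;
    }
}
```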

One thing I do like (at least for early versions of threaded clients) is just failing immediately when encountering a problem like a split.  Properly handling this is one of the hardest parts about this (and other things like stateful filters), and retries are tricky and imperfect.  With batched/parallel reads (get or scan) we should just fail-fast and throw exceptions to let the client deal.  With batched/parallel writes (put or delete) we should process what we can and return back to the client what was not completed.
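
The two error policies could look roughly like this. BatchPolicy and both method names are hypothetical, and the operations run sequentially here to keep the sketch short; a threaded client would apply the same rules to the results of its worker tasks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.function.Consumer;

final class BatchPolicy {
    /** Reads: fail fast. The first exception propagates and no partial results are returned. */
    static <T> List<T> readAllOrFail(List<Callable<T>> reads) throws Exception {
        List<T> results = new ArrayList<>();
        for (Callable<T> read : reads) {
            results.add(read.call()); // any failure aborts the whole batch
        }
        return results;
    }

    /** Writes: apply what we can and hand back the writes that were not completed. */
    static <W> List<W> writeBestEffort(List<W> writes, Consumer<W> sink) {
        List<W> notCompleted = new ArrayList<>();
        for (W write : writes) {
            try {
                sink.accept(write);
            } catch (RuntimeException e) {
                notCompleted.add(write); // the caller decides whether to retry
            }
        }
        return notCompleted;
    }
}
```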

Another thing I'm a little confused about... this seems to be designed for completely out-of-order receipt of results.  Rather than aggregating up a list of Futures and then waiting for them to complete in order, this uses an ExecutorCompletionService, which returns things as they finish.  I can see that in certain use cases this would make sense, but it is a bit more limited.  However, I don't see why we can't support both, using two different task completion-waiting paths and with very small changes to the constructor APIs.
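
For illustration, the two completion-waiting paths can be sketched with plain java.util.concurrent. ResultOrdering and its method names are hypothetical and are not taken from the patch; each Callable stands in for a per-region scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

final class ResultOrdering {
    /** In-order path: submit everything, then wait on the Futures in submission order. */
    static <T> List<T> inOrder(ExecutorService pool, List<Callable<T>> tasks)
            throws InterruptedException, ExecutionException {
        List<Future<T>> futures = new ArrayList<>();
        for (Callable<T> task : tasks) {
            futures.add(pool.submit(task));
        }
        List<T> results = new ArrayList<>();
        for (Future<T> future : futures) {
            results.add(future.get()); // blocks until this particular task is done
        }
        return results;
    }

    /** Out-of-order path: hand back each result as soon as its task finishes. */
    static <T> List<T> asCompleted(ExecutorService pool, List<Callable<T>> tasks)
            throws InterruptedException, ExecutionException {
        CompletionService<T> completion = new ExecutorCompletionService<>(pool);
        for (Callable<T> task : tasks) {
            completion.submit(task);
        }
        List<T> results = new ArrayList<>();
        for (int i = 0; i < tasks.size(); i++) {
            results.add(completion.take().get()); // take() returns whichever task finished next
        }
        return results;
    }
}
```

Both paths run the tasks concurrently; they differ only in the order in which the caller sees the results, which is why a constructor flag choosing between them would be a small API change.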

> Scan in parallel
> ----------------
>
>                 Key: HBASE-1935
>                 URL: https://issues.apache.org/jira/browse/HBASE-1935
>             Project: Hadoop HBase
>          Issue Type: New Feature
>            Reporter: stack
>         Attachments: pscanner.patch
>
>
> A scanner that, rather than scanning in series, scanned multiple regions in parallel would be more involved but could complete much faster, particularly if results are sparse.


[jira] Updated: (HBASE-1935) Scan in parallel

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1935:
-------------------------

    Attachment: pscanner.patch

Here is a first attempt.  Maybe someone would like to take it on?  It runs scanners against multiple regions concurrently and then aggregates the results.  Includes a unit test, but it needs conversion to the new-style client-side test (only two of the tests in the unit test are for parallel scanning).

> Scan in parallel
> ----------------
>
>                 Key: HBASE-1935
>                 URL: https://issues.apache.org/jira/browse/HBASE-1935
>             Project: Hadoop HBase
>          Issue Type: New Feature
>            Reporter: stack
>         Attachments: pscanner.patch
>
>
> A scanner that, rather than scanning in series, scanned multiple regions in parallel would be more involved but could complete much faster, particularly if results are sparse.
