Posted to issues@hbase.apache.org by "Andrew Kyle Purtell (Jira)" <ji...@apache.org> on 2020/08/24 21:42:00 UTC
[jira] [Comment Edited] (HBASE-24859) Remove the empty regions from the hbase mapreduce splits

    [ https://issues.apache.org/jira/browse/HBASE-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183608#comment-17183608 ] 

Andrew Kyle Purtell edited comment on HBASE-24859 at 8/24/20, 9:41 PM:
-----------------------------------------------------------------------

A concern I have about this proposal: we can exclude regions that are "empty" when calculating mapreduce splits, but how do we know they are still empty when the MR tasks are finally launched? The short answer is that we cannot. This is akin to a check-then-act race in multithreaded programming: you check a condition, take a lock, and then do something. Unless you check again after taking the lock, you can't know whether something invalidated the condition between your check and the lock acquisition. However, unless I am missing something, the MR framework offers no way to check again when launching tasks once you have excluded a region or regions from the input split list.
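
The locking analogy above can be sketched in plain Java. This is only an illustration under stated assumptions: the class CheckThenActDemo and its methods are hypothetical and are not HBase or MapReduce code. The Runnable parameter stands in for whatever happens in the window between the check and the lock acquisition.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of check-then-act; not HBase code.
class CheckThenActDemo {
    private final ReentrantLock lock = new ReentrantLock();
    private volatile boolean empty = true; // the condition being checked

    // A writer that invalidates the condition.
    void write() {
        lock.lock();
        try {
            empty = false;
        } finally {
            lock.unlock();
        }
    }

    // Trusts the result of a check made before the lock was taken.
    boolean skipWithoutRecheck(Runnable betweenCheckAndLock) {
        boolean sawEmpty = empty;      // check
        betweenCheckAndLock.run();     // window in which a writer may run
        lock.lock();
        try {
            return sawEmpty;           // act on a possibly stale answer
        } finally {
            lock.unlock();
        }
    }

    // Checks again after taking the lock, so a stale answer is caught.
    boolean skipWithRecheck(Runnable betweenCheckAndLock) {
        boolean sawEmpty = empty;      // first check
        betweenCheckAndLock.run();     // same window as above
        lock.lock();
        try {
            return sawEmpty && empty;  // second check under the lock
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        CheckThenActDemo a = new CheckThenActDemo();
        System.out.println("skip without recheck: " + a.skipWithoutRecheck(a::write));
        CheckThenActDemo b = new CheckThenActDemo();
        System.out.println("skip with recheck:    " + b.skipWithRecheck(b::write));
    }
}
```

When a write lands in the window, only the variant that re-checks under the lock notices that the condition no longer holds. Split calculation has no equivalent of that second check.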

In HBase the unit of atomicity is the row. Users don't expect snapshot isolation unless they explicitly arrange for it with timerange parameters or similar. If no timerange parameters are given to a job, the expectation is that the latest data committed to a row is visible as soon as the operation returns a success result to the client, and that concurrent scan activity will pick it up. So skipping "empty" regions that are no longer empty by the time tasks would have run against them can confound expectations and look like a failure to make committed data visible.
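
To make the visibility concern concrete, here is a small self-contained simulation. It is a sketch under assumptions, not the actual TableInputFormat logic: the "table" is just an in-memory map from a made-up region name to its rows, and planSplits, runTasks, and simulate are hypothetical helpers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation; region names and helpers are invented for illustration.
class StaleSplitPlanDemo {
    // Planning step: one split per region, optionally skipping empty regions.
    static List<String> planSplits(Map<String, List<String>> regions, boolean skipEmpty) {
        List<String> splits = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : regions.entrySet()) {
            if (skipEmpty && e.getValue().isEmpty()) {
                continue; // region looks empty at planning time
            }
            splits.add(e.getKey());
        }
        return splits;
    }

    // Execution step: each task scans only the region named in its split.
    static List<String> runTasks(Map<String, List<String>> regions, List<String> splits) {
        List<String> seen = new ArrayList<>();
        for (String region : splits) {
            seen.addAll(regions.get(region));
        }
        return seen;
    }

    // A write is committed after planning but before the tasks run.
    static List<String> simulate(boolean skipEmpty) {
        Map<String, List<String>> regions = new LinkedHashMap<>();
        List<String> regionA = new ArrayList<>();
        regionA.add("row1");
        regions.put("region-a", regionA);
        regions.put("region-b", new ArrayList<>()); // empty at planning time

        List<String> splits = planSplits(regions, skipEmpty);
        regions.get("region-b").add("row2");        // committed write
        return runTasks(regions, splits);
    }

    public static void main(String[] args) {
        System.out.println("skipping empty regions: " + simulate(true));
        System.out.println("keeping all regions:    " + simulate(false));
    }
}
```

In this simulation the job that skipped the empty region never sees the row committed after planning, while the job that kept every region does.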



> Remove the empty regions from the hbase mapreduce splits
> --------------------------------------------------------
>
>                 Key: HBASE-24859
>                 URL: https://issues.apache.org/jira/browse/HBASE-24859
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Sandeep Pal
>            Assignee: Sandeep Pal
>            Priority: Major
>
> It has been observed that when a table has too many regions, MR jobs consume more memory in the client. This is because we keep region-level information in memory, and the memory-heavy object is TableSplit, because it carries a Scan object.
> We can reduce memory consumption by not loading the region-level information for regions that are empty.


