
Posted to dev@hbase.apache.org by "Lars George (JIRA)" <ji...@apache.org> on 2009/09/11 18:26:58 UTC

[jira] Created: (HBASE-1829) Make use of start/stop row in TableInputFormat

Make use of start/stop row in TableInputFormat
----------------------------------------------

                 Key: HBASE-1829
                 URL: https://issues.apache.org/jira/browse/HBASE-1829
             Project: Hadoop HBase
          Issue Type: Improvement
          Components: mapred
    Affects Versions: 0.20.0
            Reporter: Lars George
            Assignee: Lars George
            Priority: Minor
             Fix For: 0.20.1


Since we can now specify a start and stop row with the Scan that is handed to the TableInputFormat (TIF), we can reduce the splits to the regions that contain these rows. That allows testing large MR jobs on a single region, for example.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "Lars George (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774971#action_12774971 ] 

Lars George commented on HBASE-1829:
------------------------------------

Oh, the 0.20 branch has an old JUnit lib. Hrmm, then we can use the patch as is: the test in trunk covers it, and since these versions are quite close to each other I trust it works as expected. So +1 from me for branch.

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>             Fix For: 0.20.2, 0.21.0
>
>         Attachments: HBASE-1829-v2.patch, HBASE-1829-v3.patch, HBASE-1829-v4.patch, HBASE-1829.patch, HBaseTestingUtility.java, TestTableInputFormatScan.java
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Commented: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756271#action_12756271 ] 

stack commented on HBASE-1829:
------------------------------

Lars, tell us more about this patch... what it does.  It looks like a nice change in that if you pass a start/stop row to a Scan, only the regions that contain those start/stop rows will have splits made for them.

It looks too like you are cleaning up some weird crap; i.e.:

{code}
-    int realNumSplits = startKeys.length;
-    InputSplit[] splits = new InputSplit[realNumSplits];
-    int middle = startKeys.length / realNumSplits;
{code}


Is this right?

{code}
+      if (kvc.compare(startRow, keys.getSecond()[i]) <= 0 &&
+          kvc.compare(stopRow, keys.getFirst()[i]) >= 0) { 
{code}

Regions do not include their end-key (exclusive).
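
On the end-key question, one way to reason about the selection is as an overlap test between two half-open ranges: the scan covers [startRow, stopRow) and a region covers [startKey, endKey), with an empty key meaning "unbounded" on that side. The following is only a sketch in plain Java with no HBase dependency. The class and method names are hypothetical, and this is not the condition from any of the attached patches.

```java
// Sketch only: plain-Java stand-in for the region filter discussed in this thread.
// RegionFilterSketch, compare() and overlaps() are hypothetical names, not HBase API.
final class RegionFilterSketch {

  // Lexicographic comparison of unsigned bytes (the ordering used for row keys).
  static int compare(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) return d;
    }
    return a.length - b.length;
  }

  // Scan covers [startRow, stopRow); region covers [regionStart, regionEnd).
  // An empty array means "no bound" on that side.
  static boolean overlaps(byte[] startRow, byte[] stopRow,
                          byte[] regionStart, byte[] regionEnd) {
    // Strict "<": a scan starting exactly at the region's end key misses the region.
    boolean startsBeforeRegionEnds =
        regionEnd.length == 0 || compare(startRow, regionEnd) < 0;
    // Strict ">": a scan stopping exactly at the region's start key misses it too.
    boolean stopsAfterRegionStarts =
        stopRow.length == 0 || compare(stopRow, regionStart) > 0;
    return startsBeforeRegionEnds && stopsAfterRegionStarts;
  }
}
```

With both comparisons strict, a scan whose start row equals a region's end key does not select that region, which matches the exclusive end key noted above.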

It's hard to test this but I gave it a go.  Seems like it hasn't broken anything (smile).

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>            Priority: Minor
>             Fix For: 0.20.1
>
>         Attachments: HBASE-1829.patch
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Commented: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "Lars George (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775987#action_12775987 ] 

Lars George commented on HBASE-1829:
------------------------------------

The v5 patch is on top of the v4 patch; in other words, v5 only has the diff after v4 is applied. Does that make sense?

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>             Fix For: 0.20.2, 0.21.0
>
>         Attachments: HBASE-1829-v2.patch, HBASE-1829-v3.patch, HBASE-1829-v4.patch, HBASE-1829-v5.patch, HBASE-1829.patch, HBaseTestingUtility.java, TestTableInputFormatScan.java
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Commented: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774841#action_12774841 ] 

stack commented on HBASE-1829:
------------------------------

I applied it to TRUNK.  Shall I apply all but the tests to 0.20 (the tests won't work in the 0.20 branch)?  I think it should be fine.  The default behavior is unchanged.  It's only if you provide start/stop rows that behavior changes.

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>             Fix For: 0.20.2, 0.21.0
>
>         Attachments: HBASE-1829-v2.patch, HBASE-1829-v3.patch, HBASE-1829-v4.patch, HBASE-1829.patch, HBaseTestingUtility.java, TestTableInputFormatScan.java
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Updated: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "Lars George (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars George updated HBASE-1829:
-------------------------------

    Attachment: HBASE-1829-v5.patch

Patch HBASE-1829-v5 only slightly changes the Test class to test for proper last rows seen. Same outcome (test succeeds) but more correct.

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>             Fix For: 0.20.2, 0.21.0
>
>         Attachments: HBASE-1829-v2.patch, HBASE-1829-v3.patch, HBASE-1829-v4.patch, HBASE-1829-v5.patch, HBASE-1829.patch, HBaseTestingUtility.java, TestTableInputFormatScan.java
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Commented: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "Lars George (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764317#action_12764317 ] 

Lars George commented on HBASE-1829:
------------------------------------

Hey Ken,

I assumed that it will only really send in the rows between start row (inclusive) and stop row (exclusive) because the TIF uses the Scan instance to scan the actual table and setting these two values should enforce the boundaries.

I was travelling the last few days and did not get much done. I was in the process of adding a unit test for the change that should show that it selects the right regions as well as enforce the start/stop row boundaries. I will see that I get that done asap. If it passes it is all ready to go. 

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HBASE-1829-v2.patch, HBASE-1829.patch
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Updated: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1829:
-------------------------

         Priority: Major  (was: Minor)
    Fix Version/s: 0.20.2

Upped priority.  This is a nice feature.  Would be good to get it into 0.20.2 even (Marked it for there for now).

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>             Fix For: 0.20.2, 0.21.0
>
>         Attachments: HBASE-1829-v2.patch, HBASE-1829.patch
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Updated: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "Lars George (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars George updated HBASE-1829:
-------------------------------

    Attachment: HBASE-1829.patch

HBASE-1829.patch implements the split filtering by start and stop row. Not sure yet about the stopRow being empty in a compare. I will test this later and fix accordingly.

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>            Priority: Minor
>             Fix For: 0.20.1
>
>         Attachments: HBASE-1829.patch
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Commented: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774700#action_12774700 ] 

stack commented on HBASE-1829:
------------------------------

Patch looks great Lars.  It looks like you forgot to add TestTableInputFormatScan.java to v3 of your patch.  Is that right?  If so, I can just apply v3 and add TestTableInputFormatScan.java?

What do you think we should do for 0.20?  Just not add it to the branch?  Do the tests depend on the new stuff?

Thanks.

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>             Fix For: 0.20.2, 0.21.0
>
>         Attachments: HBASE-1829-v2.patch, HBASE-1829-v3.patch, HBASE-1829.patch, HBaseTestingUtility.java, TestTableInputFormatScan.java
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Updated: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1829:
-------------------------

    Fix Version/s:     (was: 0.20.1)
                   0.21.0

@Lars good stuff.. moving it out

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HBASE-1829.patch
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Commented: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775090#action_12775090 ] 

stack commented on HBASE-1829:
------------------------------

Applied to branch too... w/o tests.

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>             Fix For: 0.20.2, 0.21.0
>
>         Attachments: HBASE-1829-v2.patch, HBASE-1829-v3.patch, HBASE-1829-v4.patch, HBASE-1829.patch, HBaseTestingUtility.java, TestTableInputFormatScan.java
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Updated: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "Lars George (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars George updated HBASE-1829:
-------------------------------

    Attachment: HBASE-1829-v4.patch

Sorry Michael, forgot to do a "svn add" for the new class :( 

HBASE-1829-v4 patch has new class included in addition to v3 patch content. Sorry for the confusion.

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>             Fix For: 0.20.2, 0.21.0
>
>         Attachments: HBASE-1829-v2.patch, HBASE-1829-v3.patch, HBASE-1829-v4.patch, HBASE-1829.patch, HBaseTestingUtility.java, TestTableInputFormatScan.java
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Updated: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "Lars George (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars George updated HBASE-1829:
-------------------------------

    Attachment: TestTableInputFormatScan.java
                HBaseTestingUtility.java

I have updated the test to JUnit4 and added a few helpful functions to the test utility class, like starting a MapReduce cluster. What I have trouble with now is adding the multi-region code. I have added a createMultiRegion(...) method, but the previous version relied on some intrinsic knowledge of how to set up the regions and fill them with rows in between. Since I am not sure whether I have to copy all of that, I stopped short and added a few "FIXME" comments in the utility class where I am not sure what to do without possibly creating a mess.

Could someone take my two classes and let me know what I have to do? After that I will add the appropriate test cases covering all region combinations. I have two already and they complete successfully - but since the table has only one region it makes no sense yet.

Check the @BeforeClass to see how I intended to use them. Maybe that also needs slight adjustments?

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>             Fix For: 0.20.2, 0.21.0
>
>         Attachments: HBASE-1829-v2.patch, HBASE-1829.patch, HBaseTestingUtility.java, TestTableInputFormatScan.java
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Updated: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "Lars George (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars George updated HBASE-1829:
-------------------------------

    Attachment: HBASE-1829-v2.patch

HBASE-1829-v2.patch fixes the ArrayIndexOutOfBoundsException Stack reported. The comparison cannot be done with the KeyValue comparators; it has to use Bytes.compareTo() directly. Also fixed issues with the start and stop row: Scan by default has a value assigned to both and does not return "null". Comparing to HConstants.EMPTY_START_ROW, for example, did not work, so I chose checking the length instead.
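
The comment does not say how the comparison against the empty-row constant was written, so the following is only one possible explanation, shown in plain Java: two empty byte arrays are different objects, so `==` and `equals()` both report them as unequal, while a length check does not care which instance it sees. `EmptyRowCheckSketch`, `EMPTY_ROW` and `isUnset` are hypothetical names used for illustration; `EMPTY_ROW` merely stands in for a shared constant such as HConstants.EMPTY_START_ROW.

```java
// Sketch only: why an "is this row unset?" check is safer on length than on identity.
final class EmptyRowCheckSketch {

  // Hypothetical stand-in for a shared empty-row constant.
  static final byte[] EMPTY_ROW = new byte[0];

  // Treat any zero-length row key as "not set", regardless of which array it is.
  static boolean isUnset(byte[] row) {
    return row.length == 0;
  }

  public static void main(String[] args) {
    byte[] fromScan = new byte[0]; // a distinct but equally empty array
    System.out.println(fromScan == EMPTY_ROW);                        // false: different objects
    System.out.println(fromScan.equals(EMPTY_ROW));                   // false: arrays use Object.equals
    System.out.println(java.util.Arrays.equals(fromScan, EMPTY_ROW)); // true: compares contents
    System.out.println(isUnset(fromScan));                            // true
  }
}
```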

Tested with a real data set and it works for subrange and full scan. Will still add a unit test. Just thought I'd put the patch up here for progress tracking.

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HBASE-1829-v2.patch, HBASE-1829.patch
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Commented: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "Lars George (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764318#action_12764318 ] 

Lars George commented on HBASE-1829:
------------------------------------

No, you are right Ken, in createRecordReader() it does:

{code}
Scan sc = new Scan(this.scan);
sc.setStartRow(tSplit.getStartRow());
sc.setStopRow(tSplit.getEndRow());
trr.setScan(sc);
trr.setHTable(table);
trr.init();
{code}

which sets the boundaries to the current split, while not honoring the set start/stop row. I will have to add another row key comparison to set it to the appropriate keys. I think it should be enough to check like this:

{code}
if (scan.getStartRow().length == 0) sc.setStartRow(tSplit.getStartRow());
if (scan.getStopRow().length == 0) sc.setStopRow(tSplit.getEndRow());
{code}

Right? I will check this when doing the unit tests.
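
One thing worth checking in those tests: as written, the proposed check appears to leave the scan's own start/stop row in place for every split whenever it is set, even when that row lies outside the split. An alternative is to intersect the two ranges, taking the later of the two start keys and the earlier of the two stop keys. The following is a sketch of that idea in plain Java, not the code from any attached patch; `SplitClampSketch`, `laterStart` and `earlierStop` are hypothetical names, and an empty array is assumed to mean "unbounded".

```java
// Sketch only: clamp a split's [start, end) to the scan's [startRow, stopRow).
final class SplitClampSketch {

  // Lexicographic comparison of unsigned bytes (the ordering used for row keys).
  static int compare(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) return d;
    }
    return a.length - b.length;
  }

  // Later of the two start keys; an empty key is unbounded, so the other one wins.
  static byte[] laterStart(byte[] scanStart, byte[] splitStart) {
    if (scanStart.length == 0) return splitStart;
    if (splitStart.length == 0) return scanStart;
    return compare(scanStart, splitStart) >= 0 ? scanStart : splitStart;
  }

  // Earlier of the two stop keys; an empty key is unbounded, so the other one wins.
  static byte[] earlierStop(byte[] scanStop, byte[] splitEnd) {
    if (scanStop.length == 0) return splitEnd;
    if (splitEnd.length == 0) return scanStop;
    return compare(scanStop, splitEnd) <= 0 ? scanStop : splitEnd;
  }
}
```

Under this scheme only the first and last selected splits are narrowed by the scan's rows; splits in the middle keep their region boundaries.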

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HBASE-1829-v2.patch, HBASE-1829.patch
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Updated: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "Lars George (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars George updated HBASE-1829:
-------------------------------

    Attachment: HBASE-1829-v3.patch

OK, after pretty much two days of getting up to speed with how the region mechanism works, I had to add a hook to flush the region cache in HConnection. With that I was able to recreate the old MultiRegion functionality in the new HBaseTestingUtility. 

I have added 11 subtests that cover all combinations of empty or not empty start and stop rows as well as single region to spanning many regions scans. All succeed, but please someone review the big "if" statement in TableInputFormatBase.getSplits(). I want to make sure I have that right. The tests say yes, but a second pair of eyes is appreciated.


> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>             Fix For: 0.20.2, 0.21.0
>
>         Attachments: HBASE-1829-v2.patch, HBASE-1829-v3.patch, HBASE-1829.patch, HBaseTestingUtility.java, TestTableInputFormatScan.java
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Updated: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "Lars George (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars George updated HBASE-1829:
-------------------------------

    Status: Patch Available  (was: Open)

HBASE-1829-v3 adds code and tests.

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>             Fix For: 0.20.2, 0.21.0
>
>         Attachments: HBASE-1829-v2.patch, HBASE-1829-v3.patch, HBASE-1829.patch, HBaseTestingUtility.java, TestTableInputFormatScan.java
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Commented: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "Lars George (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756385#action_12756385 ] 

Lars George commented on HBASE-1829:
------------------------------------

You are right Michael, it cleans up some remnants from when we could have different numbers of splits. It also attempts to reduce the split count to the number of regions that include start and stop row. The idea with the comparison is to find the start key of the region just below the start row and the end key of the region just after the stop row. 

I am not sure about the default empty end row, nor about the comparison in terms of equal versus greater-or-equal etc. I just thought I'd get the patch up as an idea I had, but it is not yet tested. I will test it early next week and sort out the issues.

Question: is there a testbed that allows having, say, 3-4 regions so that I can construct various test cases (like start/stop row both in first/last region, spanning all regions, crossing only two regions etc.)? I am not too familiar with the test classes and I know you guys are changing things around. What would be a good sample to start with?

Otherwise I will test it on my live cluster that has more than enough to test with. But a unit test seems like a good idea.
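
To make the idea above concrete (finding the region that sits at or just below a given row), here is a small sketch in plain Java over a hypothetical four-region layout like the one asked about. It assumes the region start keys are sorted ascending and that the first region has an empty start key; `RegionLookupSketch` and `regionIndexFor` are hypothetical names, not part of HBase or of the patch.

```java
// Sketch only: locate the region whose start key is the greatest one <= row.
final class RegionLookupSketch {

  // Lexicographic comparison of unsigned bytes (the ordering used for row keys).
  static int compare(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) return d;
    }
    return a.length - b.length;
  }

  // startKeys must be sorted ascending, with startKeys[0] empty (start of table).
  static int regionIndexFor(byte[][] startKeys, byte[] row) {
    int idx = 0;
    for (int i = 0; i < startKeys.length; i++) {
      if (compare(startKeys[i], row) <= 0) {
        idx = i;   // this region starts at or before the row
      } else {
        break;     // keys are sorted, so no later region can contain the row
      }
    }
    return idx;
  }

  public static void main(String[] args) {
    // Hypothetical layout: regions [,"g"), ["g","n"), ["n","t"), ["t",)
    byte[][] startKeys = { {}, {'g'}, {'n'}, {'t'} };
    System.out.println(regionIndexFor(startKeys, new byte[] {'m'})); // 1
  }
}
```

A linear scan is used for clarity; with many regions a binary search over the sorted start keys would do the same job.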

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>            Priority: Minor
>             Fix For: 0.20.1
>
>         Attachments: HBASE-1829.patch
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Commented: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "Ken Weiner (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764260#action_12764260 ] 

Ken Weiner commented on HBASE-1829:
-----------------------------------

Does this patch guarantee that the only rows passed to the mapper are the ones that fall between the start and stop rows specified in the Scan?  Or will I get rows that fall before the start row within the first split and some rows that fall after the end row within the last split?

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HBASE-1829-v2.patch, HBASE-1829.patch
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Commented: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756273#action_12756273 ] 

stack commented on HBASE-1829:
------------------------------

Or, rather, I got this when I tried the patch (RowCounter on a table loaded up with PE):

{code}
09/09/16 15:56:23 INFO zookeeper.ClientCnxn: Server connection successful
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
	at org.apache.hadoop.hbase.KeyValue$KeyComparator.compare(KeyValue.java:1722)
	at org.apache.hadoop.hbase.KeyValue$KeyComparator.compare(KeyValue.java:1758)
	at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:296)
	at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
	at org.apache.hadoop.hbase.mapreduce.RowCounter.main(RowCounter.java:125)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.hadoop.hbase.mapreduce.Driver.main(Driver.java:41)

{code}

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>            Priority: Minor
>             Fix For: 0.20.1
>
>         Attachments: HBASE-1829.patch
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Commented: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776078#action_12776078 ] 

stack commented on HBASE-1829:
------------------------------

Applied v5.

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>             Fix For: 0.20.2, 0.21.0
>
>         Attachments: HBASE-1829-v2.patch, HBASE-1829-v3.patch, HBASE-1829-v4.patch, HBASE-1829-v5.patch, HBASE-1829.patch, HBaseTestingUtility.java, TestTableInputFormatScan.java
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Commented: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756651#action_12756651 ] 

stack commented on HBASE-1829:
------------------------------

Lars: I'd judge this a nice-to-have rather than a requirement for 0.20.1.  Should we move it out?

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>            Priority: Minor
>             Fix For: 0.20.1
>
>         Attachments: HBASE-1829.patch
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Updated: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1829:
-------------------------

      Resolution: Fixed
    Release Note: Splits keep within the confines of start and end rows if provided.
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

Applied to trunk and branch.  Thanks for the patch Lars.

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>             Fix For: 0.20.2, 0.21.0
>
>         Attachments: HBASE-1829-v2.patch, HBASE-1829-v3.patch, HBASE-1829-v4.patch, HBASE-1829.patch, HBaseTestingUtility.java, TestTableInputFormatScan.java
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Commented: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756599#action_12756599 ] 

Jonathan Gray commented on HBASE-1829:
--------------------------------------

There is some interesting stuff in TestTableMapReduce which extends MultiRegionTable.

You can simply insert a bunch of sequential rows and run manual splits to create multiple regions.  There's a unit test out there that does that nicely; I forget which one.  Because you know what the split points will be, it should be pretty easy to at least test the algorithm.
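
As a rough illustration of what such a test could check, here is a minimal, self-contained sketch of pruning and clipping splits against a scan range. It is not the code from the attached patch: the class SplitPruner, the Range type and the prune method are hypothetical names, and row keys are modeled as plain Strings (with the empty string meaning "unbounded") rather than the byte arrays HBase uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not the code from HBASE-1829.patch. */
public class SplitPruner {

    /** A key range [start, stop); an empty string means "unbounded" on that side. */
    public static final class Range {
        public final String start;
        public final String stop;

        public Range(String start, String stop) {
            this.start = start;
            this.stop = stop;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + stop + ")";
        }
    }

    /** Returns one split per region that overlaps the scan, clipped to the scan range. */
    public static List<Range> prune(List<Range> regions, Range scan) {
        List<Range> splits = new ArrayList<>();
        for (Range region : regions) {
            // The region ends at or before the scan start, so it holds no wanted rows.
            boolean endsBeforeScan = !scan.start.isEmpty() && !region.stop.isEmpty()
                    && region.stop.compareTo(scan.start) <= 0;
            // The region starts at or after the scan stop, so it holds no wanted rows.
            boolean startsAfterScan = !scan.stop.isEmpty()
                    && region.start.compareTo(scan.stop) >= 0;
            if (endsBeforeScan || startsAfterScan) {
                continue;
            }
            // Clip the start: the later of the two start keys ("" sorts first).
            String start = scan.start.compareTo(region.start) > 0 ? scan.start : region.start;
            // Clip the stop: the earlier of the two stop keys, treating "" as infinity.
            String stop;
            if (region.stop.isEmpty()) {
                stop = scan.stop;
            } else if (scan.stop.isEmpty()) {
                stop = region.stop;
            } else {
                stop = scan.stop.compareTo(region.stop) < 0 ? scan.stop : region.stop;
            }
            splits.add(new Range(start, stop));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Four regions with known split points d, m and t.
        List<Range> regions = Arrays.asList(
                new Range("", "d"), new Range("d", "m"),
                new Range("m", "t"), new Range("t", ""));
        // A scan from f (inclusive) to p (exclusive) touches only the two middle regions.
        for (Range split : prune(regions, new Range("f", "p"))) {
            System.out.println(split); // prints "[f, m)" then "[m, p)"
        }
    }
}
```

With known split points, a test can assert both the number of splits and their clipped boundaries. Whether the mapper sees only rows inside the scan range then depends on each split scanning only its clipped range, which is the assumption this sketch makes.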

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>            Priority: Minor
>             Fix For: 0.20.1
>
>         Attachments: HBASE-1829.patch
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.


[jira] Commented: (HBASE-1829) Make use of start/stop row in TableInputFormat

Posted by "Lars George (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757855#action_12757855 ] 

Lars George commented on HBASE-1829:
------------------------------------

Michael, yes, that can be pushed out; I just thought I'd dump it here so I can work on it and get comments early.

Jon, thanks for the pointer, I will add a unit test for it this week and test accordingly. 

> Make use of start/stop row in TableInputFormat
> ----------------------------------------------
>
>                 Key: HBASE-1829
>                 URL: https://issues.apache.org/jira/browse/HBASE-1829
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Lars George
>            Assignee: Lars George
>            Priority: Minor
>             Fix For: 0.20.1
>
>         Attachments: HBASE-1829.patch
>
>
> Since we can now specify a start and stop row with the Scan that is handed to the TIF we can reduce the splits to the regions that contain these rows. That allows to test large MR jobs on a single region for example.
