You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@hbase.apache.org by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org> on 2010/09/29 02:20:32 UTC

[jira] Created: (HBASE-3048) unify code for major/minor compactions

unify code for major/minor compactions
--------------------------------------

                 Key: HBASE-3048
                 URL: https://issues.apache.org/jira/browse/HBASE-3048
             Project: HBase
          Issue Type: Improvement
            Reporter: Kannan Muthukkaruppan


Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.

Unifying minor compactions to do the same thing as major compactions has other advantages:

* If the same data is overwritten several times and we are not processing overwrites, it makes each subsequent minor compaction more expensive as the total amount of data.

* We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.

-

Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3048) unify code for major/minor compactions

Posted by "Pranav Khaitan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917300#action_12917300 ] 

Pranav Khaitan commented on HBASE-3048:
---------------------------------------

This can be done using the following simple change:
Currently, MajorCompaction uses StoreScanner while MinorCompaction uses MinorCompactionStoreScanner which does a subset of things done by StoreScanner. We now want MinorCompaction to do most of the things being done by StoreScanner. Therefore, I would suggest making a flag in the StoreScanner which says if deletes should be processed or not. This flag will be ON by default and will only be set to OFF during a minor compaction. The only alternative to this idea would result in duplicating code which doesn't seem like a good idea.

> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3048) unify code for major/minor compactions

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929168#action_12929168 ] 

HBase Review Board commented on HBASE-3048:
-------------------------------------------

Message from: "Jonathan Gray" <jg...@apache.org>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1185/
-----------------------------------------------------------

Review request for hbase, stack, Kannan Muthukkaruppan, and Ryan Rawson.


Summary
-------

This is a rebased patch from Amit.  This was internally reviewed but we may want some minor modifications for 0.90 with the new InternalScanner, if possible.  (suggestion from ryan on jira)


This addresses bug HBASE-3048.
    http://issues.apache.org/jira/browse/HBASE-3048


Diffs
-----

  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/MinorCompactingStoreScanner.java 1031745 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java 1031745 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java 1031745 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java 1031745 
  trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java 1031745 
  trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestMinorCompactingStoreScanner.java 1031745 

Diff: http://review.cloudera.org/r/1185/diff


Testing
-------

Lot's of good new additions by Amit to TestCompaction which passes.


Thanks,

Jonathan




> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>         Attachments: HBASE-3048-0.90-v2.patch, unify.patch
>
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (HBASE-3048) unify code for major/minor compactions

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack reassigned HBASE-3048:
----------------------------

    Assignee: Amitanand Aiyer

> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-3048) unify code for major/minor compactions

Posted by "Amitanand Aiyer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amitanand Aiyer updated HBASE-3048:
-----------------------------------

    Status: Patch Available  (was: Open)

Hey Guys,
  Here is an initial diff with the changes to implement the unification. Would like to add in some more changes to the TestCompaction code to make it more exhaustive.

  If you have any comments/suggestions that would be great!

Thanks,
-Amit


> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3048) unify code for major/minor compactions

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916006#action_12916006 ] 

stack commented on HBASE-3048:
------------------------------

This is fine by me.  The one objection I was going to raise was the delete markers story but you got that in your footnote.

> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same data is overwritten several times and we are not processing overwrites, it makes each subsequent minor compaction more expensive as the total amount of data.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3048) unify code for major/minor compactions

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930339#action_12930339 ] 

stack commented on HBASE-3048:
------------------------------

@Kannan Thanks for reporting back.

> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>             Fix For: 0.90.0
>
>         Attachments: HBASE-3048-0.90-v2.patch, HBASE-3048-v3.patch, unify.patch
>
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3048) unify code for major/minor compactions

Posted by "Amitanand Aiyer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926165#action_12926165 ] 

Amitanand Aiyer commented on HBASE-3048:
----------------------------------------

Hi Stack, can you try again. I've uploaded it again now.

> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>         Attachments: unify.patch
>
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3048) unify code for major/minor compactions

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926169#action_12926169 ] 

stack commented on HBASE-3048:
------------------------------

Ohh, radical.

All tests pass?

Patch looks good to me.   I like the unit tests that assert stuff works as it used to (Thats what you are doing as I read this, right?)

Good stuff Amit. 

> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>         Attachments: unify.patch
>
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3048) unify code for major/minor compactions

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926508#action_12926508 ] 

ryan rawson commented on HBASE-3048:
------------------------------------

The code is using an additional flag to the StoreScanner constructor, perhaps we should use the InternalScan infrastructure added by HBASE-3082 to keep this as clean as possible?  That seems to be the primary criticism of this patch I think.  The rest looks good, perhaps instead of "includeDeletes" it should be called "keepDeletesInOutput" or something really descriptive like that?  Write one read many and all that.

> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>         Attachments: unify.patch
>
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3048) unify code for major/minor compactions

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929315#action_12929315 ] 

Kannan Muthukkaruppan commented on HBASE-3048:
----------------------------------------------

+1 on patch. Looks good to me.

> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>         Attachments: HBASE-3048-0.90-v2.patch, unify.patch
>
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3048) unify code for major/minor compactions

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926163#action_12926163 ] 

stack commented on HBASE-3048:
------------------------------

@Amit Did you attach a patch?  I don't see it.  Thanks.

> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3048) unify code for major/minor compactions

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917359#action_12917359 ] 

Kannan Muthukkaruppan commented on HBASE-3048:
----------------------------------------------

Hi Pranav! yes, the flag approach should work well. The intention is to unify not only the processing details but the code/logic too. 

> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-3048) unify code for major/minor compactions

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-3048:
---------------------------------

       Resolution: Fixed
    Fix Version/s: 0.90.0
     Release Note: minor compactions now enforce ttl, maxversions, deletes, etc... but will let through the delete markers themselves
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

Committed to trunk.  Thanks for the great work Amit.  Thanks for reviews from Ryan and Kannan.

> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>             Fix For: 0.90.0
>
>         Attachments: HBASE-3048-0.90-v2.patch, HBASE-3048-v3.patch, unify.patch
>
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-3048) unify code for major/minor compactions

Posted by "Amitanand Aiyer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amitanand Aiyer updated HBASE-3048:
-----------------------------------

    Attachment: unify.patch

> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>         Attachments: unify.patch
>
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3048) unify code for major/minor compactions

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926330#action_12926330 ] 

Kannan Muthukkaruppan commented on HBASE-3048:
----------------------------------------------

Amit,

This overload of StoreScanner which takes a set of columns:

{code}
  StoreScanner(Store store, Scan scan, final NavigableSet<byte[]> columns,
                              boolean includeDeletes) throws IOException {
{code}

doesn't seem to be ever called with anything other than "false". So, perhaps no need to introduce this.

You can instead have the old method:

{code}
  StoreScanner(Store store, Scan scan, final NavigableSet<byte[]> columns) 
                              throws IOException {
{code}

simply call ScanQueryMatcher with "false" for includeDeletes.




> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>         Attachments: unify.patch
>
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3048) unify code for major/minor compactions

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929449#action_12929449 ] 

HBase Review Board commented on HBASE-3048:
-------------------------------------------

Message from: "Ryan Rawson" <ry...@gmail.com>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1185/#review1837
-----------------------------------------------------------

Ship it!


ok lets go with this

- Ryan





> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>         Attachments: HBASE-3048-0.90-v2.patch, unify.patch
>
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3048) unify code for major/minor compactions

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929261#action_12929261 ] 

HBase Review Board commented on HBASE-3048:
-------------------------------------------

Message from: "Jonathan Gray" <jg...@apache.org>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1185/
-----------------------------------------------------------

(Updated 2010-11-06 15:50:30.615330)


Review request for hbase, stack, Kannan Muthukkaruppan, and Ryan Rawson.


Changes
-------

Minor cleanup of whitespace and stuff.


Summary
-------

This is a rebased patch from Amit.  This was internally reviewed but we may want some minor modifications for 0.90 with the new InternalScanner, if possible.  (suggestion from ryan on jira)


This addresses bug HBASE-3048.
    http://issues.apache.org/jira/browse/HBASE-3048


Diffs (updated)
-----

  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/MinorCompactingStoreScanner.java 1032175 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java 1032175 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java 1032175 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java 1032175 
  trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestCompaction.java 1032175 
  trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestMinorCompactingStoreScanner.java 1032175 

Diff: http://review.cloudera.org/r/1185/diff


Testing
-------

Lot's of good new additions by Amit to TestCompaction which passes.


Thanks,

Jonathan




> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>         Attachments: HBASE-3048-0.90-v2.patch, unify.patch
>
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-3048) unify code for major/minor compactions

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-3048:
---------------------------------

    Attachment: HBASE-3048-v3.patch

v3 patch is final one for commit.

> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>         Attachments: HBASE-3048-0.90-v2.patch, HBASE-3048-v3.patch, unify.patch
>
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-3048) unify code for major/minor compactions

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kannan Muthukkaruppan updated HBASE-3048:
-----------------------------------------

    Description: 
Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.

Unifying minor compactions to do the same thing as major compactions has other advantages:

* If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.

* We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.

-

Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.


  was:
Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.

Unifying minor compactions to do the same thing as major compactions has other advantages:

* If the same data is overwritten several times and we are not processing overwrites, it makes each subsequent minor compaction more expensive as the total amount of data.

* We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.

-

Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.



> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3048) unify code for major/minor compactions

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929263#action_12929263 ] 

HBase Review Board commented on HBASE-3048:
-------------------------------------------

Message from: "Jonathan Gray" <jg...@apache.org>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1185/#review1835
-----------------------------------------------------------


Looking at this now, it's at completely different level than the new InternalScan we have.  That deals with scanner selection, this deals with an option in the QueryMatcher.  There are already a bunch of options in SQM so we could consolidate at some point, but it's only a single level so doesn't seem so bad.

I'm +1 on it (it's not my patch) but want to get someone else to review.

- Jonathan





> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>         Attachments: HBASE-3048-0.90-v2.patch, unify.patch
>
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-3048) unify code for major/minor compactions

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-3048:
---------------------------------

    Attachment: HBASE-3048-0.90-v2.patch

This is a rebased patch for trunk.  We committed this to our internal 0.89-based repo.

> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>         Attachments: HBASE-3048-0.90-v2.patch, unify.patch
>
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3048) unify code for major/minor compactions

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930279#action_12930279 ] 

Kannan Muthukkaruppan commented on HBASE-3048:
----------------------------------------------

Prakash reported that he was seeing really good benefit of this optimization for his "incr" use case (which essentially rewrites the same keys over and over again). So doing delete processing in minor compactions reduce the working set nicely, and also improves block cache efficiencies and so on.



> unify code for major/minor compactions
> --------------------------------------
>
>                 Key: HBASE-3048
>                 URL: https://issues.apache.org/jira/browse/HBASE-3048
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Amitanand Aiyer
>             Fix For: 0.90.0
>
>         Attachments: HBASE-3048-0.90-v2.patch, HBASE-3048-v3.patch, unify.patch
>
>
> Today minor compactions do not process deletes, purge old versions, etc. Only major compactions do.  The rationale was probably to save CPU (?). We should evaluate if major compaction logic indeed runs significantly slower.
> Unifying minor compactions to do the same thing as major compactions has other advantages:
> * If the same keys are deleted/updated repeatedly, the fact that deletes/overwrites are not processed during minor compaction makes each subsequent minor compaction more expensive as the total amount of data keeps growing.
> * We'll have fewer bugs if the logic is as symmetric as possible. Any bugs in TTL enforcement, version enforcement, etc. could cause behavior to be different after a major compaction. Keeping the same logic means these bugs will get caught earlier.
> -
> Note: There will still need to be one difference in the two schemes, and that has to do with delete markers. Any compaction which doesn't compact all files will still need to leave delete markers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.