Posted to issues@hbase.apache.org by "Mikhail Bautin (Created) (JIRA)" <ji...@apache.org> on 2012/01/23 23:35:40 UTC

[jira] [Created] (HBASE-5263) Preserving cached data on compactions through cache-on-write

Preserving cached data on compactions through cache-on-write
------------------------------------------------------------

                 Key: HBASE-5263
                 URL: https://issues.apache.org/jira/browse/HBASE-5263
             Project: HBase
          Issue Type: Improvement
            Reporter: Mikhail Bautin
            Assignee: Mikhail Bautin
            Priority: Minor


We are tackling HBASE-3976 and HBASE-5230 to make sure we don't trash the block cache on compactions if cache-on-write is enabled. However, it would be ideal to reduce the effect compactions have on the cached data. For every block we write for a compacted file, we can decide whether it needs to be cached based on whether the original blocks containing the same data were already in the cache.

More precisely, for every HFile reader in a compaction we can maintain a boolean flag saying whether the current key-value came from disk I/O or from the block cache. In the HFile writer for the compaction's output we can maintain a flag that is set if any of the key-values in the block being written came from a cached block. We use that flag at the end of a block to decide whether to cache-on-write the block, and reset it to false on a block boundary.

If such an inclusive approach would still trash the cache, we could restrict the total number of blocks cached per output HFile, switch from "or" logic to "and" logic for deciding whether to cache an output file block, or cache only a certain percentage of the output file blocks that contain some of the previously cached data.

Thanks to Nicolas for this elegant online algorithm idea!
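
For illustration, the per-block flag tracking described above could be sketched roughly as follows. This is only a sketch under assumptions: CachePolicy, CompactionBlockCacheTracker, and their methods are hypothetical names and not existing HBase classes, and the percentage-based variant is left out. It assumes the compaction reader can report, for each key-value, whether its source block was found in the block cache.

```java
// Hypothetical sketch; none of these names are HBase APIs.

/** How the per-key-value "came from cache" flags are combined within one output block. */
enum CachePolicy {
  ANY_CACHED, // "or" logic: cache if at least one key-value came from a cached block
  ALL_CACHED  // "and" logic: cache only if every key-value came from a cached block
}

/** Tracks cache provenance for the output block currently being written. */
final class CompactionBlockCacheTracker {
  private final CachePolicy policy;
  private final int maxCachedBlocks; // cap on cached blocks per output file

  private boolean anyFromCache = false;
  private boolean allFromCache = true;
  private int keyValuesInBlock = 0;
  private int cachedBlocks = 0;

  CompactionBlockCacheTracker(CachePolicy policy, int maxCachedBlocks) {
    this.policy = policy;
    this.maxCachedBlocks = maxCachedBlocks;
  }

  /** Called for each key-value appended; fromCache is the reader's flag for that key-value. */
  void append(boolean fromCache) {
    anyFromCache |= fromCache;
    allFromCache &= fromCache;
    keyValuesInBlock++;
  }

  /** Called at a block boundary: returns the cache-on-write decision and resets the flags. */
  boolean finishBlock() {
    boolean wanted = keyValuesInBlock > 0
        && (policy == CachePolicy.ANY_CACHED ? anyFromCache : allFromCache);
    boolean cache = wanted && cachedBlocks < maxCachedBlocks;
    if (cache) {
      cachedBlocks++;
    }
    // Reset the per-block state for the next output block.
    anyFromCache = false;
    allFromCache = true;
    keyValuesInBlock = 0;
    return cache;
  }
}
```

The writer would call append() once per key-value and finishBlock() whenever it closes a block. Switching the policy or lowering the cap corresponds to the more restrictive fallbacks mentioned in the description.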


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5263) Preserving cached data on compactions through cache-on-write

Posted by "Kannan Muthukkaruppan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205645#comment-13205645 ] 

Kannan Muthukkaruppan commented on HBASE-5263:
----------------------------------------------

Promising idea! 

In terms of the implementation details, it would be nice to avoid some pathological cases... where cold data (which was in the cache but almost on its way out of the cache) becomes hot again. I am guessing a naive approach could have this pitfall, but something that additionally takes into consideration the hotness of the keys in the block and appropriately places the data in the correct place in the blockcache LRU would not. Haven't thought through much about the implementation details... but wanted to throw out the initial thoughts at least.

See also related idea by Liyin here: HBASE-5263. These could be complementary approaches.
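
One possible way to carry hotness across a compaction, sketched here purely as an illustration: track the last-access time of the cached source blocks and let the output block inherit the oldest one, so that nearly evicted data does not re-enter the cache as freshly used. InheritedRecencyTracker is a hypothetical name, the choice of the oldest timestamp is an assumption rather than anything settled in this thread, and the sketch assumes the block cache could expose a per-block last-access time and accept an explicit one on insert.

```java
// Hypothetical sketch; not an HBase API.

/** Computes the recency an output block should inherit from its cached source blocks. */
final class InheritedRecencyTracker {
  /** Returned by finishBlock() when no key-value in the block came from the cache. */
  static final long NOT_CACHED = -1L;

  private long oldestAccessMillis = Long.MAX_VALUE;
  private boolean sawCachedSource = false;

  /**
   * Called for each key-value appended to the current output block.
   * sourceLastAccessMillis is only meaningful when fromCache is true.
   */
  void append(boolean fromCache, long sourceLastAccessMillis) {
    if (fromCache) {
      sawCachedSource = true;
      // Conservative assumption: inherit the coldest (oldest) access time.
      oldestAccessMillis = Math.min(oldestAccessMillis, sourceLastAccessMillis);
    }
  }

  /** Called at a block boundary: returns the inherited access time and resets the state. */
  long finishBlock() {
    long result = sawCachedSource ? oldestAccessMillis : NOT_CACHED;
    oldestAccessMillis = Long.MAX_VALUE;
    sawCachedSource = false;
    return result;
  }
}
```

A cache-on-write path could then place the new block in the LRU according to the returned timestamp instead of the current time. Taking the newest timestamp or a weighted average would be equally plausible alternatives.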
                

[jira] [Commented] (HBASE-5263) Preserving cached data on compactions through cache-on-write

Posted by "Kannan Muthukkaruppan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205667#comment-13205667 ] 

Kannan Muthukkaruppan commented on HBASE-5263:
----------------------------------------------

Zhihong: Yes! Fixed it in place. I had a recursive reference going there... :)
                

[jira] [Commented] (HBASE-5263) Preserving cached data on compactions through cache-on-write

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205648#comment-13205648 ] 

Zhihong Yu commented on HBASE-5263:
-----------------------------------

@Kannan:
I think you were referring to HBASE-5369.
                

[jira] [Issue Comment Edited] (HBASE-5263) Preserving cached data on compactions through cache-on-write

Posted by "Kannan Muthukkaruppan (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205645#comment-13205645 ] 

Kannan Muthukkaruppan edited comment on HBASE-5263 at 2/10/12 7:23 PM:
-----------------------------------------------------------------------

Promising idea! 

In terms of the implementation details, it would be nice to avoid some pathological cases... where cold data (which was in the cache but almost on its way out of the cache) becomes hot again. I am guessing a naive approach could have this pitfall, but something that additionally takes into consideration the hotness of the keys in the block and appropriately places the data in the correct place in the blockcache LRU would not. Haven't thought through much about the implementation details... but wanted to throw out the initial thoughts at least.

See also related idea by Liyin here: HBASE-5369. These could be complementary approaches.
                
      was (Author: kannanm):
    Promising idea! 

In terms of the implementation details, it would be nice to avoid some pathological cases... where cold data (which was in the cache but almost on its way out of the cache) becomes hot again. I am guessing a naive approach could have this pitfall, but something that additionally takes into consideration the hotness of the keys in the block and appropriately places the data in the correct place in the blockcache LRU would not. Haven't thought through much about the implementation details... but wanted to throw out the initial thoughts at least.

See also related idea by Liyin here: HBASE-5639. These could be complementary approaches.
                  

[jira] [Issue Comment Edited] (HBASE-5263) Preserving cached data on compactions through cache-on-write

Posted by "Kannan Muthukkaruppan (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205645#comment-13205645 ] 

Kannan Muthukkaruppan edited comment on HBASE-5263 at 2/10/12 7:23 PM:
-----------------------------------------------------------------------

Promising idea! 

In terms of the implementation details, it would be nice to avoid some pathological cases... where cold data (which was in the cache but almost on its way out of the cache) becomes hot again. I am guessing a naive approach could have this pitfall, but something that additionally takes into consideration the hotness of the keys in the block and appropriately places the data in the correct place in the blockcache LRU would not. Haven't thought through much about the implementation details... but wanted to throw out the initial thoughts at least.

See also related idea by Liyin here: HBASE-5639. These could be complementary approaches.
                
      was (Author: kannanm):
    Promising idea! 

In terms of the implementation details, it would be nice to avoid some pathological cases... where cold data (which was in the cache but almost on its way out of the cache) becomes hot again. I am guessing a naive approach could have this pitfall, but something that additionally takes into consideration the hotness of the keys in the block and appropriately places the data in the correct place in the blockcache LRU would not. Haven't thought through much about the implementation details... but wanted to throw out the initial thoughts at least.

See also related idea by Liyin here: HBASE-5263. These could be complementary approaches.
                  