Posted to issues@hbase.apache.org by "Lars Hofhansl (JIRA)" <ji...@apache.org> on 2012/12/05 04:38:58 UTC

[jira] [Created] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Lars Hofhansl created HBASE-7279:
------------------------------------

             Summary: Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher
                 Key: HBASE-7279
                 URL: https://issues.apache.org/jira/browse/HBASE-7279
             Project: HBase
          Issue Type: Bug
            Reporter: Lars Hofhansl
            Assignee: Lars Hofhansl
             Fix For: 0.96.0, 0.94.4


Did some profiling again.
I think we can gain some performance [1] by passing the buffer, row offset, and row length instead of making a copy of the row key.
That way we can also remove the row key caching (and this patch also removes the timestamp caching). Considering the sheer number of KVs we create, every byte saved is good.

[1] (15-20% when the data is in the block cache and we set up a Filter such that only a single row is returned to the client).
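
To make the idea concrete, here is a minimal sketch of comparing row keys in place via (buffer, offset, length) next to the copying alternative. The class and method names are made up for illustration and are not taken from the patch or from HBase.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper, not part of HBase: compares row keys without copying them. */
final class RowSliceDemo {

    /** Lexicographic, unsigned comparison of two byte ranges; allocates nothing. */
    static int compareRows(byte[] a, int aOff, int aLen, byte[] b, int bOff, int bLen) {
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            int x = a[aOff + i] & 0xff;
            int y = b[bOff + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // Equal on the shared prefix: the shorter range sorts first.
        return aLen - bLen;
    }

    /** The copying alternative: allocates a fresh array on every call. */
    static byte[] copyRow(byte[] buf, int off, int len) {
        return Arrays.copyOfRange(buf, off, off + len);
    }

    public static void main(String[] args) {
        // Two "rows" living inside one shared backing buffer.
        byte[] block = "xxrow1yyrow2".getBytes(StandardCharsets.US_ASCII);
        int inPlace = compareRows(block, 2, 4, block, 8, 4);
        byte[] r1 = copyRow(block, 2, 4);
        byte[] r2 = copyRow(block, 8, 4);
        int copied = compareRows(r1, 0, r1.length, r2, 0, r2.length);
        System.out.println("in place: " + inPlace + ", via copies: " + copied);
    }
}
```

Both paths give the same ordering; the in-place variant just skips the two allocations per comparison.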

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510714#comment-13510714 ] 

stack commented on HBASE-7279:
------------------------------

+1
                
> Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-7279
>                 URL: https://issues.apache.org/jira/browse/HBASE-7279
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Lars Hofhansl
>             Fix For: 0.96.0, 0.94.4
>
>         Attachments: 7279-0.94.txt, 7279-0.94-v2.txt, 7279-0.94-v3.txt, 7279-0.94-v4.txt, 7279-0.96-v1.txt, 7279-0.96-v2.txt
>
>
> Did some profiling again.
> I think we can gain some performance [1] by passing the buffer, row offset, and row length instead of making a copy of the row key.
> That way we can also remove the row key caching (and this patch also removes the timestamp caching). Considering the sheer number of KVs we create, every byte saved is good.
> [1] (15-20% when the data is in the block cache and we set up a Filter such that only a single row is returned to the client).


[jira] [Commented] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510287#comment-13510287 ] 

Lars Hofhansl commented on HBASE-7279:
--------------------------------------

peeked is a local stack reference, so that should be ok

Re: how to verify the row cache is not needed... I looked at every caller of KeyValue.getRow(). These are either:
# tests
# not a hot code path (like getting the first key during a split)
# or from inspection it can be seen that the KV is not used twice

I think we're good on that front.

I am less certain about the timestamp cache, so I could put that back, and we leave that for another patch (after removing the timestamp cache I was not observing any change - neither speedup nor slowdown).

                

[jira] [Commented] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510266#comment-13510266 ] 

stack commented on HBASE-7279:
------------------------------

I don't care whether it is pretty or not.  It's fine to write a bit more code for some savings.  How do you think we could test further whether we are losing out by removing the row and ts caches?

Patch looks good.

Could peeked change under you while you do the below?  Or is it single thread only in here?

+    byte[] row = peeked.getBuffer();
+    int offset = peeked.getRowOffset();
+    short length = peeked.getRowLength();
                

[jira] [Commented] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510314#comment-13510314 ] 

Lars Hofhansl commented on HBASE-7279:
--------------------------------------

Cool. I'll make a trunk version for HadoopQA.
                

[jira] [Updated] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-7279:
---------------------------------

    Attachment: 7279-0.96-v2.txt
    

[jira] [Updated] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-7279:
---------------------------------

    Attachment: 7279-0.94-v4.txt

Missed the test classes.
                

[jira] [Commented] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510301#comment-13510301 ] 

stack commented on HBASE-7279:
------------------------------

[~lhofhansl] Sorry. Remove timestamp caching.
                

[jira] [Commented] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13511092#comment-13511092 ] 

Hudson commented on HBASE-7279:
-------------------------------

Integrated in HBase-0.94 #611 (See [https://builds.apache.org/job/HBase-0.94/611/])
    HBASE-7279 Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher (Revision 1417716)

     Result = FAILURE
larsh : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/KeyValue.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/client/Result.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestQueryMatcher.java

                

[jira] [Commented] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510326#comment-13510326 ] 

Lars Hofhansl commented on HBASE-7279:
--------------------------------------

[~mcorgan] Won't we then produce even more garbage? We can do that for KVs as well, but then we'll produce a *lot* of garbage. Or are you saying that works in your case, because you'll be reusing cells?
                

[jira] [Commented] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510963#comment-13510963 ] 

Lars Hofhansl commented on HBASE-7279:
--------------------------------------

Ran all tests in trunk with 7279-0.96-v3 and they all pass.
Going to commit.
                

[jira] [Commented] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510331#comment-13510331 ] 

Matt Corgan commented on HBASE-7279:
------------------------------------

Yeah - we can't do it with the current immutable KeyValues because of garbage and/or memory bloat, but the encoded scanners reuse a backing object where you can cache as many values as you want because you're reusing everything.  See BufferedDataBlockEncoder.SeekerState (which I'm making implement the Cell interface).  The SeekerState gets reused as the scanner trots along.
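
A rough sketch of the reuse pattern described here, assuming a toy block format of length-prefixed entries (1-byte length, then payload). The Cursor class is purely hypothetical and is not the real SeekerState.

```java
/** Hypothetical illustration of a reused scanner state; not HBase's SeekerState. */
final class ReusableCursorDemo {

    /** Mutable view over one entry in a shared block; one instance serves a whole scan. */
    static final class Cursor {
        byte[] block;
        int offset;
        int length;

        /** Re-points this cursor at another entry; no new object is created. */
        void set(byte[] block, int offset, int length) {
            this.block = block;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Walks all entries with a single Cursor and returns {entryCount, totalPayloadBytes}. */
    static int[] scan(byte[] block) {
        Cursor cursor = new Cursor(); // the only allocation for the whole scan
        int entries = 0;
        int payloadBytes = 0;
        int pos = 0;
        while (pos < block.length) {
            int len = block[pos] & 0xff;
            cursor.set(block, pos + 1, len);
            entries++;
            payloadBytes += cursor.length;
            pos += 1 + len;
        }
        return new int[] {entries, payloadBytes};
    }

    public static void main(String[] args) {
        byte[] block = {2, 'a', 'b', 1, 'c', 3, 'd', 'e', 'f'};
        int[] result = scan(block);
        System.out.println(result[0] + " entries, " + result[1] + " payload bytes");
    }
}
```

Because the cursor is overwritten on every step, any value cached in it costs nothing per entry, but callers must not hold on to it across steps.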
                

[jira] [Updated] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-7279:
---------------------------------

      Resolution: Fixed
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

Committed to 0.94 and 0.96. Thanks for the review.
                

[jira] [Comment Edited] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510295#comment-13510295 ] 

Lars Hofhansl edited comment on HBASE-7279 at 12/5/12 6:07 AM:
---------------------------------------------------------------

[~saint.ack@gmail.com] You mean leave out the timestamp cache, or leave out the change that removes the timestamp cache? :)  I can go either way.

However, 8 bytes is not insignificant (the rest of a KV is just 16 + 24 + 4 + 4 + 4 + 8 = 52). (makes me want to remove the keyLength cache as well for another 4 bytes)

At Salesforce we're doing some scans over close to 1bn rows/kvs (most of which won't be shipped to the client).
The issue with the timestamp cache is that it will use 8 bytes, whether we cache anything or not. Over the 1bn KVs we'll produce 8GB of garbage just for this cache. 

I would like to put this into 0.94 as well.
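
The 8GB estimate above is plain multiplication. A quick check, using decimal gigabytes and a helper name invented for illustration:

```java
/** Hypothetical helper for checking the estimate; not HBase code. */
final class TimestampCacheCost {

    /** Total bytes spent on a fixed-size per-KV field across a scan. */
    static long garbageBytes(long kvCount, int bytesPerKv) {
        return Math.multiplyExact(kvCount, (long) bytesPerKv);
    }

    public static void main(String[] args) {
        // 1bn KVs, each carrying an 8-byte long field for the cached timestamp.
        long bytes = garbageBytes(1_000_000_000L, 8);
        System.out.println(bytes + " bytes = " + (bytes / 1_000_000_000L) + " GB (decimal)");
    }
}
```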

                

                  

[jira] [Updated] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-7279:
---------------------------------

    Attachment: 7279-0.94.txt

Here's a 0.94 patch (that's where I have all my testing set up).
If folks like this patch I'll make a trunk version.

You'll notice that things aren't quite as pretty anymore, with the byte[] + offset and length needing to be passed around.

(We could envision an "ArrayPtr" object, which holds a reference to an array, offset, and length, but then that would be another object to create.)

                

[jira] [Commented] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510293#comment-13510293 ] 

Matt Corgan commented on HBASE-7279:
------------------------------------

{quote}(We could envision an "ArrayPtr" object, which holds a reference to an array, offset, and length, but then that would be another object to create.){quote}
Lars - I submitted a ByteRange class when the Cell interface was committed which is exactly the ArrayPtr.  I use it extensively in the PrefixTree module to make the code more robust and readable, but I always pool and reuse them in the hot code paths.  It makes byte[]'s as easy to use as Strings for comparisons, copying, substrings, collections, sorting, deduping, etc.  Sounds like not the right choice for this situation, but thought I'd point it out so you know about it.
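
For readers unfamiliar with the idea, a minimal sketch of such an array/offset/length wrapper follows. ByteSlice is a hypothetical stand-in written for this illustration; it does not reproduce the actual ByteRange API.

```java
/** Hypothetical "ArrayPtr"-style wrapper; not HBase's ByteRange. */
final class ByteSlice implements Comparable<ByteSlice> {
    private byte[] bytes;
    private int offset;
    private int length;

    /** Re-points this slice at another range so one instance can be pooled and reused. */
    ByteSlice set(byte[] bytes, int offset, int length) {
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
        return this;
    }

    /** Unsigned lexicographic ordering over the referenced bytes. */
    @Override
    public int compareTo(ByteSlice other) {
        int n = Math.min(length, other.length);
        for (int i = 0; i < n; i++) {
            int x = bytes[offset + i] & 0xff;
            int y = other.bytes[other.offset + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return length - other.length;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof ByteSlice && compareTo((ByteSlice) o) == 0;
    }

    @Override
    public int hashCode() {
        int h = 1;
        for (int i = 0; i < length; i++) {
            h = 31 * h + bytes[offset + i];
        }
        return h;
    }
}
```

With equals, hashCode, and compareTo defined over the range, slices can be sorted or used as collection keys much like Strings, as long as a slice is not re-pointed while it sits in a collection.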
                

[jira] [Updated] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-7279:
---------------------------------

    Attachment: 7279-0.96-v3.txt

A reset of the rowcounter (for intra row pagination new to 0.96) got lost in the patch.
Found that through running the tests locally.
                
> Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-7279
>                 URL: https://issues.apache.org/jira/browse/HBASE-7279
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Lars Hofhansl
>             Fix For: 0.96.0, 0.94.4
>
>         Attachments: 7279-0.94.txt, 7279-0.94-v2.txt, 7279-0.94-v3.txt, 7279-0.94-v4.txt, 7279-0.96-v1.txt, 7279-0.96-v2.txt, 7279-0.96-v3.txt
>
>
> Did some profiling again.
> I we can gain some performance [1] when passing buffer, rowoffset, and rowlength instead of making a copy of the row key.
> That way we can also remove the row key caching (and this patch also removes the timestamps caching). Considering the sheer number in which we create KVs, every byte save is good.
> [1] (15-20% when data is in the block cache we setup a Filter such that only a single row is returned to the client).


[jira] [Commented] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510322#comment-13510322 ] 

Matt Corgan commented on HBASE-7279:
------------------------------------

On a related note: one of the benefits of the mutable Cell implementations is that the first time a cell gets parsed out of the data block, we can store all the offset/length variables in nice fast int primitives. I'm trying to convert the existing SeekerState to this right now.

When the cells are travelling through all the scanners/heaps/filters, the methods like getQualifierLength() will simply return the already-calculated primitive int.  With plain KeyValue as it is now, each time getQualifierLength() is called you have to do all of the following, and it may get called many times on the way from disk to client:
{code}
  @Override
  public short getRowLength() {
    return Bytes.toShort(this.bytes, getKeyOffset());
  }
  public int getFamilyOffset(int rlength) {
    return this.offset + ROW_OFFSET + Bytes.SIZEOF_SHORT + rlength + Bytes.SIZEOF_BYTE;
  }
  @Override
  public byte getFamilyLength() {
    return getFamilyLength(getFamilyOffset());
  }
  @Override
  public int getQualifierLength() {
    return getQualifierLength(getRowLength(),getFamilyLength());
  }
  public int getQualifierLength(int rlength, int flength) {
    return getKeyLength() - (int) getKeyDataStructureSize(rlength, flength, 0);
  }
  public static long getKeyDataStructureSize(int rlength, int flength, int qlength) {
    return KeyValue.KEY_INFRASTRUCTURE_SIZE + rlength + flength + qlength;
  }
{code}
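
A minimal sketch of the parse-once idea described above, not HBase code. The class `ParsedCellSketch` and its `encode` helper are hypothetical names introduced for illustration. It assumes the serialized layout implied by the quoted methods: two 4-byte ints (key length, value length), then a 2-byte row length, the row, a 1-byte family length, the family, the qualifier, an 8-byte timestamp, a 1-byte type, and the value.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration: compute every offset/length once, then serve them from int fields. */
final class ParsedCellSketch {
  final byte[] buf;
  final int rowOffset, rowLength;
  final int familyOffset, familyLength;
  final int qualifierOffset, qualifierLength;

  /** Parses the cell that starts at {@code offset} in {@code buf}; no bytes are copied. */
  ParsedCellSketch(byte[] buf, int offset) {
    this.buf = buf;
    ByteBuffer bb = ByteBuffer.wrap(buf);      // big-endian by default
    int keyLength = bb.getInt(offset);         // first int: key length
    int keyOffset = offset + 8;                // skip key-length and value-length ints
    this.rowLength = bb.getShort(keyOffset);   // 2-byte row length
    this.rowOffset = keyOffset + 2;
    this.familyLength = bb.get(rowOffset + rowLength); // 1-byte family length
    this.familyOffset = rowOffset + rowLength + 1;
    this.qualifierOffset = familyOffset + familyLength;
    // Fixed key overhead: 2 (row length) + 1 (family length) + 8 (timestamp) + 1 (type) = 12 bytes.
    this.qualifierLength = keyLength - (12 + rowLength + familyLength);
  }

  /** Returns the precomputed value; nothing is re-read from the buffer. */
  int getQualifierLength() {
    return qualifierLength;
  }

  /** Test helper (assumption): serializes one cell in the layout described above. */
  static byte[] encode(String row, String family, String qualifier,
                       long timestamp, byte type, String value) {
    byte[] r = row.getBytes(StandardCharsets.UTF_8);
    byte[] f = family.getBytes(StandardCharsets.UTF_8);
    byte[] q = qualifier.getBytes(StandardCharsets.UTF_8);
    byte[] v = value.getBytes(StandardCharsets.UTF_8);
    int keyLength = 12 + r.length + f.length + q.length;
    ByteBuffer bb = ByteBuffer.allocate(8 + keyLength + v.length);
    bb.putInt(keyLength).putInt(v.length);
    bb.putShort((short) r.length).put(r);
    bb.put((byte) f.length).put(f);
    bb.put(q);
    bb.putLong(timestamp).put(type);
    bb.put(v);
    return bb.array();
  }
}
```

The trade-off in this sketch is six extra int fields per cell object in exchange for never re-deriving the lengths from the buffer, which only pays off if the same object is reused or queried many times.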
                
> Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-7279
>                 URL: https://issues.apache.org/jira/browse/HBASE-7279
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Lars Hofhansl
>             Fix For: 0.96.0, 0.94.4
>
>         Attachments: 7279-0.94.txt, 7279-0.94-v2.txt


[jira] [Updated] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-7279:
---------------------------------

    Attachment: 7279-0.94-v2.txt

While I'm at it, might as well correct the array size in the Result(List) constructor.
                
> Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-7279
>                 URL: https://issues.apache.org/jira/browse/HBASE-7279
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Lars Hofhansl
>             Fix For: 0.96.0, 0.94.4
>
>         Attachments: 7279-0.94.txt, 7279-0.94-v2.txt


[jira] [Updated] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-7279:
---------------------------------

    Status: Patch Available  (was: Open)
    
> Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-7279
>                 URL: https://issues.apache.org/jira/browse/HBASE-7279
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Lars Hofhansl
>             Fix For: 0.96.0, 0.94.4
>
>         Attachments: 7279-0.94.txt, 7279-0.94-v2.txt, 7279-0.94-v3.txt, 7279-0.96-v1.txt


[jira] [Commented] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510296#comment-13510296 ] 

Lars Hofhansl commented on HBASE-7279:
--------------------------------------

[~mcorgan] Thanks for pointing me at that. Probably not the right choice here. The local offset/length variables are just stack allocated and hence do not produce any garbage (in the GC sense).
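
To make the contrast concrete, here is a rough sketch (not the patch itself) of comparing row keys with and without a copy. `RowCompareSketch` and both method names are made up for illustration; the only assumption is that the row key lives inside a larger backing byte[] and is described by an offset and a length.

```java
import java.util.Arrays;

/** Hypothetical illustration of copy vs. (buffer, offset, length) row comparison. */
final class RowCompareSketch {

  /** Copying variant: allocates a fresh byte[] on the heap for every cell inspected. */
  static boolean sameRowCopying(byte[] buf, int rowOffset, int rowLength, byte[] currentRow) {
    byte[] row = Arrays.copyOfRange(buf, rowOffset, rowOffset + rowLength);
    return Arrays.equals(row, currentRow);
  }

  /** Zero-copy variant: only int locals are used, so no heap garbage is created. */
  static boolean sameRow(byte[] buf, int rowOffset, int rowLength,
                         byte[] curBuf, int curOffset, int curLength) {
    if (rowLength != curLength) {
      return false;
    }
    for (int i = 0; i < rowLength; i++) {
      if (buf[rowOffset + i] != curBuf[curOffset + i]) {
        return false;
      }
    }
    return true;
  }
}
```

Both variants answer the same question; the second simply carries the three values through the call chain instead of materializing the row.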

                
> Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-7279
>                 URL: https://issues.apache.org/jira/browse/HBASE-7279
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Lars Hofhansl
>             Fix For: 0.96.0, 0.94.4
>
>         Attachments: 7279-0.94.txt


[jira] [Updated] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-7279:
---------------------------------

    Attachment: 7279-0.96-v1.txt

And a 0.96 version
                
> Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-7279
>                 URL: https://issues.apache.org/jira/browse/HBASE-7279
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Lars Hofhansl
>             Fix For: 0.96.0, 0.94.4
>
>         Attachments: 7279-0.94.txt, 7279-0.94-v2.txt, 7279-0.94-v3.txt, 7279-0.96-v1.txt


[jira] [Updated] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-7279:
---------------------------------

    Attachment: 7279-0.94-v3.txt

0.94 patch with correct heapsize (needed to also remove a REFERENCE).
                
> Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-7279
>                 URL: https://issues.apache.org/jira/browse/HBASE-7279
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Lars Hofhansl
>             Fix For: 0.96.0, 0.94.4
>
>         Attachments: 7279-0.94.txt, 7279-0.94-v2.txt, 7279-0.94-v3.txt, 7279-0.96-v1.txt


[jira] [Commented] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510288#comment-13510288 ] 

stack commented on HBASE-7279:
------------------------------

Leave it out I'd say.  We'll be profiling 0.96.  Should show its head again if a prob.  Can put it back then.
                
> Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-7279
>                 URL: https://issues.apache.org/jira/browse/HBASE-7279
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Lars Hofhansl
>             Fix For: 0.96.0, 0.94.4
>
>         Attachments: 7279-0.94.txt


[jira] [Commented] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13511067#comment-13511067 ] 

Hudson commented on HBASE-7279:
-------------------------------

Integrated in HBase-TRUNK #3594 (See [https://builds.apache.org/job/HBase-TRUNK/3594/])
    HBASE-7279 Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher (Revision 1417715)

     Result = FAILURE
larsh : 
Files : 
* /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/KeyValue.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestQueryMatcher.java

                
> Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-7279
>                 URL: https://issues.apache.org/jira/browse/HBASE-7279
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Lars Hofhansl
>             Fix For: 0.96.0, 0.94.4
>
>         Attachments: 7279-0.94.txt, 7279-0.94-v2.txt, 7279-0.94-v3.txt, 7279-0.94-v4.txt, 7279-0.96-v1.txt, 7279-0.96-v2.txt, 7279-0.96-v3.txt


[jira] [Commented] (HBASE-7279) Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510295#comment-13510295 ] 

Lars Hofhansl commented on HBASE-7279:
--------------------------------------

You leave out the timestamp cache, or leave out the change that removes the timestamp cache? :)  I can go either way.

However, 8 bytes is not insignificant (the rest of a KV is just 16 + 24 + 4 + 4 + 4 + 8 = 52). (makes me want to remove the keyLength cache as well for another 4 bytes)

At Salesforce we're doing some scans over close to 1bn rows/kvs (most of which won't be shipped to the client).
The issue with the timestamp cache is that it will use 8 bytes, whether we cache anything or not. Over the 1bn KVs we'll produce 8GB of garbage just for this cache. 
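
The 8GB figure is plain multiplication, sketched here for reference. `TimestampCacheCost` is a hypothetical helper, and the sketch assumes only that each KV object carries one 8-byte long field for the cached timestamp, whether or not it is ever filled.

```java
/** Hypothetical back-of-the-envelope helper; not part of HBase. */
final class TimestampCacheCost {

  /** Bytes allocated for one fixed-size per-object field across all KVs created by a scan. */
  static long fieldBytes(long kvCount, int bytesPerKv) {
    return kvCount * bytesPerKv;
  }

  /** Same figure in decimal gigabytes (10^9 bytes). */
  static double fieldGigabytes(long kvCount, int bytesPerKv) {
    return fieldBytes(kvCount, bytesPerKv) / 1e9;
  }
}
```

With `kvCount = 1_000_000_000L` and `bytesPerKv = Long.BYTES`, `fieldBytes` returns 8,000,000,000 bytes, matching the estimate above.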

I would like to put this into 0.94 as well.

                
> Avoid copying the rowkey in RegionScanner, StoreScanner, and ScanQueryMatcher
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-7279
>                 URL: https://issues.apache.org/jira/browse/HBASE-7279
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Lars Hofhansl
>             Fix For: 0.96.0, 0.94.4
>
>         Attachments: 7279-0.94.txt
