Posted to issues@hbase.apache.org by "Raymond Liu (JIRA)" <ji...@apache.org> on 2013/02/27 09:13:14 UTC

[jira] [Commented] (HBASE-4433) avoid extra next (potentially a seek) if done with column/row

    [ https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588107#comment-13588107 ] 

Raymond Liu commented on HBASE-4433:
------------------------------------

I ran into an issue related to this one. Consider a table whose rows do not have multiple versions, so each row has only a single version. In that case a next() operation reads in the next column's KeyValue and matches the next column without a seek operation, so next() actually saves time and improves performance. In my test scanning a 200G table, using next instead of seek was about 30% faster: roughly 190s vs. 250s.

So I think this behavior may need to be treated differently in different situations. A read-only table with one version per row is, I think, also a very typical case, and for that case this patch actually makes performance worse.
                
> avoid extra next (potentially a seek) if done with column/row
> -------------------------------------------------------------
>
>                 Key: HBASE-4433
>                 URL: https://issues.apache.org/jira/browse/HBASE-4433
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Kannan Muthukkaruppan
>             Fix For: 0.92.0
>
>
> [Noticed this in 89, but quite likely true of trunk as well.]
> When we are done with the requested column(s) the code still does an extra next() call before it realizes that it is actually done. This extra next() call could potentially result in an unnecessary extra block load. This is likely to be especially bad for CFs where the KVs are large blobs where each KV may be occupying a block of its own. So the next() can often load a new unrelated block unnecessarily.
> --
> For the simple case of reading say the top-most column in a row in a single file, where each column (KV) was say a block of its own-- it seems that we are reading 3 blocks, instead of 1 block!
> I am working on a simple patch and with that the number of seeks is down to 2. 
> [There is still an extra seek left.  I think there were two levels of extra/unnecessary next() we were doing without actually confirming that the next was needed. One is at the StoreScanner/ScanQueryMatcher level, which this diff avoids. I think the other is at hfs.next() (at the storefile scanner level), which happens whenever an HFile scanner serves out data, and perhaps that's the additional seek that we need to avoid. But I want to tackle this optimization first as the two issues seem unrelated.]
> -- 
> The basic idea of the patch I am working on/testing is as follows. The ExplicitColumnTracker currently returns "INCLUDE" to the ScanQueryMatcher if the KV needs to be included, and then, if done, returns the appropriate SEEK_NEXT_COL or SEEK_NEXT_ROW hint only on the next call. For the cases when ExplicitColumnTracker knows it is done with a particular column/row, the patch attempts to combine the INCLUDE code and the done hint into a single match code: INCLUDE_AND_SEEK_NEXT_COL or INCLUDE_AND_SEEK_NEXT_ROW.
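
The combined match codes described in the quoted text can be illustrated with a small self-contained sketch. This is not the HBase implementation: the class `CombinedMatchCodeSketch`, the `Tracker` type, and its `check` method are hypothetical stand-ins for ExplicitColumnTracker, and qualifiers are plain strings. Only the match code names come from the issue description. The point it shows is that the tracker reports "include this KV" and "you are done with this column/row" in a single return value, so the caller does not need an extra next() just to learn that it is done.

```java
import java.util.List;

public class CombinedMatchCodeSketch {

    // Match code names taken from the issue description.
    enum MatchCode {
        INCLUDE,
        SEEK_NEXT_COL,
        SEEK_NEXT_ROW,
        INCLUDE_AND_SEEK_NEXT_COL,
        INCLUDE_AND_SEEK_NEXT_ROW
    }

    /** Hypothetical, simplified column tracker for a single row. */
    static final class Tracker {
        private final List<String> wanted; // requested qualifiers, sorted ascending
        private final int maxVersions;     // versions to return per column
        private int index = 0;             // position in 'wanted'
        private int versionsSeen = 0;      // versions returned for wanted[index]

        Tracker(List<String> wanted, int maxVersions) {
            this.wanted = wanted;
            this.maxVersions = maxVersions;
        }

        /** Decide what to do with a KV whose qualifier is given. */
        MatchCode check(String qualifier) {
            while (index < wanted.size()) {
                int cmp = qualifier.compareTo(wanted.get(index));
                if (cmp < 0) {
                    // KV sorts before the next wanted column: skip ahead.
                    return MatchCode.SEEK_NEXT_COL;
                }
                if (cmp == 0) {
                    versionsSeen++;
                    if (versionsSeen < maxVersions) {
                        // More versions of this column may still be wanted.
                        return MatchCode.INCLUDE;
                    }
                    // Done with this column: say so in the same return value.
                    versionsSeen = 0;
                    index++;
                    return index == wanted.size()
                            ? MatchCode.INCLUDE_AND_SEEK_NEXT_ROW
                            : MatchCode.INCLUDE_AND_SEEK_NEXT_COL;
                }
                // KV sorts after wanted[index]: that column is exhausted or absent.
                versionsSeen = 0;
                index++;
            }
            // No requested columns remain in this row.
            return MatchCode.SEEK_NEXT_ROW;
        }
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker(List.of("a", "c"), 1);
        for (String qualifier : List.of("a", "b", "c")) {
            System.out.println(qualifier + " -> " + tracker.check(qualifier));
        }
    }
}
```

In this sketch a caller that receives INCLUDE_AND_SEEK_NEXT_ROW for the last requested column can emit the KV and move on immediately. Whether the caller then acts on the hint with a real seek or with a plain next() is a separate choice, which is the trade-off raised in the comment above for tables with a single version per row.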

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira