You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@accumulo.apache.org by "Keith Turner (JIRA)" <ji...@apache.org> on 2012/07/13 01:49:35 UTC

[jira] [Comment Edited] (ACCUMULO-473) Support binary search within RFile blocks

    [ https://issues.apache.org/jira/browse/ACCUMULO-473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411878#comment-13411878 ] 

Keith Turner edited comment on ACCUMULO-473 at 7/12/12 11:48 PM:
-----------------------------------------------------------------

I experimented with dynamically indexing cached rfile blocks and the results were promising.  For my experiments I created a Rfile with 100K entries and a 50 byte value.  I ran the attached Test.java program.  This program seeks every key in the file backwards, forwards, and randomly.  The rfile block size was 100k and the file had 69 blocks, so each block had around 1.4K entries. 

||Test||Lookup Rate w/o transient index||Lookup Rate w/ transient index||
|Backwards| 8.5K lookup/sec | 86K lookup/sec |
|Random   | 8.5K lookup/sec | 75K lookup/sec |
|Forwards | 452K lookup/sec | 500K lookup/sec |

For the backwards test I took the last three numbers, the first two were slower as a result of the JIT not kicking in.

For these test there was a 9x to 10x speedup.  For each block of 1.4K keys an index of 31 keys was eventually created, then it stopped indexing (I added some prints to the code to get this info).  The code does not create indexes where the keys in the block are closer than 32 keys.

These test show that seeking forwards is really fast, the code was already optimized for this.  So if an iterator can seek forwards it should, but this change will help with seeking anywhere.  I am not sure why seeking forwards is so much faster.  My current theory is that alot more object creation is done when seeking backwards or randomly.  I have not confirmed this.  Even when I adjust dynamic index code to create larger indexes, its still does not approach the forward seek speed.  
                
      was (Author: kturner):
    I experimented with dynamically indexing cached rfile blocks and the results were promising.  For my experiments I created a Rfile with 100K entries and a 50 byte value.  I ran the attached Test.java program.  This program seeks every key in the file backwards, forwards, and randomly.  The rfile block size was 100k and the file had 100 block, so each block had around 1.4K entries. 

||Test||Lookup Rate w/o transient index||Lookup Rate w/ transient index||
|Backwards| 8.5K lookup/sec | 86K lookup/sec |
|Random   | 8.5K lookup/sec | 75K lookup/sec |
|Forwards | 452K lookup/sec | 500K lookup/sec |

For the backwards test I took the last three numbers, the first two were slower as a result of the JIT not kicking in.

For these test there was a 9x to 10x speedup.  For each block of 1.4K keys an index of 31 keys was eventually created, then it stopped indexing (I added some prints to the code to get this info).  The code does not created indexes where the keys in the block are closer than 32 keys.

These test show that seeking forwards is really fast, the code was already optimized for this.  So if an iterator can seek forwards it should, but this change will help with seeking anywhere.  I am not sure why seeking forwards is so much faster.  My current theory is that alot more object creation is done when seeking backwards or randomly.  I have not confirmed this.  Even when I adjust dynamic index code to create larger indexes, its still does not approach the forward seek speed.  
                  
> Support binary search within RFile blocks
> -----------------------------------------
>
>                 Key: ACCUMULO-473
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-473
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>             Fix For: 1.5.0
>
>         Attachments: ACCUMULO-473-1.txt, ACCUMULO-473-2.txt, ACCUMULO-473-3.txt, Test.java
>
>
> RFiles store blocks of key values using relative encoding.  By default these blocks are small (100k).  To find a key in an RFile the index is used to find the block, then the block is scanned for the key.  It would be nice for iterators that do alot of seeking if a binary search were done instead of a scan.  
> Relative encoding is a form of compression that serializes a key relative to the last key.  For example if the row in a key is the same as the previous key, then the row is not stored again.  This works well with sorted data.   The current on disk format does not support random access within a block, because to read entry N all previous entries must be read.   One option would be to deserialize the block into a form that supports random access and then do binary search.  However this will consume memory and CPU.  If the relative encoding format could be modified to support random access, then binary search could be supported in a memory and CPU efficient way.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira