Posted to issues@hbase.apache.org by "Matt Corgan (Created) (JIRA)" <ji...@apache.org> on 2011/10/26 01:14:32 UTC

[jira] [Created] (HBASE-4676) Prefix Compression - Trie data block encoding

Prefix Compression - Trie data block encoding
---------------------------------------------

                 Key: HBASE-4676
                 URL: https://issues.apache.org/jira/browse/HBASE-4676
             Project: HBase
          Issue Type: New Feature
          Components: io
    Affects Versions: 0.94.0
            Reporter: Matt Corgan


The HBase data block format has room for two significant improvements for applications that have high block cache hit ratios.

First, there is no prefix compression, and the current KeyValue format is somewhat metadata-heavy, so there can be tremendous memory bloat for many common data layouts, specifically those with long keys and short values.

Second, there is no random access to KeyValues inside data blocks.  This means that every time you double the data block size, average seek time (or average CPU consumption) goes up by a factor of 2.  The standard 64KB block size is ~10x slower for random seeks than a 4KB block size, but block sizes as small as 4KB cause problems elsewhere.  Using block sizes of 256KB or 1MB or more may be more efficient from a disk-access and block-cache perspective in many big-data applications, but doing so is infeasible from a random-seek perspective.

The PrefixTrie block encoding format attempts to solve both of these problems.  Some features:

* the trie format for row key encoding completely eliminates duplicate row keys and encodes similar row keys into a standard trie structure, which also saves a lot of space
* the column family is currently stored once at the beginning of each block; this could easily be modified to allow multiple family names per block
* all qualifiers in the block are stored in their own trie format, which caters nicely to wide rows; duplicate qualifiers between rows are eliminated, and the size of this trie determines the width of the block's qualifier fixed-width int
* the minimum timestamp is stored at the beginning of the block, and deltas are calculated from that; the maximum delta determines the width of the block's timestamp fixed-width int (see the width-calculation sketch after this list)
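
For illustration, the timestamp width calculation could look like the following minimal Java sketch; the class and method names are invented for the example and are not from the actual encoder.

    // Hypothetical illustration of the fixed-width timestamp encoding described
    // above; names are invented and not from the patch.
    public class TimestampDeltaWidth {

        /** Smallest byte count that holds the largest delta from the block's minimum timestamp. */
        static int deltaWidthBytes(long[] timestamps) {
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (long t : timestamps) {
                min = Math.min(min, t);
                max = Math.max(max, t);
            }
            long maxDelta = max - min;
            int width = 1;
            while (width < 8 && (maxDelta >>> (8 * width)) != 0) {
                width++;
            }
            return width;  // every cell then stores its delta in exactly this many bytes
        }
    }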

The block is structured with metadata at the beginning, then a section for the row trie, then the column trie, then the timestamp deltas, and then all the values.  Most work is done in the row trie, where every leaf node (corresponding to a row) contains a list of offsets/references corresponding to the cells in that row.  Each cell is fixed-width to enable binary searching and is represented by [1 byte operationType, X bytes qualifier offset, X bytes timestamp delta offset].
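
Because every cell entry in a row is the same width, the i-th entry sits at a computable offset, which is what makes binary search possible.  A hedged sketch of that layout follows; the accessor names and surrounding class are assumptions for illustration, not the patch's real API.

    // Sketch of the fixed-width per-cell entry described above; names are
    // illustrative assumptions only.
    final class CellEntryReader {
        private final byte[] block;
        private final int entriesOffset;   // start of one row's cell entry list
        private final int qualifierWidth;  // block-wide fixed width, from block metadata
        private final int timestampWidth;  // block-wide fixed width, from block metadata

        CellEntryReader(byte[] block, int entriesOffset, int qualifierWidth, int timestampWidth) {
            this.block = block;
            this.entriesOffset = entriesOffset;
            this.qualifierWidth = qualifierWidth;
            this.timestampWidth = timestampWidth;
        }

        /** Entry layout: [1 byte operationType][qualifierWidth bytes][timestampWidth bytes]. */
        int entryWidth() {
            return 1 + qualifierWidth + timestampWidth;
        }

        /** Fixed width means O(1) access to the i-th cell, so binary search works. */
        int entryOffset(int i) {
            return entriesOffset + i * entryWidth();
        }

        /** The first byte of each entry is the operation type. */
        byte opType(int i) {
            return block[entryOffset(i)];
        }
    }

With this layout, a binary search over a row's cells is just index arithmetic on entryOffset(i).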

If all operation types are the same for a block, there will be zero per-cell overhead.  Same for timestamps, and same for qualifiers when I get a chance.  So the compression aspect is very strong, but it makes a few small sacrifices on VarInt size to enable faster binary searches in trie fan-out nodes.

A more compressed but slower version might build on this by also applying further (suffix, etc.) compression to the trie nodes, at the cost of slower write speed.  Even further compression could be obtained by using all VInts instead of FInts, with a (not huge) sacrifice in random seek speed.

One drawback is the current write speed.  While programmed with good constructs like TreeMaps, ByteBuffers, binary searches, etc., it's not programmed with the same level of optimization as the read path.  Work will need to be done to optimize the data structures used for encoding, which could probably yield a 10x improvement.  It will still be slower than delta encoding, but with a much higher decode speed.  I have not yet created a thorough benchmark for write speed or sequential read speed.

Though the trie is reaching a point where it is internally very efficient (probably within a half or a quarter of its max read speed), the way that HBase currently uses it is far from optimal.  The KeyValueScanner and related classes that iterate through the trie will eventually need to be smarter and have methods to do things like skipping to the next row of results without scanning every cell in between.  When that is accomplished, it will also allow much faster compactions because the full row key will not have to be compared as often as it is now.

Current code is on GitHub.  The trie code is in a separate project from the slightly modified HBase.  There is an HBase project there as well with the DeltaEncoding patch applied, and the trie code builds on top of that.

https://github.com/hotpads/hbase/tree/delta-encoding-plus-trie
https://github.com/hotpads/hbase-prefix-trie

I'll follow up later with more implementation ideas.


[jira] [Updated] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Corgan updated HBASE-4676:
-------------------------------

    Attachment: SeeksPerSec by blockSize.png

See the attached chart for a comparison of seeks/second between the current block format (NONE), the PREFIX delta encoding, and the TRIE encoding.  It shows how important random block access is when blocks are at the default size, let alone larger.  With 16 software threads running on a 4-core (8 hyperthread) CPU, the current block format NONE can do ~38k seeks/s, the PREFIX encoding can do ~20k, and the TRIE encoding can do ~1.6 million.  So seeks are roughly 40-80x faster at the medium block sizes.

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485834#comment-13485834 ] 

Hadoop QA commented on HBASE-4676:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12551154/HBASE-4676-prefix-tree-trunk-v3.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 143 new or modified tests.

    +1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

    -1 javadoc.  The javadoc tool appears to have generated 102 warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 49 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3170//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3170//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3170//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3170//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3170//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3170//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3170//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3170//console

This message is automatically generated.
                

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238948#comment-13238948 ] 

Matt Corgan commented on HBASE-4676:
------------------------------------

> Do we not have this now Matt?
I don't believe so.  Right now, HFileReaderV2.blockSeek(..) does a brute-force loop through each KV in the block and does a full KV comparison at each KV (see the sketch below).  Say there are 1000 KVs in the block and 20 per row: on average you will have to do 500 full KV comparisons just to get to your key, and even if you are selecting all 20 KVs in the row, you still do 500 comparisons on average to get to the start of the row.  It's also very important to remember that these are long keys, and they most likely share a common prefix since they got sorted to the same block, so your comparators are churning through the same prefix bytes over and over.
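
For reference, that brute-force pattern looks roughly like the following; this is a minimal sketch with an assumed comparator and a flat KV array, not the actual HFileReaderV2 code.

    import java.util.Comparator;

    final class LinearBlockSeek {
        /** Linear scan: n/2 full-key comparisons on average per seek within a block. */
        static <K> int blockSeek(K[] kvsInBlock, K target, Comparator<K> fullKeyComparator) {
            for (int i = 0; i < kvsInBlock.length; i++) {
                // every iteration re-compares the (often long, shared) key prefix
                if (fullKeyComparator.compare(kvsInBlock[i], target) >= 0) {
                    return i;  // first KV at or after the target
                }
            }
            return kvsInBlock.length;  // target sorts past the end of the block
        }
    }

The trie encoding sidesteps both costs: fan-out nodes allow binary-search-style descent, and a shared prefix is traversed once rather than re-compared at every KV.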

> The slow write and scan speeds would tend to rule it out for anything but a random read workload but when random reading only
Yeah, it's not really targeted at long scans on cold data, but there are some cases in between long cold scans and hot point queries.  Some factors: if it's warm data, your block cache is now 6x bigger.  Then there are many scans that are really sub-block prefix scans to get all the keys in a row (the 20 of 1000 cells I mentioned above).  If your scan will have a low hit ratio, then the ability to jump between rows inside a block will lessen CPU usage.

It performs best when doing lots of random Gets on hot data (in the block cache); possibly a persistent memcached alternative, especially with SSDs.  I believe the current system is actually limited by CPU when doing high-throughput Gets on data with long keys and short values.  The trie's individual request latency may not be much different, but a single server could serve many more parallel requests before maxing out the CPU.

The bigger the value/key ratio of your data, the smaller the difference between the trie and anything else.  It seems like many people now have big values, so I doubt they would see a difference.  I'm more trying to enable smarter schemas with compound primary keys.

> Regards your working set read testing, did it all fit in memory?
Yes; I'm trying to do the purest comparison of CPU usage possible, leaving it up to others to extrapolate the results to what happens with the effectively bigger block cache, more rows per block fetched from disk for an equivalent-size block, etc.  I'm currently just standing up a StoreFile and using it directly; see https://github.com/hotpads/hbase-prefix-trie/blob/hcell-scanners/test/org/apache/hadoop/hbase/cell/pt/test/performance/seek/SeekBenchmarkMain.java.  I'll try to factor in network and disk latencies in later tests (I did some preliminary tests Friday but am on vacation this week).

> How did you measure cycles per seek?
A simple assumption of 2 billion cycles/second; I was just trying to emphasize how many cycles a seek currently takes.
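
Spelling out that back-of-the-envelope arithmetic with the seeks/second figures from the chart comment above (an order-of-magnitude illustration only, ignoring core count and hyperthreading):

    // Rough cycles-per-seek arithmetic under the "2 billion cycles/second" assumption.
    public class SeekCycles {
        public static void main(String[] args) {
            long cyclesPerSecond = 2_000_000_000L;   // the simple assumption above
            double noneSeeksPerSec = 38_000;          // NONE block format, from the chart
            double trieSeeksPerSec = 1_600_000;       // TRIE encoding, from the chart
            System.out.printf("NONE: ~%.0f cycles/seek%n", cyclesPerSecond / noneSeeksPerSec); // ~52,600
            System.out.printf("TRIE: ~%.0f cycles/seek%n", cyclesPerSecond / trieSeeksPerSec); // ~1,250
        }
    }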

> What is numBlocks? The total number of blocks the dataset fit in?
The number of data blocks in the HFile I fed in.
                

[jira] [Assigned] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Zhihong Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu reassigned HBASE-4676:
---------------------------------

    Assignee: Matt Corgan
    

[jira] [Updated] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Corgan updated HBASE-4676:
-------------------------------

    Status: Open  (was: Patch Available)
    

[jira] [Updated] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Corgan updated HBASE-4676:
-------------------------------

    Attachment: HBASE-4676-prefix-tree-trunk-v1.patch

Attaching everything as a single patch: HBASE-4676-prefix-tree-trunk-v1.patch
                

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251375#comment-13251375 ] 

Matt Corgan commented on HBASE-4676:
------------------------------------

Sorry, the packaging is unusual for HBase, but I think it's worth a shot at turning this into a module rather than dumping 200+ new .java files into hbase-core.  As it stands, the dependencies are very limited.  The hbase-prefix-trie project needs to see KeyValue.java and DataBlockEncoder.java (maybe a couple of others, but not many).  Then, hbase-core instantiates a PrefixTrieDataBlockEncoder through reflection at the bottom of DataBlockEncoding.java, so hbase-core has no compile-time dependency on hbase-prefix-trie.
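
That reflection hook is what keeps the compile-time dependency out; in sketch form it could look like the following (the fully qualified class name and the surrounding method are assumptions for illustration, not the actual DataBlockEncoding.java code):

    // Hedged sketch: load the encoder reflectively so hbase-core never needs
    // hbase-prefix-trie at compile time. The FQCN below is an assumption.
    public final class EncoderLoader {
        private static final String PREFIX_TRIE_ENCODER_CLASS =
            "org.apache.hadoop.hbase.prefixtrie.PrefixTrieDataBlockEncoder";

        static Object createPrefixTrieEncoder() {
            try {
                Class<?> clazz = Class.forName(PREFIX_TRIE_ENCODER_CLASS);
                return clazz.getDeclaredConstructor().newInstance();  // assumes a no-arg constructor
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(
                    "hbase-prefix-trie is not on the classpath or is incompatible", e);
            }
        }
    }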

If HBASE-4336 makes headway, then we can extract some of the fundamental HBase classes (KeyValue, DataBlockEncoder) into hbase-common, and I can further limit the prefix-trie to depend only on these fundamental interfaces in the lightweight hbase-common module.  hbase-common will tie the modules together.  Limiting the dependencies between modules is the key to keeping development agile; otherwise small changes will trigger massive refactorings that even the most seasoned devs won't be able to pull off safely.

I added instructions to the git project homepage for checking out the prefix trie and modifying it or using it for profiling.  Just a few extra steps beyond applying a patch: https://github.com/hotpads/hbase-prefix-trie.  Pasting them here as well:

To use with Eclipse:

* check out hbase-0.94 from apache subversion to "myworkspace"
* apply the HBASE-4676-0.94-v1.patch from HBASE-4676

* cd myworkspace
* git clone -b 0.1 git@github.com:hotpads/hbase-prefix-trie.git hbase-prefix-trie-0.1
* in Eclipse -> new java project
* type "hbase-prefix-trie-0.1" and Eclipse should find it.  continue
* on the "Projects" tab, add hbase-0.94.  finish
* ctrl-shift-R to open SeekBenchmarkMain.java
* follow instructions in SeekBenchmarkMain

To build a jar after modifying code:
* edit package-hbase-prefix-trie.xml in the project root to point to your build directory
* right click it -> Run As -> Ant Build
                

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251836#comment-13251836 ] 

Matt Corgan commented on HBASE-4676:
------------------------------------

The git URL in the above comment is wrong.  Reposting the instructions with the public https endpoint (it looks like I can't edit JIRA comments).

To use with Eclipse:

* check out hbase-0.94 from apache subversion to "myworkspace"
* apply the HBASE-4676-0.94-v1.patch from HBASE-4676
* cd myworkspace
* git clone -b 0.1 https://hotpads@github.com/hotpads/hbase-prefix-trie.git hbase-prefix-trie-0.1
* in Eclipse -> new java project
* type "hbase-prefix-trie-0.1" and Eclipse should find it. continue
* on the "Projects" tab, add hbase-0.94. finish
* ctrl-shift-R to open SeekBenchmarkMain.java
* follow instructions in SeekBenchmarkMain

To build a jar after modifying code:

* edit package-hbase-prefix-trie.xml in the project root to point to your build directory
* right click it -> Run As -> Ant Build

                

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483806#comment-13483806 ] 

Matt Corgan commented on HBASE-4676:
------------------------------------

Does anyone know why TestHFileDataBlockEncoder would fail on Jenkins when it passes 100% of the time locally for me?  I believe the failing part has something to do with timestamps.  Does Jenkins do something different with the clock?
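If the failures are clock-related, one way to rule that out is to pin the time source the test sees.  A minimal sketch, assuming the code under test reads time through HBase's EnvironmentEdge hook rather than calling System.currentTimeMillis() directly; the fixed instant is arbitrary:

{code}
import org.apache.hadoop.hbase.util.EnvironmentEdge;
import org.apache.hadoop.hbase.util.EnvironmentEdgeManager;

public class FixedClockSketch {
  public static void main(String[] args) {
    // Inject a constant clock so timestamp-sensitive encoding tests behave
    // the same locally and on Jenkins.
    EnvironmentEdgeManager.injectEdge(new EnvironmentEdge() {
      @Override
      public long currentTimeMillis() {
        return 1000000000000L; // any fixed instant works
      }
    });
    try {
      // ... run the timestamp-sensitive assertions here ...
      System.out.println(EnvironmentEdgeManager.currentTimeMillis());
    } finally {
      EnvironmentEdgeManager.reset(); // restore the default clock
    }
  }
}
{code}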
                
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140704#comment-13140704 ] 

Matt Corgan commented on HBASE-4676:
------------------------------------

Write speed is slower because of cpu usage.  There's still tons of room for optimization, but slower write speed will generally be the trade-off for higher random read speed.  The current encoding algorithms on github generate a lot of garbage, which suggests they're not using the cpu cache very well.

Some data sets would be able to disable gzip/lzo compression and rely on prefix compression alone, so you save some cpu there.

I'm working on a benchmark to compare a few more factors, including compression ratio and sequential read speed.
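To illustrate the allocation pattern behind the garbage problem (this is not code from the patch): an encoder that allocates a fresh array per cell churns the young generation and keeps evicting hot cache lines, while one that reuses a scratch buffer keeps its working set resident.  A minimal sketch under those assumptions; all names are illustrative:

{code}
import java.util.Arrays;

public class ScratchBufferSketch {
  private final byte[] scratch = new byte[64]; // reused across cells

  // Allocation-heavy style: one new array per encoded cell.
  byte[] encodeAllocating(byte[] key) {
    byte[] out = new byte[key.length]; // fresh garbage on every call
    System.arraycopy(key, 0, out, 0, key.length);
    return out;
  }

  // Reuse style: the same buffer on every call, so the hot path stays
  // cache-resident and allocation-free.
  int encodeReusing(byte[] key) {
    System.arraycopy(key, 0, scratch, 0, key.length);
    return key.length; // number of bytes written into scratch
  }

  public static void main(String[] args) {
    ScratchBufferSketch s = new ScratchBufferSketch();
    byte[] key = "row-0001/fam:qual".getBytes();
    byte[] a = s.encodeAllocating(key);
    byte[] b = Arrays.copyOf(s.scratch, s.encodeReusing(key));
    System.out.println(Arrays.equals(a, b)); // true: same bytes, no per-cell garbage
  }
}
{code}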
                

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457131#comment-13457131 ] 

Ted Yu commented on HBASE-4676:
-------------------------------

@Matt:
Do you want to attach your patch(es) here for Hadoop QA to run the test suite?

Thanks
                

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439037#comment-13439037 ] 

stack commented on HBASE-4676:
------------------------------

You need review on this Matt?
                

[jira] [Updated] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Corgan updated HBASE-4676:
-------------------------------

    Attachment: PrefixTrie_Performance_v1.pdf

Attachment PrefixTrie_Performance_v1.pdf shows the current TRIE encoder's performance compared to the NONE and PREFIX encodings.
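For anyone reproducing the comparison: these encodings are selected per column family via the data block encoding setting.  A rough sketch using the HColumnDescriptor API of this era; DataBlockEncoding.PREFIX_TREE is an assumption about the constant the trie encoder eventually ships under:

{code}
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

public class EncodingChoiceSketch {
  public static void main(String[] args) {
    HTableDescriptor table = new HTableDescriptor("perf_test");
    HColumnDescriptor family = new HColumnDescriptor("d");
    // Pick one encoding per column family:
    family.setDataBlockEncoding(DataBlockEncoding.PREFIX);
    // family.setDataBlockEncoding(DataBlockEncoding.NONE);        // unencoded baseline
    // family.setDataBlockEncoding(DataBlockEncoding.PREFIX_TREE); // the trie encoder (assumed name)
    table.addFamily(family);
    System.out.println(family.getDataBlockEncoding());
  }
}
{code}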
                

[jira] [Updated] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Corgan updated HBASE-4676:
-------------------------------

    Status: Patch Available  (was: Open)
    

[jira] [Updated] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Corgan updated HBASE-4676:
-------------------------------

    Attachment: HBASE-4676-prefix-tree-trunk-v2.patch

Attaching the full patch with fixes: HBASE-4676-prefix-tree-trunk-v2.patch

                

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493840#comment-13493840 ] 

Hadoop QA commented on HBASE-4676:
----------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12552792/HBASE-4676-prefix-tree-trunk-v5.patch
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 142 new or modified tests.

    {color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 2.0 profile.

    {color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 103 warning messages.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:red}-1 findbugs{color}.  The patch appears to introduce 58 new Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

     {color:red}-1 core tests{color}.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.io.hfile.TestForceCacheImportantBlocks
                  org.apache.hadoop.hbase.TestDrainingServer

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3286//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3286//console

This message is automatically generated.
                

[jira] [Updated] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Corgan updated HBASE-4676:
-------------------------------

    Attachment: HBASE-4676-common-and-server-v8.patch

HBASE-4676-common-and-server-v8.patch contains the changes to the hbase-common and hbase-server modules.  KeyValue now implements the new Cell interface.
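For readers following along, the point of the Cell abstraction is that scanners read through accessor methods into whatever backing store the block format provides, so a trie-decoded cell never has to be re-materialized as a contiguous KeyValue byte[].  A sketch of the shape of such an interface; the accessor names follow the eventual Cell interface but the real one has more methods:

{code}
// Sketch only -- the patch's actual Cell interface is larger; accessor names
// here are illustrative.  Each field is exposed as (array, offset, length)
// into a backing buffer instead of one contiguous KeyValue layout.
public interface CellSketch {
  byte[] getRowArray();        int getRowOffset();        short getRowLength();
  byte[] getFamilyArray();     int getFamilyOffset();     byte getFamilyLength();
  byte[] getQualifierArray();  int getQualifierOffset();  int getQualifierLength();
  long getTimestamp();
  byte getTypeByte();          // the 1-byte operationType from the block format
  byte[] getValueArray();      int getValueOffset();      int getValueLength();
}
{code}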
                

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13475994#comment-13475994 ] 

Hadoop QA commented on HBASE-4676:
----------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12549106/HBASE-4676-prefix-tree-trunk-v1.patch
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 139 new or modified tests.

    {color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 2.0 profile.

    {color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 99 warning messages.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:red}-1 findbugs{color}.  The patch appears to introduce 53 new Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

     {color:red}-1 core tests{color}.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.io.encoding.TestChangingEncoding
                  org.apache.hadoop.hbase.regionserver.TestStore
                  org.apache.hadoop.hbase.io.encoding.TestDataBlockEncoders

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3052//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3052//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3052//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3052//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3052//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3052//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3052//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3052//console

This message is automatically generated.

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287829#comment-13287829 ] 

Matt Corgan commented on HBASE-4676:
------------------------------------

A little more detail... it would help where you have long qualifiers but only a few of them, like if you migrated a narrow relational DB table over.  If you have wide rows with long qualifiers, then you would want to take advantage of the qualifier trie.  Not sure which case you have, but migrating relational-style tables over will be pretty common, so I wanted to handle that common case well.

If you get a chance to do a random read benchmark on it I'd love to hear the results.  I've only done a few small benchmarks at the Store level and haven't benchmarked a whole cluster.

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139472#comment-13139472 ] 

stack commented on HBASE-4676:
------------------------------

bq. ....and the TRIE encoding can do ~1.6mm. So 40-80x faster seeks at medium block size.

That's a pretty crazy difference, Matt.

The write speed is slower because it's burning CPU doing the 'compression', right?

So TRIE takes up 1/4 the size.  Did you say above how much smaller a PREFIX compressed block is than a NONE compressed block?

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439090#comment-13439090 ] 

Matt Corgan commented on HBASE-4676:
------------------------------------

Barely been able to touch it this whole summer, but I spent a little time in July rebasing to trunk and reorganizing for readability.  It's on github on a branch called "prefix-tree" (I renamed it from "trie" since people wince when pronouncing it).  https://github.com/hotpads/hbase/tree/prefix-tree

What's up there is pretty good and workable, but I still have some things to do and some questions to ask about things like comparators, sequenceIds, and ROOT and META handling.  It also needs more documentation.  I could use some high-level feedback if you can get anything out of it without a lot of docs/comments.  Maybe look at the things in the hbase-common module to start, since that will affect overall hbase?  The prefix-trie module contains more implementation details that only matter when you enable this specific encoding.

btw - the current version has an "HCell" interface which I will rename to Cell(?) after seeing the recent discussion about HStore naming.

Happy to post on reviewboard as well, but might have to wait till the weekend.

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483864#comment-13483864 ] 

Ted Yu commented on HBASE-4676:
-------------------------------

I tried to run the test suite on the following platform and got the exact same test failure:
{code}
Linux s0 2.6.38-11-generic #48-Ubuntu SMP Fri Jul 29 19:02:55 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
{code}

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483831#comment-13483831 ] 

Ted Yu commented on HBASE-4676:
-------------------------------

I ran TestHFileDataBlockEncoder on a MacBook 29 times and on Linux a few times; the test didn't fail.
Looking at the stack trace provided by Hadoop QA, it looks like the NPE was caused by this code in TimestampEncoder.java:
{code}
    // should always find an exact match
    return Arrays.binarySearch(sortedUniqueTimestamps, timestamp);
{code}
Meaning sortedUniqueTimestamps was null.
Can you guard against this case?
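
For example, a guard might look something like this (a sketch only; the method name and the surrounding TimestampEncoder internals are assumptions, since only the two lines above come from the actual class):
{code}
// Hypothetical wrapper around the lookup above: fail with a clear message
// instead of letting Arrays.binarySearch throw a bare NullPointerException.
private int findTimestampIndex(long timestamp) {
  if (sortedUniqueTimestamps == null || sortedUniqueTimestamps.length == 0) {
    throw new IllegalStateException(
        "sortedUniqueTimestamps not initialized before lookup of " + timestamp);
  }
  // should always find an exact match
  return Arrays.binarySearch(sortedUniqueTimestamps, timestamp);
}
{code}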

Thanks

[jira] [Updated] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Corgan updated HBASE-4676:
-------------------------------

    Attachment: hbase-prefix-trie-0.1.jar
                HBASE-4676-0.94-v1.patch

Attached 2 files for anyone who would like to experiment with this.  HBASE-4676-0.94-v1.patch should apply to the 0.94 branch; put hbase-prefix-trie-0.1.jar into your deployment's lib directory, restart hbase, and change a CF's DATA_BLOCK_ENCODING=>TRIE (for example, as shown below).
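
From the hbase shell, that change might look like the following (a sketch only; the table and family names are hypothetical, and it assumes the patched build accepts TRIE as an encoding name):
{code}
hbase> disable 'mytable'
hbase> alter 'mytable', {NAME => 'cf1', DATA_BLOCK_ENCODING => 'TRIE'}
hbase> enable 'mytable'
{code}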

The Java files are included in the jar, and I've tagged this version as v0.1 at https://github.com/hotpads/hbase-prefix-trie/tags

See https://github.com/hotpads/hbase-prefix-trie/blob/hcell-scanners/test/org/apache/hadoop/hbase/cell/pt/test/performance/seek/SeekBenchmarkMain.java to run a profiler on HFiles from your own cluster.
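Launching it would presumably look something like this (a hypothetical invocation; the actual arguments depend on SeekBenchmarkMain's main method, which is not shown here):
{code}
# Hypothetical: put the prefix-trie jar on the HBase classpath and run the
# benchmark's main class against HFiles copied from your cluster.
java -cp hbase-prefix-trie-0.1.jar:$(hbase classpath) \
    org.apache.hadoop.hbase.cell.pt.test.performance.seek.SeekBenchmarkMain
{code}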

Any feedback or questions are welcome.

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135600#comment-13135600 ] 

Matt Corgan commented on HBASE-4676:
------------------------------------

The attached chart "SeeksPerSec by blockSize.png" is for a table with a 4-byte key and a small value.  It is a pretty extreme case that shows the strength of the trie in reading and its weakness in writing.

Here is a comparison from a table we use internally for counters:

{code}
table name: Count5s
raw KeyValue blockSize: 256KB
encoding savings: ~4x
encoded blockSize: ~64KB
numReaderThreads: 1
pread: false
numCells: 1339210
avgKeyBytes: 85
avgValueBytes: 9
raw data size: 120MB
encoded data size: ~40MB (~5x L3 cache size)

encoding,sequentialMB/s,seeks/s
NONE,886.67,23604
PREFIX,404.3,15543
TRIE,25.63,340613
{code}

You can see the trie only encodes at 25MB/s, but I think this could be more like 200MB/s with some work.  Seek speed is ~15x the NONE seek speed (340,613 vs 23,604 seeks/s) while expanding the effective block cache size by 4x.

Another advantage is that the trie should be able to operate on DirectByteBuffers without copying them to the heap first.  It would need a separate implementation of PtArrayBlockSearcher, called PtBufferBlockSearcher, with the same code structure but with all the bytes[i] (and similar) calls replaced by buffer.get(i) calls; the two access patterns are sketched below.
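
A rough sketch of those two access patterns (hypothetical interface and class bodies; only the PtArrayBlockSearcher/PtBufferBlockSearcher names come from the paragraph above):
{code}
import java.nio.ByteBuffer;

// Hypothetical illustration: the same byte-fetch contract backed either by a
// heap byte[] or by a (possibly direct) ByteBuffer.
interface BlockBytes {
  byte get(int i);
}

class ArrayBlockBytes implements BlockBytes {      // PtArrayBlockSearcher style
  private final byte[] bytes;
  ArrayBlockBytes(byte[] bytes) { this.bytes = bytes; }
  public byte get(int i) { return bytes[i]; }      // heap access: bytes[i]
}

class BufferBlockBytes implements BlockBytes {     // PtBufferBlockSearcher style
  private final ByteBuffer buffer;                 // may be a DirectByteBuffer
  BufferBlockBytes(ByteBuffer buffer) { this.buffer = buffer; }
  public byte get(int i) { return buffer.get(i); } // no copy to the heap
}
{code}
Note that the paragraph above proposes duplicating the searcher class rather than abstracting the byte access behind an interface as sketched here, presumably to avoid a virtual call on every byte read.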

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285830#comment-13285830 ] 

Matt Corgan commented on HBASE-4676:
------------------------------------

Jacques - I like the idea, and I think the long-term goal is to auto-tune some of this stuff.  For example, if flushes or compactions are backlogged, that may be a good signal for certain tables to temporarily switch from the more compact Trie/GZip encoding/compression to the faster-encoding Prefix/Snappy.  There are other factors at play, though, on a table-by-table basis.  Some tables might be random-read heavy, in which case you would want to keep the Trie.

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496865#comment-13496865 ] 

Hadoop QA commented on HBASE-4676:
----------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12553437/HBASE-4676-prefix-tree-trunk-v7.patch
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 142 new or modified tests.

    {color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 2.0 profile.

    {color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 109 warning messages.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:red}-1 findbugs{color}.  The patch appears to introduce 59 new Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

     {color:red}-1 core tests{color}.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3332//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3332//console

This message is automatically generated.
                
> Prefix Compression - Trie data block encoding
> ---------------------------------------------
>
>                 Key: HBASE-4676
>                 URL: https://issues.apache.org/jira/browse/HBASE-4676
>             Project: HBase
>          Issue Type: New Feature
>          Components: io, Performance, regionserver
>    Affects Versions: 0.96.0
>            Reporter: Matt Corgan
>            Assignee: Matt Corgan
>         Attachments: HBASE-4676-0.94-v1.patch, HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, HBASE-4676-prefix-tree-trunk-v7.patch, hbase-prefix-trie-0.1.jar, PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png


[jira] [Updated] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Corgan updated HBASE-4676:
-------------------------------

    Attachment: HBASE-4676-prefix-tree-trunk-v5.patch

HBASE-4676-prefix-tree-trunk-v5.patch adds a debug message to the NPE thrown by the previous patch.  The test in question passes locally.
                
> Prefix Compression - Trie data block encoding
> ---------------------------------------------
>
>                 Key: HBASE-4676
>                 URL: https://issues.apache.org/jira/browse/HBASE-4676
>             Project: HBase
>          Issue Type: New Feature
>          Components: io, Performance, regionserver
>    Affects Versions: 0.96.0
>            Reporter: Matt Corgan
>            Assignee: Matt Corgan
>         Attachments: HBASE-4676-0.94-v1.patch, HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, HBASE-4676-prefix-tree-trunk-v5.patch, hbase-prefix-trie-0.1.jar, PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "James Taylor (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287766#comment-13287766 ] 

James Taylor commented on HBASE-4676:
-------------------------------------

Our qualifiers tend to be long, so trie-encoding them helps quite a bit. We could optimize this ourselves by managing our own mapping, but it's great that the trie-encoding does it for us. We haven't seen this impact our scan times.
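
A minimal sketch of the kind of application-level mapping James alludes to: keep a dictionary of short codes for long qualifier names and translate at read/write time.  This is a hypothetical illustration of the workaround, not code from the patch; with trie-encoded qualifiers in the block format, none of this bookkeeping is needed.

{code:java}
// Hypothetical application-side workaround: map long qualifier names to
// short byte codes before writing so the stored qualifiers stay small.
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hbase.util.Bytes;

public class QualifierDictionary {
  private final Map<String, byte[]> longToShort = new HashMap<String, byte[]>();
  private final Map<String, String> shortToLong = new HashMap<String, String>();

  public void register(String longName, String shortCode) {
    longToShort.put(longName, Bytes.toBytes(shortCode));
    shortToLong.put(shortCode, longName);
  }

  /** Qualifier to store in HBase; falls back to the long name if unmapped. */
  public byte[] encode(String longName) {
    byte[] code = longToShort.get(longName);
    return code != null ? code : Bytes.toBytes(longName);
  }

  /** Long name to show the application; falls back to the raw qualifier. */
  public String decode(byte[] storedQualifier) {
    String longName = shortToLong.get(Bytes.toString(storedQualifier));
    return longName != null ? longName : Bytes.toString(storedQualifier);
  }
}
{code}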
                
> Prefix Compression - Trie data block encoding
> ---------------------------------------------
>
>                 Key: HBASE-4676
>                 URL: https://issues.apache.org/jira/browse/HBASE-4676
>             Project: HBase
>          Issue Type: New Feature
>          Components: io, performance, regionserver
>    Affects Versions: 0.90.6
>            Reporter: Matt Corgan
>            Assignee: Matt Corgan
>         Attachments: HBASE-4676-0.94-v1.patch, PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png, hbase-prefix-trie-0.1.jar


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287637#comment-13287637 ] 

Matt Corgan commented on HBASE-4676:
------------------------------------

Great to hear, James.  I'm working on migrating it to trunk, but wrestling with git and maven, my two worst friends.

The last change I've been thinking about making before finalizing this version is an option controlling whether to trie-encode the qualifiers versus merely de-duplicating them.  There's some read-time expense to decoding the trie-encoded qualifiers, and if you have a narrow table you may not be saving much memory anyway.  So it would be an option to trade a little memory for faster scans/decoding.
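
For the archive, a sketch of how a user would opt in once this lands, assuming the encoding is exposed through the existing DataBlockEncoding enum.  The PREFIX_TREE constant, and especially the qualifier option, are assumptions here, not committed API:

{code:java}
// Sketch only: enabling the encoding on a column family, assuming it is
// exposed through the existing DataBlockEncoding enum as PREFIX_TREE.
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

public class EnablePrefixTrie {
  public static HTableDescriptor tableWithPrefixTrie() {
    HTableDescriptor table = new HTableDescriptor("mytable");
    HColumnDescriptor family = new HColumnDescriptor("f");
    family.setDataBlockEncoding(DataBlockEncoding.PREFIX_TREE); // assumed constant
    // The option discussed above (trie-encode vs. merely de-dupe qualifiers)
    // does not exist yet; it would presumably be another per-family setting.
    table.addFamily(family);
    return table;
  }
}
{code}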
                
> Prefix Compression - Trie data block encoding
> ---------------------------------------------
>
>                 Key: HBASE-4676
>                 URL: https://issues.apache.org/jira/browse/HBASE-4676
>             Project: HBase
>          Issue Type: New Feature
>          Components: io, performance, regionserver
>    Affects Versions: 0.90.6
>            Reporter: Matt Corgan
>            Assignee: Matt Corgan
>         Attachments: HBASE-4676-0.94-v1.patch, PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png, hbase-prefix-trie-0.1.jar


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494579#comment-13494579 ] 

Hadoop QA commented on HBASE-4676:
----------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12552931/HBASE-4676-prefix-tree-trunk-v6.patch
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 142 new or modified tests.

    {color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 2.0 profile.

    {color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 103 warning messages.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:red}-1 findbugs{color}.  The patch appears to introduce 60 new Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3299//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3299//console

This message is automatically generated.
                
> Prefix Compression - Trie data block encoding
> ---------------------------------------------
>
>                 Key: HBASE-4676
>                 URL: https://issues.apache.org/jira/browse/HBASE-4676
>             Project: HBase
>          Issue Type: New Feature
>          Components: io, Performance, regionserver
>    Affects Versions: 0.96.0
>            Reporter: Matt Corgan
>            Assignee: Matt Corgan
>         Attachments: HBASE-4676-0.94-v1.patch, HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, hbase-prefix-trie-0.1.jar, PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png


[jira] [Updated] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Corgan updated HBASE-4676:
-------------------------------

    Affects Version/s:     (was: 0.90.6)
                       0.96.0
    
> Prefix Compression - Trie data block encoding
> ---------------------------------------------
>
>                 Key: HBASE-4676
>                 URL: https://issues.apache.org/jira/browse/HBASE-4676
>             Project: HBase
>          Issue Type: New Feature
>          Components: io, Performance, regionserver
>    Affects Versions: 0.96.0
>            Reporter: Matt Corgan
>            Assignee: Matt Corgan
>         Attachments: HBASE-4676-0.94-v1.patch, HBASE-4676-prefix-tree-trunk-v1.patch, hbase-prefix-trie-0.1.jar, PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Jacques (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285793#comment-13285793 ] 

Jacques commented on HBASE-4676:
--------------------------------

Random thought on this.  I was talking to Nicolas Spiegelberg, and he mentioned they are exploring changing encoding schemes as data becomes more permanent.  Don't quote me on this, but it sounded like they were considering using gzip compression on major compactions.  I was wondering if something similar made sense here: basically, use less compression the shorter the likely lifetime of the files, and apply the trie compression only on major compactions.  That way the performance cost would be less of an issue, since you're paying for it fewer times.  I glanced around and didn't see any JIRAs about a two-tiered approach to data storage formats, but it seems like that would be a prerequisite for a hybrid/tiered approach.
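
Worth noting: HBase already exposes a related knob one level down, at the compression (not encoding) layer, where a family can use a different algorithm for compacted files than for fresh flushes.  A sketch against the 0.94-era API (class locations are from that era and may have moved since):

{code:java}
// Existing compression-level analogue of the tiered idea: cheap or no
// compression for short-lived flushed files, heavier gzip after compactions.
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class TieredCompressionExample {
  public static HColumnDescriptor family() {
    HColumnDescriptor family = new HColumnDescriptor("f");
    family.setCompressionType(Compression.Algorithm.NONE);          // flushes
    family.setCompactionCompressionType(Compression.Algorithm.GZ);  // compactions
    return family;
  }
}
{code}

A two-tiered choice of block *encoding*, as suggested above, would need an analogous pair of per-family settings.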
                
> Prefix Compression - Trie data block encoding
> ---------------------------------------------
>
>                 Key: HBASE-4676
>                 URL: https://issues.apache.org/jira/browse/HBASE-4676
>             Project: HBase
>          Issue Type: New Feature
>          Components: io, performance, regionserver
>    Affects Versions: 0.90.6
>            Reporter: Matt Corgan
>            Assignee: Matt Corgan
>         Attachments: HBASE-4676-0.94-v1.patch, PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png, hbase-prefix-trie-0.1.jar


[jira] [Updated] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Corgan updated HBASE-4676:
-------------------------------

    Attachment: PrefixTrie_Format_v1.pdf

Attachment PrefixTrie_Format_v1.pdf illustrates what the encoding looks like for a simple example with URL row keys.
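
For readers without the PDF handy, the row-trie idea is easy to see with a few made-up URL-style keys (these example keys are mine, not taken from the attachment):

{code}
row keys:  com.example/aaa
           com.example/abc
           com.example/abd

row trie:  com.example/a        <- shared prefix stored once
             +- aa              -> row 0's cell list
             +- b
                 +- c           -> row 1's cell list
                 +- d           -> row 2's cell list
{code}

Each leaf carries the fixed-width cell entries for its row, so the shared prefix bytes appear exactly once per block.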
                
> Prefix Compression - Trie data block encoding
> ---------------------------------------------
>
>                 Key: HBASE-4676
>                 URL: https://issues.apache.org/jira/browse/HBASE-4676
>             Project: HBase
>          Issue Type: New Feature
>          Components: io, performance, regionserver
>    Affects Versions: 0.90.6
>            Reporter: Matt Corgan
>         Attachments: PrefixTrie_Format_v1.pdf, SeeksPerSec by blockSize.png


[jira] [Updated] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Corgan updated HBASE-4676:
-------------------------------

    Attachment: HBASE-4676-prefix-tree-trunk-v7.patch

Attaching the latest patch, v7, which addresses review comments and contains more docs.
                
> Prefix Compression - Trie data block encoding
> ---------------------------------------------
>
>                 Key: HBASE-4676
>                 URL: https://issues.apache.org/jira/browse/HBASE-4676
>             Project: HBase
>          Issue Type: New Feature
>          Components: io, Performance, regionserver
>    Affects Versions: 0.96.0
>            Reporter: Matt Corgan
>            Assignee: Matt Corgan
>         Attachments: HBASE-4676-0.94-v1.patch, HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, HBASE-4676-prefix-tree-trunk-v7.patch, hbase-prefix-trie-0.1.jar, PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png


[jira] [Updated] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Corgan updated HBASE-4676:
-------------------------------

    Attachment: HBASE-4676-prefix-tree-trunk-v4.patch

Latest patch with hbase-common modifications from Stack's reviews: https://reviews.apache.org/r/7589/
                
> Prefix Compression - Trie data block encoding
> ---------------------------------------------
>
>                 Key: HBASE-4676
>                 URL: https://issues.apache.org/jira/browse/HBASE-4676
>             Project: HBase
>          Issue Type: New Feature
>          Components: io, Performance, regionserver
>    Affects Versions: 0.96.0
>            Reporter: Matt Corgan
>            Assignee: Matt Corgan
>         Attachments: HBASE-4676-0.94-v1.patch, HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, hbase-prefix-trie-0.1.jar, PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497561#comment-13497561 ] 

stack commented on HBASE-4676:
------------------------------

I'm up for committing the hbase-common and hbase-server changes that are up in RB.  Attach the patch here or, rather, to HBASE-7162, an issue created just for these changes.
                

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483041#comment-13483041 ] 

Hadoop QA commented on HBASE-4676:
----------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12550594/HBASE-4676-prefix-tree-trunk-v2.patch
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 143 new or modified tests.

    {color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 2.0 profile.

    {color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 99 warning messages.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:red}-1 findbugs{color}.  The patch appears to introduce 49 new Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

     {color:red}-1 core tests{color}.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.io.hfile.TestHFileDataBlockEncoder

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3134//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3134//console


[jira] [Updated] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Corgan updated HBASE-4676:
-------------------------------

    Attachment: HBASE-4676-prefix-tree-trunk-v3.patch

HBASE-4676-prefix-tree-trunk-v3.patch fixes a memstoreTS-related bug.  Submitting this patch to see whether TestHFileDataBlockEncoder still fails in Jenkins (it passes locally for me).
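
To rerun just the suspect test locally, the standard Maven single-test invocation should work (a sketch; assumes a normal trunk checkout with the surefire defaults):
{code}
# run only TestHFileDataBlockEncoder
mvn test -Dtest=TestHFileDataBlockEncoder
{code}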
                

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "James Taylor (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287587#comment-13287587 ] 

James Taylor commented on HBASE-4676:
-------------------------------------

This is fantastic, Matt. We're testing your patch on 0.94 in our dev cluster here at Salesforce and are seeing 5-15x compression with no degradation on scan performance.
                

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447427#comment-13447427 ] 

Matt Corgan commented on HBASE-4676:
------------------------------------

ReviewBoard links:

hbase-common: https://reviews.apache.org/r/6897/
hbase-prefix-tree: https://reviews.apache.org/r/6898/

These reviews exclude tests for now.
                

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485872#comment-13485872 ] 

Matt Corgan commented on HBASE-4676:
------------------------------------

For the record, this is now on branch prefix-tree2 on GitHub:
{code}
git clone -b prefix-tree2 https://hotpads@github.com/hotpads/hbase.git hbase-prefix-tree
{code}
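
From that checkout, a plain Maven build should be enough to produce jars for experimentation (a sketch; skipping tests just to speed up the first build):
{code}
cd hbase-prefix-tree
mvn clean install -DskipTests
{code}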
                

[jira] [Updated] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Corgan updated HBASE-4676:
-------------------------------

          Component/s: regionserver
                       performance
          Description: updated the github links at the end of the description to:

https://github.com/hotpads/hbase/tree/prefix-trie-1
https://github.com/hotpads/hbase-prefix-trie/tree/hcell-scanners

  was:

https://github.com/hotpads/hbase/tree/delta-encoding-plus-trie
https://github.com/hotpads/hbase-prefix-trie

    Affects Version/s:     (was: 0.94.0)
                       0.90.6
    

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496917#comment-13496917 ] 

stack commented on HBASE-4676:
------------------------------

Looking in v7 patch attached here:

Should this be in the patch, Matt?  b/bin/index-builder-setup.rb

Did you mean to include the hbase-prefix-tree module in this hbase-common patch, i.e. b/hbase-prefix-tree/pom.xml?  I've not reviewed the prefixtree stuff.  Let's get the hbase-common changes in first (including the removal of classes moved from hbase-server to hbase-common).

Or, if you think v7 addresses everything raised on RB, I could commit only the bits of the v7 patch that touch hbase-common and hbase-server.

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497577#comment-13497577 ] 

Hadoop QA commented on HBASE-4676:
----------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12553580/HBASE-4676-common-and-server-v8.patch
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 24 new or modified tests.

    {color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 2.0 profile.

    {color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 99 warning messages.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:red}-1 findbugs{color}.  The patch appears to introduce 27 new Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

     {color:red}-1 core tests{color}.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.TestCompare
                  org.apache.hadoop.hbase.io.hfile.TestHFileReaderV1
                  org.apache.hadoop.hbase.io.hfile.TestHFileBlockCompatibility
                  org.apache.hadoop.hbase.io.hfile.TestLruBlockCache
                  org.apache.hadoop.hbase.regionserver.TestStoreFile
                  org.apache.hadoop.hbase.TestHTableDescriptor
                  org.apache.hadoop.hbase.io.TestHeapSize
                  org.apache.hadoop.hbase.TestHColumnDescriptor
                  org.apache.hadoop.hbase.master.TestCatalogJanitor
                  org.apache.hadoop.hbase.io.hfile.TestHFileDataBlockEncoder
                  org.apache.hadoop.hbase.regionserver.TestColumnSeeking
                  org.apache.hadoop.hbase.regionserver.TestRSStatusServlet
                  org.apache.hadoop.hbase.regionserver.TestScanner
                  org.apache.hadoop.hbase.regionserver.TestSplitTransaction
                  org.apache.hadoop.hbase.regionserver.TestHBase7051
                  org.apache.hadoop.hbase.coprocessor.TestCoprocessorInterface
                  org.apache.hadoop.hbase.TestFSTableDescriptorForceCreation
                  org.apache.hadoop.hbase.io.hfile.TestHFileInlineToRootChunkConversion
                  org.apache.hadoop.hbase.regionserver.TestBlocksScanned
                  org.apache.hadoop.hbase.regionserver.TestKeepDeletes
                  org.apache.hadoop.hbase.filter.TestDependentColumnFilter
                  org.apache.hadoop.hbase.regionserver.TestResettingCounters
                  org.apache.hadoop.hbase.TestSerialization
                  org.apache.hadoop.hbase.coprocessor.TestRegionObserverStacking
                  org.apache.hadoop.hbase.io.TestHalfStoreFileReader
                  org.apache.hadoop.hbase.io.hfile.TestSeekTo
                  org.apache.hadoop.hbase.regionserver.TestScanWithBloomError
                  org.apache.hadoop.hbase.regionserver.wal.TestWALActionsListener
                  org.apache.hadoop.hbase.regionserver.TestRegionSplitPolicy
                  org.apache.hadoop.hbase.io.hfile.TestCachedBlockQueue
                  org.apache.hadoop.hbase.regionserver.TestHRegionInfo
                  org.apache.hadoop.hbase.regionserver.TestCompactSelection
                  org.apache.hadoop.hbase.constraint.TestConstraints
                  org.apache.hadoop.hbase.filter.TestColumnPrefixFilter
                  org.apache.hadoop.hbase.io.hfile.TestHFile
                  org.apache.hadoop.hbase.filter.TestMultipleColumnPrefixFilter
                  org.apache.hadoop.hbase.regionserver.TestMinVersions
                  org.apache.hadoop.hbase.rest.model.TestTableRegionModel
                  org.apache.hadoop.hbase.regionserver.TestWideScanner
                  org.apache.hadoop.hbase.client.TestIntraRowPagination
                  org.apache.hadoop.hbase.io.hfile.TestReseekTo
                  org.apache.hadoop.hbase.filter.TestFilter

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3341//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3341//console

This message is automatically generated.

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238645#comment-13238645 ] 

stack commented on HBASE-4676:
------------------------------

bq. The KeyValueScanner and related classes that iterate through the trie will eventually need to be smarter and have methods to do things like skipping to the next row of results without scanning every cell in between.

Do we not have this now, Matt?

As said before, 'So 40-80x faster seeks at medium block size' is a pretty nice improvement.  Ditto on 'Seek speed is ~15x NONE seek speed while expanding the effective block cache size by 4x.'

What are you thinking about these times, Matt?  The slow write and scan speeds would tend to rule it out for anything but a random-read workload, but when random reading only, you'll want to use prefixtrie (smile).  Do you think it possible to approach PREFIX-encoding scan and write speeds?

Regarding your working-set read testing, did it all fit in memory?

How did you measure cycles per seek?

What is numBlocks?  The total number of blocks the dataset fits in?

Throughput is better for sure... what is latency like?

Excellent stuff, Matt.


[jira] [Updated] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Corgan updated HBASE-4676:
-------------------------------

    Attachment: HBASE-4676-prefix-tree-trunk-v6.patch

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141676#comment-13141676 ] 

Matt Corgan commented on HBASE-4676:
------------------------------------

More thorough benchmark shared here: https://docs.google.com/spreadsheet/ccc?key=0Ah5tKh7-sVXYdGkyWm9WenhSRzVfd2U5VC1XNF9tekE

cell format: Key[int,int,string,string]Value[VInt]

avg raw key bytes: 52
avg raw value bytes: 1
avg cells/row: ~10
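
For concreteness, one way a key like that might be composed with HBase's Bytes utility -- the field names here are made up for illustration, not the actual benchmark schema:
{code}
// Illustrative only: composing a Key[int,int,string,string]-style row key.
// Field names are hypothetical; two 4-byte ints plus two short strings
// account for most of the ~52 raw key bytes mentioned above.
import org.apache.hadoop.hbase.util.Bytes;

public class BenchmarkKey {
  public static byte[] makeRowKey(int userId, int siteId, String page, String session) {
    return Bytes.add(
        Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(siteId)),
        Bytes.add(Bytes.toBytes(page), Bytes.toBytes(session)));
  }

  public static void main(String[] args) {
    byte[] key = makeRowKey(42, 7, "homepage", "3f9c2a");
    System.out.println("row key length: " + key.length); // 4 + 4 + 8 + 6 = 22 here
  }
}
{code}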

The TRIE compressor gives ~6x compression when all the timestamps are the same.  For this table we would set the raw block size at 256KB or 1MB, which gives encoded sizes of 39KB or 178KB (~6.6x and ~5.8x respectively).

The PREFIX compressor is only doing 2x compression on this test because, as far as I can tell, it doesn't get too involved with qualifiers or timestamps.
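
For reference, the core trick in PREFIX-style encoding is roughly the following -- a minimal sketch of prefix-delta encoding, not HBase's actual encoder:
{code}
// Minimal sketch of prefix-delta encoding: each key is stored as the
// length of the prefix it shares with the previous key plus the
// remaining suffix. Qualifier/timestamp handling is omitted here.
import java.util.Arrays;

public class PrefixDeltaSketch {
  static int commonPrefixLength(byte[] prev, byte[] cur) {
    int n = Math.min(prev.length, cur.length), i = 0;
    while (i < n && prev[i] == cur[i]) i++;
    return i;
  }

  public static void main(String[] args) {
    byte[] prev = "row0001/colA".getBytes();
    byte[] cur = "row0001/colB".getBytes();
    int shared = commonPrefixLength(prev, cur);                   // 11 shared bytes
    byte[] suffix = Arrays.copyOfRange(cur, shared, cur.length);  // "B"
    System.out.println(shared + " shared + suffix '" + new String(suffix) + "'");
  }
}
{code}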

----
Encoding MB/s are:
NONE: 1000
PREFIX: 275
TRIE: 15

That is purely CPU-bound, since everything is in my Linux page cache.  TRIE encoding could be improved quite a bit, but will likely stay slower than the others.  If the scanners/comparators are improved down the road, though, it may speed up compaction significantly.

-----
Scanning MB/s are:
NONE: 3700
PREFIX: 1800
TRIE: 800

TRIE is slower for a scan through all cells because of more complex decoding.  However, it could be much faster for things like row counting because it can quickly jump from one row to the next without iterating the cells in between.
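
To make the row-jumping idea concrete, here is the kind of scanner method I have in mind -- purely hypothetical names, nothing like this exists in KeyValueScanner yet:
{code}
// Hypothetical extension to the scanner interface; the method name is
// a placeholder for the row-skipping idea described above.
public interface RowSkippingScanner {

  /**
   * Position the scanner at the first cell of the next row without
   * decoding or comparing the remaining cells of the current row.
   * For a trie-backed block this is a step through the row trie
   * rather than a cell-by-cell scan. Returns false at end of block.
   */
  boolean skipToNextRow();
}
{code}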

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13475934#comment-13475934 ] 

Matt Corgan commented on HBASE-4676:
------------------------------------

I have all known bugs worked out.  All tests are passing, including those internal to the hbase-prefix-tree module and the existing DataBlockEncoding tests in hbase-server, such as TestEncodedSeekers.

Split into 3 patches for ReviewBoard:
https://reviews.apache.org/r/7589/
https://reviews.apache.org/r/7591/
https://reviews.apache.org/r/7592/

The prefix-tree branch on my github repo contains the latest from trunk as of this comment.
http://github.com/hotpads/hbase/tree/prefix-tree
{code}
git clone -b prefix-tree https://hotpads@github.com/hotpads/hbase.git hbase-prefix-tree
{code}
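
From there a normal trunk Maven build should work (nothing special is required, as far as I know):
{code}
cd hbase-prefix-tree
mvn clean install -DskipTests
{code}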

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466624#comment-13466624 ] 

Ted Yu commented on HBASE-4676:
-------------------------------

[~mikhail]: Can you provide your insight on this feature, please?

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476011#comment-13476011 ] 

Matt Corgan commented on HBASE-4676:
------------------------------------

Hmm.  I didn't think I changed anything since those tests passed.  Will look into it.

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251272#comment-13251272 ] 

Zhihong Yu commented on HBASE-4676:
-----------------------------------

Thanks for the update, Matt.

PrefixTrieDataBlockEncoder is currently only in the jar.
If the PrefixTrie-related Java classes were included in the patch, that would make understanding your implementation easier.

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493779#comment-13493779 ] 

Hadoop QA commented on HBASE-4676:
----------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12552778/HBASE-4676-prefix-tree-trunk-v4.patch
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 142 new or modified tests.

    {color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 2.0 profile.

    {color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 103 warning messages.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:red}-1 findbugs{color}.  The patch appears to introduce 58 new Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

     {color:red}-1 core tests{color}.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.io.hfile.TestHFileDataBlockEncoder

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3285//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3285//console

This message is automatically generated.

[jira] [Updated] (HBASE-4676) Prefix Compression - Trie data block encoding

Posted by "Matt Corgan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Corgan updated HBASE-4676:
-------------------------------

    Status: Patch Available  (was: Open)