You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@accumulo.apache.org by "Christopher Tubbs (JIRA)" <ji...@apache.org> on 2012/10/05 23:20:03 UTC

[jira] [Created] (ACCUMULO-790) RFile should compress using common prefixes of RowIDs

Christopher Tubbs created ACCUMULO-790:
------------------------------------------

             Summary: RFile should compress using common prefixes of RowIDs
                 Key: ACCUMULO-790
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-790
             Project: Accumulo
          Issue Type: Improvement
          Components: tserver
            Reporter: Christopher Tubbs
            Assignee: Keith Turner
             Fix For: 1.5.0


Relative keys have proven themselves as a great way to compress within dimensions of the key. However, we could probably do better, since we know that our data is sorted lexicographically, we can make a reasonable assumption that we will get better compression if we only store the fact that a key (or portion of a key) has a common prefix with the previous key, even if it is not an exact match.

Currently, in RFile, unused bits from the delete flag byte are being used to store flags that show whether an element of the key is exactly the same as the previous, or if it is different. We can change the semantics of these flags to store three states per element of the key: exact match as previous key, has a common prefix as previous key, no relative key compression. If we don't want to add a byte to store 2 bits for 3 states per element, we can just take the ordinal value of the unused 7 bits of the delete flag field and map it to an enumeration of relative key flags.

In the case of a common prefix flag enabled for a given element of the current key when reading the RFile, we can interpret the first bytes of that element as a VInt expressing the length of the common prefix relative to the previous key's same element. Because this will add at least one byte to the the length of that element, we will not want to use the common prefix compression if the common prefix is less than 2 bytes. For less than 2 bytes in common (1 or 0 bytes in common), we'd select the no compression flag for that element.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ACCUMULO-790) RFile should compress using common prefixes of key elements

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501306#comment-13501306 ] 

Hudson commented on ACCUMULO-790:
---------------------------------

Integrated in Accumulo-Trunk #560 (See [https://builds.apache.org/job/Accumulo-Trunk/560/])
    ACCUMULO-790 Added prefix compression for relative key files, per key component (Revision 1411745)

     Result = UNSTABLE
ctubbsii : 
Files : 
* /accumulo/trunk/.gitignore
* /accumulo/trunk/core/src/main/java/org/apache/accumulo/core/file/rfile/BlockIndex.java
* /accumulo/trunk/core/src/main/java/org/apache/accumulo/core/file/rfile/MultiLevelIndex.java
* /accumulo/trunk/core/src/main/java/org/apache/accumulo/core/file/rfile/RFile.java
* /accumulo/trunk/core/src/main/java/org/apache/accumulo/core/file/rfile/RelativeKey.java
* /accumulo/trunk/core/src/test/java/org/apache/accumulo/core/file/rfile/MultiLevelIndexTest.java
* /accumulo/trunk/core/src/test/java/org/apache/accumulo/core/file/rfile/RFileTest.java
* /accumulo/trunk/core/src/test/java/org/apache/accumulo/core/file/rfile/RelativeKeyTest.java

                
> RFile should compress using common prefixes of key elements
> -----------------------------------------------------------
>
>                 Key: ACCUMULO-790
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-790
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Assignee: Christopher Tubbs
>              Labels: compression, file, hackathon, optimization, rfile
>             Fix For: 1.5.0
>
>
> Relative keys have proven themselves as a great way to compress within dimensions of the key. However, we could probably do better, since we know that our data is sorted lexicographically, we can make a reasonable assumption that we will get better compression if we only store the fact that a key (or portion of a key) has a common prefix with the previous key, even if it is not an exact match.
> Currently, in RFile, unused bits from the delete flag byte are being used to store flags that show whether an element of the key is exactly the same as the previous, or if it is different. We can change the semantics of these flags to store three states per element of the key: exact match as previous key, has a common prefix as previous key, no relative key compression. If we don't want to add a byte to store 2 bits for 3 states per element, we can just take the ordinal value of the unused 7 bits of the delete flag field and map it to an enumeration of relative key flags.
> In the case of a common prefix flag enabled for a given element of the current key when reading the RFile, we can interpret the first bytes of that element as a VInt expressing the length of the common prefix relative to the previous key's same element. Because this will add at least one byte to the the length of that element, we will not want to use the common prefix compression if the common prefix is less than 2 bytes. For less than 2 bytes in common (1 or 0 bytes in common), we'd select the no compression flag for that element.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (ACCUMULO-790) RFile should compress using common prefixes of key elements

Posted by "Christopher Tubbs (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ACCUMULO-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christopher Tubbs reassigned ACCUMULO-790:
------------------------------------------

    Assignee: Christopher Tubbs  (was: Keith Turner)
    
> RFile should compress using common prefixes of key elements
> -----------------------------------------------------------
>
>                 Key: ACCUMULO-790
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-790
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Assignee: Christopher Tubbs
>              Labels: compression, file, hackathon, optimization, rfile
>             Fix For: 1.5.0
>
>
> Relative keys have proven themselves as a great way to compress within dimensions of the key. However, we could probably do better, since we know that our data is sorted lexicographically, we can make a reasonable assumption that we will get better compression if we only store the fact that a key (or portion of a key) has a common prefix with the previous key, even if it is not an exact match.
> Currently, in RFile, unused bits from the delete flag byte are being used to store flags that show whether an element of the key is exactly the same as the previous, or if it is different. We can change the semantics of these flags to store three states per element of the key: exact match as previous key, has a common prefix as previous key, no relative key compression. If we don't want to add a byte to store 2 bits for 3 states per element, we can just take the ordinal value of the unused 7 bits of the delete flag field and map it to an enumeration of relative key flags.
> In the case of a common prefix flag enabled for a given element of the current key when reading the RFile, we can interpret the first bytes of that element as a VInt expressing the length of the common prefix relative to the previous key's same element. Because this will add at least one byte to the the length of that element, we will not want to use the common prefix compression if the common prefix is less than 2 bytes. For less than 2 bytes in common (1 or 0 bytes in common), we'd select the no compression flag for that element.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ACCUMULO-790) RFile should compress using common prefixes of key elements

Posted by "Christopher Tubbs (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501300#comment-13501300 ] 

Christopher Tubbs commented on ACCUMULO-790:
--------------------------------------------

Submitted working and tested patch in commit r1411745
                
> RFile should compress using common prefixes of key elements
> -----------------------------------------------------------
>
>                 Key: ACCUMULO-790
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-790
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Assignee: Christopher Tubbs
>              Labels: compression, file, hackathon, optimization, rfile
>             Fix For: 1.5.0
>
>
> Relative keys have proven themselves as a great way to compress within dimensions of the key. However, we could probably do better, since we know that our data is sorted lexicographically, we can make a reasonable assumption that we will get better compression if we only store the fact that a key (or portion of a key) has a common prefix with the previous key, even if it is not an exact match.
> Currently, in RFile, unused bits from the delete flag byte are being used to store flags that show whether an element of the key is exactly the same as the previous, or if it is different. We can change the semantics of these flags to store three states per element of the key: exact match as previous key, has a common prefix as previous key, no relative key compression. If we don't want to add a byte to store 2 bits for 3 states per element, we can just take the ordinal value of the unused 7 bits of the delete flag field and map it to an enumeration of relative key flags.
> In the case of a common prefix flag enabled for a given element of the current key when reading the RFile, we can interpret the first bytes of that element as a VInt expressing the length of the common prefix relative to the previous key's same element. Because this will add at least one byte to the the length of that element, we will not want to use the common prefix compression if the common prefix is less than 2 bytes. For less than 2 bytes in common (1 or 0 bytes in common), we'd select the no compression flag for that element.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ACCUMULO-790) RFile should compress using common prefixes of key elements

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501349#comment-13501349 ] 

Hudson commented on ACCUMULO-790:
---------------------------------

Integrated in Accumulo-Trunk #561 (See [https://builds.apache.org/job/Accumulo-Trunk/561/])
    ACCUMULO-790 Added missing version 6 test resource for testing prior version of RFile with latest code (Revision 1411754)

     Result = SUCCESS
ctubbsii : 
Files : 
* /accumulo/trunk/core/src/test/resources/org/apache/accumulo/core/file/rfile/ver_6.rf

                
> RFile should compress using common prefixes of key elements
> -----------------------------------------------------------
>
>                 Key: ACCUMULO-790
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-790
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Assignee: Christopher Tubbs
>              Labels: compression, file, hackathon, optimization, rfile
>             Fix For: 1.5.0
>
>
> Relative keys have proven themselves as a great way to compress within dimensions of the key. However, we could probably do better, since we know that our data is sorted lexicographically, we can make a reasonable assumption that we will get better compression if we only store the fact that a key (or portion of a key) has a common prefix with the previous key, even if it is not an exact match.
> Currently, in RFile, unused bits from the delete flag byte are being used to store flags that show whether an element of the key is exactly the same as the previous, or if it is different. We can change the semantics of these flags to store three states per element of the key: exact match as previous key, has a common prefix as previous key, no relative key compression. If we don't want to add a byte to store 2 bits for 3 states per element, we can just take the ordinal value of the unused 7 bits of the delete flag field and map it to an enumeration of relative key flags.
> In the case of a common prefix flag enabled for a given element of the current key when reading the RFile, we can interpret the first bytes of that element as a VInt expressing the length of the common prefix relative to the previous key's same element. Because this will add at least one byte to the the length of that element, we will not want to use the common prefix compression if the common prefix is less than 2 bytes. For less than 2 bytes in common (1 or 0 bytes in common), we'd select the no compression flag for that element.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (ACCUMULO-790) RFile should compress using common prefixes of key elements

Posted by "Drew Farris (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ACCUMULO-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Drew Farris updated ACCUMULO-790:
---------------------------------

    Labels: compression file hackathon optimization rfile  (was: compression file optimization rfile)
    
> RFile should compress using common prefixes of key elements
> -----------------------------------------------------------
>
>                 Key: ACCUMULO-790
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-790
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Assignee: Keith Turner
>              Labels: compression, file, hackathon, optimization, rfile
>             Fix For: 1.5.0
>
>
> Relative keys have proven themselves as a great way to compress within dimensions of the key. However, we could probably do better, since we know that our data is sorted lexicographically, we can make a reasonable assumption that we will get better compression if we only store the fact that a key (or portion of a key) has a common prefix with the previous key, even if it is not an exact match.
> Currently, in RFile, unused bits from the delete flag byte are being used to store flags that show whether an element of the key is exactly the same as the previous, or if it is different. We can change the semantics of these flags to store three states per element of the key: exact match as previous key, has a common prefix as previous key, no relative key compression. If we don't want to add a byte to store 2 bits for 3 states per element, we can just take the ordinal value of the unused 7 bits of the delete flag field and map it to an enumeration of relative key flags.
> In the case of a common prefix flag enabled for a given element of the current key when reading the RFile, we can interpret the first bytes of that element as a VInt expressing the length of the common prefix relative to the previous key's same element. Because this will add at least one byte to the the length of that element, we will not want to use the common prefix compression if the common prefix is less than 2 bytes. For less than 2 bytes in common (1 or 0 bytes in common), we'd select the no compression flag for that element.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ACCUMULO-790) RFile should compress using common prefixes of key elements

Posted by "Keith Turner (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491772#comment-13491772 ] 

Keith Turner commented on ACCUMULO-790:
---------------------------------------

When RFile was created not only was the file format considered, the end to end data process from client write to client read was also considered.   The encoding that is utilized in RFile is also utilized to send data over the wire.  RFile allowed much faster read than MapFile, but that speedup would not have mattered as much if the data could not be sent over the network faster.   See Key.compress() and Key.decompress().   If these changes further speed up read rates, can we do something to speed up transfer rates?  There may be nothing to do, but its something we can experiment with.


                
> RFile should compress using common prefixes of key elements
> -----------------------------------------------------------
>
>                 Key: ACCUMULO-790
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-790
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Assignee: Christopher Tubbs
>              Labels: compression, file, hackathon, optimization, rfile
>             Fix For: 1.5.0
>
>
> Relative keys have proven themselves as a great way to compress within dimensions of the key. However, we could probably do better, since we know that our data is sorted lexicographically, we can make a reasonable assumption that we will get better compression if we only store the fact that a key (or portion of a key) has a common prefix with the previous key, even if it is not an exact match.
> Currently, in RFile, unused bits from the delete flag byte are being used to store flags that show whether an element of the key is exactly the same as the previous, or if it is different. We can change the semantics of these flags to store three states per element of the key: exact match as previous key, has a common prefix as previous key, no relative key compression. If we don't want to add a byte to store 2 bits for 3 states per element, we can just take the ordinal value of the unused 7 bits of the delete flag field and map it to an enumeration of relative key flags.
> In the case of a common prefix flag enabled for a given element of the current key when reading the RFile, we can interpret the first bytes of that element as a VInt expressing the length of the common prefix relative to the previous key's same element. Because this will add at least one byte to the the length of that element, we will not want to use the common prefix compression if the common prefix is less than 2 bytes. For less than 2 bytes in common (1 or 0 bytes in common), we'd select the no compression flag for that element.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ACCUMULO-790) RFile should compress using common prefixes of key elements

Posted by "Keith Turner (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13470674#comment-13470674 ] 

Keith Turner commented on ACCUMULO-790:
---------------------------------------

For timestamps we could experiment with subtracting the previous timestamp and writing out the result using writeVLong().
                
> RFile should compress using common prefixes of key elements
> -----------------------------------------------------------
>
>                 Key: ACCUMULO-790
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-790
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Assignee: Keith Turner
>              Labels: compression, file, optimization, rfile
>             Fix For: 1.5.0
>
>
> Relative keys have proven themselves as a great way to compress within dimensions of the key. However, we could probably do better, since we know that our data is sorted lexicographically, we can make a reasonable assumption that we will get better compression if we only store the fact that a key (or portion of a key) has a common prefix with the previous key, even if it is not an exact match.
> Currently, in RFile, unused bits from the delete flag byte are being used to store flags that show whether an element of the key is exactly the same as the previous, or if it is different. We can change the semantics of these flags to store three states per element of the key: exact match as previous key, has a common prefix as previous key, no relative key compression. If we don't want to add a byte to store 2 bits for 3 states per element, we can just take the ordinal value of the unused 7 bits of the delete flag field and map it to an enumeration of relative key flags.
> In the case of a common prefix flag enabled for a given element of the current key when reading the RFile, we can interpret the first bytes of that element as a VInt expressing the length of the common prefix relative to the previous key's same element. Because this will add at least one byte to the the length of that element, we will not want to use the common prefix compression if the common prefix is less than 2 bytes. For less than 2 bytes in common (1 or 0 bytes in common), we'd select the no compression flag for that element.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ACCUMULO-790) RFile should compress using common prefixes of key elements

Posted by "Christopher Tubbs (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13475702#comment-13475702 ] 

Christopher Tubbs commented on ACCUMULO-790:
--------------------------------------------

After further investigation, we do need another byte to store the additional flags for the prefix compression. 3^5 > 2^7. (3^5 is 3 compression types for 5 dimensions, none/exact/prefix; 2^7 is the max number we can enumerate in the extra 7 bits of the delete byte).
                
> RFile should compress using common prefixes of key elements
> -----------------------------------------------------------
>
>                 Key: ACCUMULO-790
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-790
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Assignee: Keith Turner
>              Labels: compression, file, hackathon, optimization, rfile
>             Fix For: 1.5.0
>
>
> Relative keys have proven themselves as a great way to compress within dimensions of the key. However, we could probably do better, since we know that our data is sorted lexicographically, we can make a reasonable assumption that we will get better compression if we only store the fact that a key (or portion of a key) has a common prefix with the previous key, even if it is not an exact match.
> Currently, in RFile, unused bits from the delete flag byte are being used to store flags that show whether an element of the key is exactly the same as the previous, or if it is different. We can change the semantics of these flags to store three states per element of the key: exact match as previous key, has a common prefix as previous key, no relative key compression. If we don't want to add a byte to store 2 bits for 3 states per element, we can just take the ordinal value of the unused 7 bits of the delete flag field and map it to an enumeration of relative key flags.
> In the case of a common prefix flag enabled for a given element of the current key when reading the RFile, we can interpret the first bytes of that element as a VInt expressing the length of the common prefix relative to the previous key's same element. Because this will add at least one byte to the the length of that element, we will not want to use the common prefix compression if the common prefix is less than 2 bytes. For less than 2 bytes in common (1 or 0 bytes in common), we'd select the no compression flag for that element.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (ACCUMULO-790) RFile should compress using common prefixes of key elements

Posted by "Christopher Tubbs (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ACCUMULO-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christopher Tubbs updated ACCUMULO-790:
---------------------------------------

    Summary: RFile should compress using common prefixes of key elements  (was: RFile should compress using common prefixes of RowIDs)
    
> RFile should compress using common prefixes of key elements
> -----------------------------------------------------------
>
>                 Key: ACCUMULO-790
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-790
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Assignee: Keith Turner
>              Labels: compression, file, optimization, rfile
>             Fix For: 1.5.0
>
>
> Relative keys have proven themselves as a great way to compress within dimensions of the key. However, we could probably do better, since we know that our data is sorted lexicographically, we can make a reasonable assumption that we will get better compression if we only store the fact that a key (or portion of a key) has a common prefix with the previous key, even if it is not an exact match.
> Currently, in RFile, unused bits from the delete flag byte are being used to store flags that show whether an element of the key is exactly the same as the previous, or if it is different. We can change the semantics of these flags to store three states per element of the key: exact match as previous key, has a common prefix as previous key, no relative key compression. If we don't want to add a byte to store 2 bits for 3 states per element, we can just take the ordinal value of the unused 7 bits of the delete flag field and map it to an enumeration of relative key flags.
> In the case of a common prefix flag enabled for a given element of the current key when reading the RFile, we can interpret the first bytes of that element as a VInt expressing the length of the common prefix relative to the previous key's same element. Because this will add at least one byte to the the length of that element, we will not want to use the common prefix compression if the common prefix is less than 2 bytes. For less than 2 bytes in common (1 or 0 bytes in common), we'd select the no compression flag for that element.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira