Posted to user@accumulo.apache.org by "roman.drapeko@baesystems.com" <ro...@baesystems.com> on 2015/08/24 18:58:59 UTC

visibility expression & column compression

Hi there,

My question is how Accumulo compression works with regard to visibility labels.

Is there any difference between "VeryLargeLargeLarge & AlsoLargeLargeLarge" and "A&B" expressions? Will the expression be compiled internally into a more compact structure?

Same question applies to column and qualifier names. Is there any difference?

The reason for this question is simple - we are trying to find out what would be the data utilization overhead for different approaches.

Regards
Roman

Re: visibility expression & column compression

Posted by Josh Elser <jo...@gmail.com>.
Visibility labels are not replaced with any other type of identifier, 
which means that, considering nothing else, a visibility label which has 
20 characters will take up more space than one that only has 2 
characters. This is a conscious decision to make sure it is completely 
obvious what the label on some data is without an external lookup table.


Accumulo uses two strategies to reduce the size of data on disk: 
run-length encoding and a compression algorithm. The run-length encoding 
is used to prevent common prefixes in sequential Keys from being stored 
multiple times. For example, given the following Keys:

row1 cf:cq []
row2 cf:cq []

the RLE would prevent "row" from being stored a second time. Families 
and qualifiers would only be replaced with a back-reference if there is 
a common Key-prefix that extends into the family or qualifier.
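
To make the prefix idea concrete, here is a minimal Python sketch of 
storing only the number of bytes shared with the previous value plus the 
differing suffix. It is a toy model of a single field, not the actual 
RFile on-disk format, and the function names are hypothetical.

```python
def shared_prefix_len(prev: bytes, cur: bytes) -> int:
    """Count the leading bytes that prev and cur have in common."""
    n = 0
    for a, b in zip(prev, cur):
        if a != b:
            break
        n += 1
    return n


def encode_field(prev: bytes, cur: bytes):
    """Return (shared_len, suffix): only the suffix needs to be stored."""
    n = shared_prefix_len(prev, cur)
    return n, cur[n:]


def decode_field(prev: bytes, shared_len: int, suffix: bytes) -> bytes:
    """Rebuild the current value from the previous one and the stored suffix."""
    return prev[:shared_len] + suffix


# Rows from the example above: "row" is shared, only "2" is stored.
shared, suffix = encode_field(b"row1", b"row2")
restored = decode_field(b"row1", shared, suffix)
```

In this sketch the second row costs one stored byte plus a small length 
value instead of four bytes, and the reader can still rebuild the full 
value from the preceding key.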

A compression algorithm, GZ by default, is then applied to the result of 
the encoding. Snappy is another common compression algorithm used by 
Accumulo instances.
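
As a rough way to get a feel for the overhead, the following Python 
sketch compresses a synthetic list of keys with zlib (the DEFLATE 
algorithm that gzip uses) for a long and a short visibility label. The 
key layout, the entry count and the helper name are assumptions made for 
illustration; this does not write a real RFile, so measure real tables 
before drawing conclusions.

```python
import zlib


def raw_and_compressed_size(label: bytes, count: int = 1000):
    """Build `count` toy key lines with the given label; return (raw, compressed) sizes."""
    raw = b"".join(
        b"row%06d cf:cq [%s]\n" % (i, label) for i in range(count)
    )
    return len(raw), len(zlib.compress(raw))


long_label = b"VeryLargeLargeLarge&AlsoLargeLargeLarge"
short_label = b"A&B"

raw_long, comp_long = raw_and_compressed_size(long_label)
raw_short, comp_short = raw_and_compressed_size(short_label)

print("long label :", raw_long, "bytes raw,", comp_long, "bytes compressed")
print("short label:", raw_short, "bytes raw,", comp_short, "bytes compressed")
```

Uncompressed, the long label costs 36 extra bytes per entry. Because the 
same label repeats in every entry, the compressed sizes differ by far 
less than the raw sizes in this toy data.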

- Josh


Re: visibility expression & column compression

Posted by Christopher <ct...@apache.org>.
Resending (see below) due to brief ASF email outage.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Mon, Aug 24, 2015 at 2:54 PM, Christopher <ct...@apache.org> wrote:
> Accumulo has a few kinds of compression inside RFiles that apply to
> visibility expressions.
>
> First, there's the block compression in the file. This is going to be
> gzip, or another supported compression type. But, before that, we have
> a couple of ways to reduce the size of the data written:
>
> 1. if the visibility expression of one key is exactly the same as
> that of the key which immediately preceded it, VE(K) == VE(K-1), the
> RFile writer stores a flag which instructs the reader to re-use the
> previous visibility expression, in lieu of the visibility expression
> itself.
>
> 2. in the case of non-exact matches, the RFile writer stores the
> number of bytes it shares with the previous key as a common prefix,
> and then the rest of the bytes which are different.
>
> (Note: these optimizations actually apply to the row, colfam, colqual,
> too, but you specifically asked about colvis.)
>
> What we don't do is create a lookup table or anything like that. We
> think it's really important that the visibility be stored with the
> data it protects, so that the visibility is always there for
> determining authorization to read it. So, we don't do anything beyond
> the few small optimizations during serialization, and certainly
> nothing that would separate the data too far from its visibility
> expression.
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii