Posted to dev@cassandra.apache.org by "David G. Boney" <db...@semanticartifacts.com> on 2011/01/31 05:41:38 UTC

Simple Compression Idea

I propose a simple idea for compression using a compressed string datatype. 

The compressed string datatype could be implemented for column family keys by creating a compressed string ordered partitioner. The compressed string ordered partitioner would work by decompressing the string and then applying an ordered partitioner for strings to the decompressed string. The hash-based partitioner could be used with the compressed string without any modification. The compressed string datatype could be implemented for column keys by creating a compressed string comparator. A compressed string comparator would work by decompressing the string and then applying a string comparator. The compressed string datatype could also be implemented for column values. The compressed string datatype would be an internal datatype for Cassandra. The compressed string would be converted back to a string before returning the value to a client. I suppose you could also have an option of passing the compressed form back to the client if the client wanted it that way.
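
A minimal Java sketch of the comparator half of this idea (the names here are illustrative only; the StringCodec interface and decompress() call stand in for whatever codec is chosen and are not Cassandra's actual AbstractType API):

import java.nio.charset.StandardCharsets;
import java.util.Comparator;

// Illustrative sketch: compare compressed values by decompressing first, then
// falling back to an ordinary UTF-8 string comparison, so the ordering matches
// what a plain string comparator would produce on the uncompressed data.
public final class CompressedStringComparator implements Comparator<byte[]> {

    /** Hypothetical codec contract: restore the original UTF-8 bytes. */
    public interface StringCodec {
        byte[] decompress(byte[] compressed);
    }

    private final StringCodec codec;

    public CompressedStringComparator(StringCodec codec) {
        this.codec = codec;
    }

    @Override
    public int compare(byte[] a, byte[] b) {
        String left  = new String(codec.decompress(a), StandardCharsets.UTF_8);
        String right = new String(codec.decompress(b), StandardCharsets.UTF_8);
        return left.compareTo(right);
    }
}

An ordered partitioner for compressed keys could delegate to the same codec in exactly the same way before handing the decompressed string to the existing string partitioner.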

I propose using an adaptive arithmetic coding compressor. This type of compression can be done a byte at a time, and it builds its compression model only from the string being presented, one byte at a time. See the papers below.

Moffat, Alistair, Radford M. Neal, & Ian H. Witten, (1998), "Arithmetic Coding Revisited", ACM Trans. on Info. Systems, Vol. 16, No. 3, pp. 256-294.

Witten, Ian H., Radford M. Neal, & John G. Cleary, (1987), "Arithmetic Coding for Data Compression", Communications of the ACM, Vol. 30, No. 6, pp. 520-540.

It has been reported that arithmetic coding based compression applied to text can achieve compression ratios as low as about 2.2 bits per character. Even assuming you only get 4 bits per character because of short strings, this would still be roughly 50% compression of text data, including keys and column names. Many applications would benefit. It should speed up the overall operation of Cassandra because you would be moving significantly less data through the system.
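
To make the adaptive part concrete, here is a minimal Java sketch, for illustration only (it is not code from the cited papers), of the order-0 adaptive byte model an arithmetic coder would consult: every byte value starts with an equal count, so the first symbols cost a full 8 bits, and the counts are bumped as bytes are seen, so repeated bytes quickly become cheaper.

// Illustrative order-0 adaptive model: all 256 byte values start equally
// likely; the frequency table is updated after each byte is coded. Only the
// modeling side is shown; a real arithmetic coder would turn these
// probabilities into a bit stream.
public final class AdaptiveByteModel {

    private final int[] freq = new int[256];
    private long total = 256;

    public AdaptiveByteModel() {
        java.util.Arrays.fill(freq, 1);          // uniform start: 8 bits per symbol
    }

    /** Estimated probability of byte b given everything seen so far. */
    public double probability(int b) {
        return (double) freq[b & 0xFF] / total;
    }

    /** Ideal code length for byte b, in bits. */
    public double bits(int b) {
        return -Math.log(probability(b)) / Math.log(2);
    }

    /** Update the model after byte b has been encoded or decoded. */
    public void update(int b) {
        freq[b & 0xFF]++;
        total++;
    }
}

Summing bits() over a string while calling update() after each byte gives a rough estimate of how quickly the coder would dip below 8 bits per byte on short strings.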

This would provide a compression option that could be implemented without any redesign of the internal structure of Cassandra, except for a new partitioner class, a new comparator class, a new datatype class, and the compression class.
-------------
Sincerely,
David G. Boney
dboney1@semanticartifacts.com
http://www.semanticartifacts.com





Re: Simple Compression Idea

Posted by Terje Marthinussen <tm...@gmail.com>.
There is a lot of overhead in the serialized data itself (just have a look at an sstable file).

It would be great to be able to compress at the byte array level rather than string.

Regards,
Terje

On 1 Feb 2011, at 03:15, "David G. Boney" <db...@semanticartifacts.com> wrote:

> In Cassandra, strings are stored as UTF-8. In arithmetic coding compression, the modeling is separate from the coding. A standard arrangement is to have a 0-order model (frequencies of individual bytes), a 1-order model (frequencies of two-byte sequences), and a 2-order model (frequencies of three-byte sequences). For multibyte Unicode encodings, which almost all fall in the two- or three-byte range, the higher-order models should capture the statistics. Having said that, there is nothing to prevent someone from developing a UTF-8 specific model that captures the statistics better.
> 
> An adaptive arithmetic coding model starts with the probabilities of all the symbols being equal. This results in all the symbols using 8-bits per encoding. The model adapts with each symbol it sees but it takes seeing some data to start seeing compression. I do not know, or have any reports from the literature, on how many bytes, on average, it takes to start seeing compression. All the papers I have seen have studied large blocks of text. 
> 
> My assumption is that the majority of strings to compress in Cassandra will be short strings, less than 32 bytes, to medium-length strings, less than 256 bytes. An example of short strings might be column names. An example of medium-length strings would be URLs. My assumption is that most short strings would not get much compression from adaptive arithmetic coding, but medium-length to longer strings would. In an effort to get higher compression on the short strings, static encoding methods could be used. A static encoding method uses a hard-coded frequency table for the bytes. The start of the compressed string could have a 2-bit code: 00 - no compression, 01 - static, default probability table, 10 - static, locale probability table, 11 - adaptive coding. The default static coding would be based on English. For the static locale probability table, the 2-bit code would need to be followed by a code indicating the probability table to use for a particular language. The locale probability tables would be stored in the compressor jar file as a separate class for each locale supported (this issue needs more analysis; what I don't think is effective is to load the probability tables from disk each time a new compressor is created). During the encoding phase, both adaptive and static coding of the string would occur. In general, compression with a static coding table is very fast; static coding is basically a lookup in a table with 256 entries. It is the adaptive coding that is more computationally intensive. Furthermore, you could place a limit on the use of static coding, say strings less than 256 bytes; this would need to be set empirically. The shorter output of the two methods is used as the compressed string. There is no additional burden on the decoding side, since you know the type of compression encoding.
> 
> It might be the case that block compressors can achieve greater compression ratios. From what I have read on the mailing list and JIRA, it appears that there is a lot of pain involved in implementing block-based compressors. This method, a compressed string data type, is presented as a method that is minimally invasive to the Cassandra architecture. Since everything is already stored as a byte array, nothing would need to be changed in the internal data types. Furthermore, no surgery on the internal tables is needed. The main addition is the development of comparators, for keys and column headings, that know how to decompress the byte arrays before comparing them. There is also the additional work of writing the codec, but there are a number of examples in C and Java from which to draw. Moffat's web site would be a good place to start.
> 
> -------------
> Sincerely,
> David G. Boney
> dboney1@semanticartifacts.com
> http://www.semanticartifacts.com
> 
> 
> 
> 
> On Jan 31, 2011, at 2:19 AM, Stephen Connolly wrote:
> 
>> On 31 January 2011 04:41, David G. Boney <db...@semanticartifacts.com> wrote:
>>> I propose a simple idea for compression using a compressed string datatype.
>>> 
>>> The compressed string datatype could be implemented for column family keys by creating a compressed string ordered partitioner. The compressed string ordered partitioner works by decompressing the string and then applying an ordered partitioner for strings to the decompressed string. The hash based partitioner would be used with the compressed string without any modification. The compressed string datatype could be implemented for column keys by creating a compressed string comparator. A compressed string comparator would work by decompressing the string and then applying a string comparator. The compressed string datatype could be implemented for column values. The compressed string datatype would be an internal datatype for Cassandra. The compressed string would be converted to a string before returning the value to a client. I suppose you could also have an option of passing the compressed form back to the client if the client wanted it that way.
>>> 
>>> I propose using an adaptive arithmetic coding compressor. This type of compression can be done a byte at a time. It will build a compression model only on the string that is presented, a byte at a time. See the below papers.
>>> 
>>> Moffat, Alistair, Radford M. Neal, & Ian H. Witten, (1998), "Arithmetic Coding Revisited", ACM Trans. on Info. Systems, Vol. 16, No. 3, pp. 256-294.
>>> 
>>> Witten, Ian H., Radford M. Neal, & John G. Cleary, (1987), "Arithmetic Coding for Data Compression", Communications of the ACM, Vol. 30, No. 6, pp. 520-540.
>>> 
>>> It has been reported that arithmetic coding based compression applied to text can achieve compression ratios as low as about 2.2 bits per character. Even assuming you only get 4 bits per character because of short strings, this would still be roughly 50% compression of text data, including keys and column names. Many applications would benefit. It should speed up the overall operation of Cassandra because you would be moving significantly less data through the system.
>> 
>> I have not read those papers but how do they and their reported
>> compressions apply to the unicode characters that java strings are
>> stored as?
>> 
>> -Stephen
>> 
>>> 
>>> This would provide a compression option that could be implemented without any redesign of the internal structure of Cassandra, except for a new partitioner class, a new comparator class, a new datatype class, and the compression class.
>>> -------------
>>> Sincerely,
>>> David G. Boney
>>> dboney1@semanticartifacts.com
>>> http://www.semanticartifacts.com
>>> 
>>> 
>>> 
>>> 
>>> 
> 

Re: Simple Compression Idea

Posted by "David G. Boney" <db...@semanticartifacts.com>.
Is the partitioner the only code that does comparisons on the keys of a column family? What about get_range_slices(): does it use only the partitioner's comparison method?
-------------
Sincerely,
David G. Boney
dboney1@semanticartifacts.com
http://www.semanticartifacts.com




On Jan 31, 2011, at 12:15 PM, David G. Boney wrote:

> In Cassandra, strings are stored as UTF-8. In arithmetic coding compression, the modeling is separate from the coding. A standard arrangement is to have a 0-order model (frequencies of individual bytes), a 1-order model (frequencies of two-byte sequences), and a 2-order model (frequencies of three-byte sequences). For multibyte Unicode encodings, which almost all fall in the two- or three-byte range, the higher-order models should capture the statistics. Having said that, there is nothing to prevent someone from developing a UTF-8 specific model that captures the statistics better.
> 
> An adaptive arithmetic coding model starts with the probabilities of all the symbols being equal. This results in all the symbols using 8-bits per encoding. The model adapts with each symbol it sees but it takes seeing some data to start seeing compression. I do not know, or have any reports from the literature, on how many bytes, on average, it takes to start seeing compression. All the papers I have seen have studied large blocks of text. 
> 
> My assumption is that the majority of strings to compress in Cassandra will be short strings, less than 32 bytes, to medium-length strings, less than 256 bytes. An example of short strings might be column names. An example of medium-length strings would be URLs. My assumption is that most short strings would not get much compression from adaptive arithmetic coding, but medium-length to longer strings would. In an effort to get higher compression on the short strings, static encoding methods could be used. A static encoding method uses a hard-coded frequency table for the bytes. The start of the compressed string could have a 2-bit code: 00 - no compression, 01 - static, default probability table, 10 - static, locale probability table, 11 - adaptive coding. The default static coding would be based on English. For the static locale probability table, the 2-bit code would need to be followed by a code indicating the probability table to use for a particular language. The locale probability tables would be stored in the compressor jar file as a separate class for each locale supported (this issue needs more analysis; what I don't think is effective is to load the probability tables from disk each time a new compressor is created). During the encoding phase, both adaptive and static coding of the string would occur. In general, compression with a static coding table is very fast; static coding is basically a lookup in a table with 256 entries. It is the adaptive coding that is more computationally intensive. Furthermore, you could place a limit on the use of static coding, say strings less than 256 bytes; this would need to be set empirically. The shorter output of the two methods is used as the compressed string. There is no additional burden on the decoding side, since you know the type of compression encoding.
> 
> It might be the case that block compressors can achieve greater compression ratios. From what I have read on the mailing list and JIRA, it appears that there is a lot of pain involved in implementing block-based compressors. This method, a compressed string data type, is presented as a method that is minimally invasive to the Cassandra architecture. Since everything is already stored as a byte array, nothing would need to be changed in the internal data types. Furthermore, no surgery on the internal tables is needed. The main addition is the development of comparators, for keys and column headings, that know how to decompress the byte arrays before comparing them. There is also the additional work of writing the codec, but there are a number of examples in C and Java from which to draw. Moffat's web site would be a good place to start.
> 
> -------------
> Sincerely,
> David G. Boney
> dboney1@semanticartifacts.com
> http://www.semanticartifacts.com
> 
> 
> 
> 
> On Jan 31, 2011, at 2:19 AM, Stephen Connolly wrote:
> 
>> On 31 January 2011 04:41, David G. Boney <db...@semanticartifacts.com> wrote:
>>> I propose a simple idea for compression using a compressed string datatype.
>>> 
>>> The compressed string datatype could be implemented for column family keys by creating a compressed string ordered partitioner. The compressed string ordered partitioner works by decompressing the string and then applying an ordered partitioner for strings to the decompressed string. The hash based partitioner would be used with the compressed string without any modification. The compressed string datatype could be implemented for column keys by creating a compressed string comparator. A compressed string comparator would work by decompressing the string and then applying a string comparator. The compressed string datatype could be implemented for column values. The compressed string datatype would be an internal datatype for Cassandra. The compressed string would be converted to a string before returning the value to a client. I suppose you could also have an option of passing the compressed form back to the client if the client wanted it that way.
>>> 
>>> I propose using an adaptive arithmetic coding compressor. This type of compression can be done a byte at a time. It will build a compression model only on the string that is presented, a byte at a time. See the below papers.
>>> 
>>> Moffat, Alistair, Radford M. Neal, & Ian H. Witten, (1998), "Arithmetic Coding Revisited", ACM Trans. on Info. Systems, Vol. 16, No. 3, pp. 256-294.
>>> 
>>> Witten, Ian H., Radford M. Neal, & John G. Cleary, (1987), "Arithmetic Coding for Data Compression", Communications of the ACM, Vol. 30, No. 6, pp. 520-540.
>>> 
>>> It has been reported that arithmetic coding based compression applied to text can achieve compression ratios as low as about 2.2 bits per character. Even assuming you only get 4 bits per character because of short strings, this would still be roughly 50% compression of text data, including keys and column names. Many applications would benefit. It should speed up the overall operation of Cassandra because you would be moving significantly less data through the system.
>> 
>> I have not read those papers but how do they and their reported
>> compressions apply to the unicode characters that java strings are
>> stored as?
>> 
>> -Stephen
>> 
>>> 
>>> This would provide a compression option that could be implemented without any redesign of the internal structure of Cassandra, except for a new partitioner class, a new comparator class, a new datatype class, and the compression class.
>>> -------------
>>> Sincerely,
>>> David G. Boney
>>> dboney1@semanticartifacts.com
>>> http://www.semanticartifacts.com
>>> 
>>> 
>>> 
>>> 
>>> 
> 


Re: Simple Compression Idea

Posted by Mike Malone <mj...@gmail.com>.
I don't see anything inherently wrong with your proposal; it would almost
certainly be beneficial in certain scenarios. We use what could be called
"static compression" (Golomb-esque encodings) for some data types on our
Cassandra clusters. It's useful for representing things like full-precision
decimals efficiently. It sounds like your solution would be a more generic
mechanism for doing that sort of thing.
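
For readers unfamiliar with the Golomb-style codes mentioned above, here is a minimal Rice-coding sketch in Java (a Golomb code whose divisor is a power of two); the parameter k is fixed up front, so it is "static" in the sense discussed in this thread, and the bit-string output is only for illustration:

// Illustrative Rice code: write the quotient n / 2^k in unary, then the
// remainder in k binary digits. Small values come out short, which is why
// this style of code suits scaled decimals, deltas, and similar data.
public final class RiceCode {

    public static String encode(long n, int k) {
        if (n < 0 || k < 0) {
            throw new IllegalArgumentException("n and k must be non-negative");
        }
        long q = n >>> k;                          // quotient, written in unary
        StringBuilder out = new StringBuilder();
        for (long i = 0; i < q; i++) {
            out.append('1');
        }
        out.append('0');                           // unary terminator
        for (int bit = k - 1; bit >= 0; bit--) {   // remainder, k binary digits
            out.append((n >>> bit) & 1L);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // With k = 2: 0 -> "000", 5 -> "1001", 13 -> "111001"
        System.out.println(RiceCode.encode(13, 2));
    }
}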

However, unless I'm misunderstanding things, I don't think your proposal is
a sufficient solution to compression in Cassandra in general. Specific
algorithms aside, there's simply too much regularity (i.e., opportunity for
compression) in the data structures and metadata that are being written to
disk. For instance: multiple rows in a column family often contain columns
with the same name. And column names in a row often share a common prefix.
This sort of regularity can be exploited by a block-level compression
method, but not by a mechanism that simply re-encodes column values.

Mike

On Mon, Jan 31, 2011 at 10:15 AM, David G. Boney <
dboney1@semanticartifacts.com> wrote:

> In Cassandra, strings are stored as UTF-8. In arithmetic coding
> compression, the modeling is separate from the coding. A standard
> arrangement is to have a 0-order model, frequencies of individual bytes,
> 1-order model, frequencies of two byte occurrences, and 2-order models,
> frequencies of three byte occurrences.  For multibyte unicode encodings,
> which almost all fall in the two or three byte encodings, the higher order
> models should capture the statistics. Having said that, there is nothing to
> prevent someone from developing a UTF-8 specific model for capturing the
> statistics better.
>
> An adaptive arithmetic coding model starts with the probabilities of all
> the symbols being equal. This results in all the symbols using 8-bits per
> encoding. The model adapts with each symbol it sees but it takes seeing some
> data to start seeing compression. I do not know, or have any reports from
> the literature, on how many bytes, on average, it takes to start seeing
> compression. All the papers I have seen have studied large blocks of text.
>
> My assumption is that the majority of strings to compress in Cassandra will
> be short strings, less than 32 bytes, to medium-length strings, less than
> 256 bytes. An example of short strings might be column names. An example of
> medium length strings would be URLs. My assumption is that most short
> strings would not get much compression from adaptive arithmetic coding
> compression but medium length to longer strings would. In an effort to get
> higher compression on the short strings, static encoding methods could be
> used. A static encoding method uses a hard coded frequency table for the
> bytes. The start of the compressed string could have a 2 bit code, 00-no
> compression, 01-static, default probability table, 10-static, locale
> probability table, 11-adaptive coding. The default static coding would be
> based on English. For the static locale probability table, the 2 bit code
> would need to be followed by a code table, indicating the probability table
> to use for a particular language. The locale probability tables would be
> stored in the compressor jar file as a separate class for each locale
> supported (This issue needs more analysis, what I don't think is effective
> is to load the probability tables from disk each time a new compressor is
> created). During the encoding phase, both adaptive and static coding of the
> string would occur. In general, compression with a static coding table is
> very fast. Static coding is basically a lookup in a table with 256
> entries. It is the adaptive coding that is more computationally intensive.
> Furthermore, you could place a limit on the use of static coding, say
> strings less than 256 bytes. This would need to be set empirically. The
> shorter string from the two methods is used as the compressed string. There
> is no additional burden on the decoding side, since you know the type of
> compression encoding.
>
> It might be the case that block compressors can achieve greater compression
> ratios. From what I have read on the mailing list and JIRA, it appears that
> there is a lot of pain involved to implement block based compressors. This
> method, a compressed string data type, is presented as a method that is
> minimally invasive to the Cassandra architecture. Since everything is
> already stored as a byte array, nothing would need to be changed in the
> internal data types. Furthermore, no surgery on the internal tables is needed.
> The main addition is the development of comparators, for keys and column
> headings, that know how to decompress the byte arrays before comparing them.
> There is also the additional work of writing the codec, but there are a
> number of examples in C and Java from which to draw. Moffat's web site would
> be a good place to start.
>
> -------------
> Sincerely,
> David G. Boney
> dboney1@semanticartifacts.com
> http://www.semanticartifacts.com
>
>
>
>
> On Jan 31, 2011, at 2:19 AM, Stephen Connolly wrote:
>
> > On 31 January 2011 04:41, David G. Boney <db...@semanticartifacts.com>
> wrote:
> >> I propose a simple idea for compression using a compressed string
> datatype.
> >>
> >> The compressed string datatype could be implemented for column family
> keys by creating a compressed string ordered partitioner. The compressed
> string ordered partitioner works by decompressing the string and then
> applying an ordered partitioner for strings to the decompressed string. The
> hash based partitioner would be used with the compressed string without any
> modification. The compressed string datatype could be implemented for column
> keys by creating a compressed string comparator. A compressed string
> comparator would work by decompressing the string and then applying a string
> comparator. The compressed string datatype could be implemented for column
> values. The compressed string datatype would be an internal datatype for
> Cassandra. The compressed string would be converted to a string before
> returning the value to a client. I suppose you could also have an option of
> passing the compressed form back to the client if the client wanted it that
> way.
> >>
> >> I propose using an adaptive arithmetic coding compressor. This type of
> compression can be done a byte at a time. It will build a compression model
> only on the string that is presented, a byte at a time. See the below
> papers.
> >>
> >> Moffat, Alistair, Radford M. Neal, & Ian H. Witten, (1998), "Arithmetic
> Coding Revisited", ACM Trans. on Info. Systems, Vol. 16, No. 3, pp. 256-294.
> >>
> >> Witten, Ian H., Radford M. Neal, & John G. Cleary, (1987), "Arithmetic
> Coding for Data Compression", Communications of the ACM, Vol. 30, No. 6, pp.
> 520-540.
> >>
> >> It has been reported that arithmetic coding based compression applied to
> text can achieve compression ratios as low as about 2.2 bits per character.
> Even assuming you only get 4 bits per character because of short strings, this
> would still be roughly 50% compression of text data, including keys and column
> names. Many
> applications would benefit. It should speed up the overall operation of
> Cassandra because you would be moving significantly less data through the
> system.
> >
> > I have not read those papers but how do they and their reported
> > compressions apply to the unicode characters that java strings are
> > stored as?
> >
> > -Stephen
> >
> >>
> >> This would provide a compression option that could be implemented
> without any redesign of the internal structure of Cassandra, except for a
> new partitioner class, a new comparator class, a new datatype class, and the
> compression class.
> >> -------------
> >> Sincerely,
> >> David G. Boney
> >> dboney1@semanticartifacts.com
> >> http://www.semanticartifacts.com
> >>
> >>
> >>
> >>
> >>
>
>

Re: Simple Compression Idea

Posted by "David G. Boney" <db...@semanticartifacts.com>.
In Cassandra, strings are stored as UTF-8. In arithmetic coding compression, the modeling is separate from the coding. A standard arrangement is to have a 0-order model (frequencies of individual bytes), a 1-order model (frequencies of two-byte sequences), and a 2-order model (frequencies of three-byte sequences). For multibyte Unicode encodings, which almost all fall in the two- or three-byte range, the higher-order models should capture the statistics. Having said that, there is nothing to prevent someone from developing a UTF-8 specific model that captures the statistics better.
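
A rough Java sketch, for illustration only (not code from the cited papers), of what the 1-order part of such a model looks like: the previous byte selects a 256-entry frequency table for predicting the current byte, which is how the lead and continuation bytes of multibyte UTF-8 characters become cheap to code once they have been seen a few times.

// Illustrative order-1 model: probabilities are conditioned on the previous
// byte, so recurring byte pairs (such as the bytes of a multibyte UTF-8
// character) quickly get high probability and therefore short codes.
public final class Order1ByteModel {

    private final int[][] freq = new int[256][256];  // freq[previous][current]
    private final long[] totals = new long[256];
    private int prev = 0;                            // context; 0 as a neutral start

    public Order1ByteModel() {
        for (int context = 0; context < 256; context++) {
            java.util.Arrays.fill(freq[context], 1); // uniform start in every context
            totals[context] = 256;
        }
    }

    /** P(current byte = b | previous byte), as an arithmetic coder would use it. */
    public double probability(int b) {
        return (double) freq[prev][b & 0xFF] / totals[prev];
    }

    /** Advance the model after coding byte b. */
    public void update(int b) {
        freq[prev][b & 0xFF]++;
        totals[prev]++;
        prev = b & 0xFF;
    }
}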

An adaptive arithmetic coding model starts with the probabilities of all the symbols being equal, which means every symbol initially costs 8 bits to encode. The model adapts with each symbol it sees, but it takes some data before compression starts to show. I do not know, and have not seen reports in the literature on, how many bytes it takes on average before compression kicks in. All the papers I have seen have studied large blocks of text.

My assumption is that the majority of strings to compress in Cassandra will be short strings, less than 32 bytes, to medium-length strings, less than 256 bytes. An example of short strings might be column names. An example of medium-length strings would be URLs. My assumption is that most short strings would not get much compression from adaptive arithmetic coding, but medium-length to longer strings would. In an effort to get higher compression on the short strings, static encoding methods could be used. A static encoding method uses a hard-coded frequency table for the bytes. The start of the compressed string could have a 2-bit code: 00 - no compression, 01 - static, default probability table, 10 - static, locale probability table, 11 - adaptive coding. The default static coding would be based on English. For the static locale probability table, the 2-bit code would need to be followed by a code indicating the probability table to use for a particular language. The locale probability tables would be stored in the compressor jar file as a separate class for each locale supported (this issue needs more analysis; what I don't think is effective is to load the probability tables from disk each time a new compressor is created). During the encoding phase, both adaptive and static coding of the string would occur. In general, compression with a static coding table is very fast; static coding is basically a lookup in a table with 256 entries. It is the adaptive coding that is more computationally intensive. Furthermore, you could place a limit on the use of static coding, say strings less than 256 bytes; this would need to be set empirically. The shorter output of the two methods is used as the compressed string. There is no additional burden on the decoding side, since you know the type of compression encoding.
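
A sketch of the selection logic described above, with the static and adaptive codecs left as assumed interfaces (they are placeholders, not real Cassandra or library APIs); for simplicity the 2-bit mode is stored in a whole leading header byte here rather than packed into the bit stream:

// Illustrative encoder wrapper: try static and adaptive coding, keep whichever
// output is shorter (falling back to the raw bytes), and record the choice in a
// mode marker so the decoder knows how to reverse it.
public final class CompressedStringEncoder {

    public static final byte MODE_NONE = 0;           // 00 - no compression
    public static final byte MODE_STATIC = 1;         // 01 - static, default table
    public static final byte MODE_STATIC_LOCALE = 2;  // 10 - static, locale table
    public static final byte MODE_ADAPTIVE = 3;       // 11 - adaptive coding

    /** Hypothetical codec contract. */
    public interface Codec {
        byte[] compress(byte[] utf8);
    }

    private final Codec staticCodec;
    private final Codec adaptiveCodec;

    public CompressedStringEncoder(Codec staticCodec, Codec adaptiveCodec) {
        this.staticCodec = staticCodec;
        this.adaptiveCodec = adaptiveCodec;
    }

    public byte[] encode(byte[] utf8) {
        byte[] best = withMode(MODE_NONE, utf8);       // raw bytes as the fallback
        byte[] stat = withMode(MODE_STATIC, staticCodec.compress(utf8));
        byte[] adap = withMode(MODE_ADAPTIVE, adaptiveCodec.compress(utf8));
        if (stat.length < best.length) best = stat;
        if (adap.length < best.length) best = adap;
        return best;
    }

    private static byte[] withMode(byte mode, byte[] body) {
        byte[] out = new byte[body.length + 1];
        out[0] = mode;                                 // one header byte for clarity
        System.arraycopy(body, 0, out, 1, body.length);
        return out;
    }
}

The locale-table mode (MODE_STATIC_LOCALE) would additionally need the locale code written after the marker, as described above; decoding just switches on the marker, which is why there is no extra cost on that side.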

It might be the case that block compressors can achieve greater compression ratios. From what I have read on the mailing list and JIRA, it appears that there is a lot of pain involved in implementing block-based compressors. This method, a compressed string data type, is presented as a method that is minimally invasive to the Cassandra architecture. Since everything is already stored as a byte array, nothing would need to be changed in the internal data types. Furthermore, no surgery on the internal tables is needed. The main addition is the development of comparators, for keys and column headings, that know how to decompress the byte arrays before comparing them. There is also the additional work of writing the codec, but there are a number of examples in C and Java from which to draw. Moffat's web site would be a good place to start.

-------------
Sincerely,
David G. Boney
dboney1@semanticartifacts.com
http://www.semanticartifacts.com




On Jan 31, 2011, at 2:19 AM, Stephen Connolly wrote:

> On 31 January 2011 04:41, David G. Boney <db...@semanticartifacts.com> wrote:
>> I propose a simple idea for compression using a compressed string datatype.
>> 
>> The compressed string datatype could be implemented for column family keys by creating a compressed string ordered partitioner. The compressed string ordered partitioner works by decompressing the string and then applying an ordered partitioner for strings to the decompressed string. The hash based partitioner would be used with the compressed string without any modification. The compressed string datatype could be implemented for column keys by creating a compressed string comparator. A compressed string comparator would work by decompressing the string and then applying a string comparator. The compressed string datatype could be implemented for column values. The compressed string datatype would be an internal datatype for Cassandra. The compressed string would be converted to a string before returning the value to a client. I suppose you could also have an option of passing the compressed form back to the client if the client wanted it that way.
>> 
>> I propose using an adaptive arithmetic coding compressor. This type of compression can be done a byte at a time. It will build a compression model only on the string that is presented, a byte at a time. See the below papers.
>> 
>> Moffat, Alistair, Radford M. Neal, & Ian H. Witten, (1998), "Arithmetic Coding Revisited", ACM Trans. on Info. Systems, Vol. 16, No. 3, pp. 256-294.
>> 
>> Witten, Ian H., Radford M. Neal, & John G. Cleary, (1987), "Arithmetic Coding for Data Compression", Communications of the ACM, Vol. 30, No. 6, pp. 520-540.
>> 
>> It has been reported that arithmetic coding based compression applied to text can achieve compression ratios as low as about 2.2 bits per character. Even assuming you only get 4 bits per character because of short strings, this would still be roughly 50% compression of text data, including keys and column names. Many applications would benefit. It should speed up the overall operation of Cassandra because you would be moving significantly less data through the system.
> 
> I have not read those papers but how do they and their reported
> compressions apply to the unicode characters that java strings are
> stored as?
> 
> -Stephen
> 
>> 
>> This would provide a compression option that could be implemented without any redesign of the internal structure of Cassandra, except for a new partitioner class, a new comparator class, a new datatype class, and the compression class.
>> -------------
>> Sincerely,
>> David G. Boney
>> dboney1@semanticartifacts.com
>> http://www.semanticartifacts.com
>> 
>> 
>> 
>> 
>> 


Re: Simple Compression Idea

Posted by Stephen Connolly <st...@gmail.com>.
On 31 January 2011 04:41, David G. Boney <db...@semanticartifacts.com> wrote:
> I propose a simple idea for compression using a compressed string datatype.
>
> The compressed string datatype could be implemented for column family keys by creating a compressed string ordered partitioner. The compressed string ordered partitioner works by decompressing the string and then applying an ordered partitioner for strings to the decompressed string. The hash based partitioner would be used with the compressed string without any modification. The compressed string datatype could be implemented for column keys by creating a compressed string comparator. A compressed string comparator would work by decompressing the string and then applying a string comparator. The compressed string datatype could be implemented for column values. The compressed string datatype would be an internal datatype for Cassandra. The compressed string would be converted to a string before returning the value to a client. I suppose you could also have an option of passing the compressed form back to the client if the client wanted it that way.
>
> I propose using an adaptive arithmetic coding compressor. This type of compression can be done a byte at a time. It will build a compression model only on the string that is presented, a byte at a time. See the below papers.
>
> Moffat, Alistair, Radford M. Neal, & Ian H. Witten, (1998), "Arithmetic Coding Revisited", ACM Trans. on Info. Systems, Vol. 16, No. 3, pp. 256-294.
>
> Witten, Ian H., Radford M. Neal, & John G. Cleary, (1987), "Arithmetic Coding for Data Compression", Communications of the ACM, Vol. 30, No. 6, pp. 520-540.
>
> It has been reported that arithmetic coding based compression applied to text can achieve compression ratios as low as about 2.2 bits per character. Even assuming you only get 4 bits per character because of short strings, this would still be roughly 50% compression of text data, including keys and column names. Many applications would benefit. It should speed up the overall operation of Cassandra because you would be moving significantly less data through the system.

I have not read those papers, but how do they and their reported
compression results apply to the Unicode characters that Java strings are
stored as?

-Stephen

>
> This would provide a compression option that could be implemented without any redesign of the internal structure of Cassandra, except for a new partitioner class, a new comparator class, a new datatype class, and the compression class.
> -------------
> Sincerely,
> David G. Boney
> dboney1@semanticartifacts.com
> http://www.semanticartifacts.com
>
>
>
>
>