Posted to user@hbase.apache.org by Prakash Kadel <pr...@gmail.com> on 2013/04/03 15:42:30 UTC

should i use compression?

Hello,
    I have a question.
    I have a table where I store data in the column qualifiers (the values themselves are null).
    I have just one column family.
    The number of columns per row varies (from 1 to a few thousand).

Currently I don't use compression or data_block_encoding.

Should I?
I want faster reads.

Please advise.


Sincerely,
Prakash Kadel

Re: should i use compression?

Posted by Marcos Luis Ortiz Valmaseda <ma...@gmail.com>.
+1 for Ted's advice.
Using compression can save a lot of space in memory and on disk, so it's a
good recommendation.



2013/4/3 Ted Yu <yu...@gmail.com>

> You should use data block encoding (in 0.94.x releases only). It is helpful
> for reads.
>
> You can also enable compression.
>
> Cheers
>
>
> On Wed, Apr 3, 2013 at 6:42 AM, Prakash Kadel <prakash.kadel@gmail.com
> >wrote:
>
> > Hello,
> >     I have a question.
> >     I have a table where i store data in the column qualifiers(the values
> > itself are null).
> >     I just have 1 column family.
> >    The number of columns per row is variable (1~ few thousands)
> >
> > Currently i don't use compression or the data_block_encoding.
> >
> > Should i?
> > I want to have faster reads.
> >
> > Please suggest.
> >
> >
> > Sincerely,
> > Prakash Kadel
>



-- 
Marcos Ortiz Valmaseda,
*Data-Driven Product Manager* at PDVSA
*Blog*: http://dataddict.wordpress.com/
*LinkedIn: *http://www.linkedin.com/in/marcosluis2186
*Twitter*: @marcosluis2186 <http://twitter.com/marcosluis2186>

Re: should i use compression?

Posted by Marcos Luis Ortiz Valmaseda <ma...@gmail.com>.
Here's the API documentation:

*FAST_DIFF*:
http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.html

"Encoder similar to DiffKeyDeltaEncoder
<http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/io/encoding/DiffKeyDeltaEncoder.html>
but supposedly faster.

Compress using:
- store size of common prefix
- save column family once in the first KeyValue
- use integer compression for key, value and prefix (7-bit encoding)
- use bits to avoid duplication key length, value length and type if it
  same as previous
- store in 3 bits length of prefix timestamp with previous KeyValue's
  timestamp
- one bit which allow to omit value if it is the same

Format:
- 1 byte: flag
- 1-5 bytes: key length (only if FLAG_SAME_KEY_LENGTH is not set in flag)
- 1-5 bytes: value length (only if FLAG_SAME_VALUE_LENGTH is not set in
  flag)
- 1-5 bytes: prefix length
- ... bytes: rest of the row (if prefix length is small enough)
- ... bytes: qualifier (or suffix depending on prefix length)
- 1-8 bytes: timestamp suffix
- 1 byte: type (only if FLAG_SAME_TYPE is not set in the flag)
- ... bytes: value (only if FLAG_SAME_VALUE is not set in the flag)"
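
For anyone unsure what the "7-bit encoding" in the list above refers to, here is a minimal Python sketch of a variable-length integer scheme: each byte carries 7 bits of the number, and the high bit says whether more bytes follow. This is an illustration only. The function names are made up, and the exact byte layout HBase writes is an assumption that I have not checked against the encoder source. It does show why a 32-bit length fits in the "1-5 bytes" given in the format.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer using 7 data bits per byte.

    Illustrative only: not necessarily the byte order HBase uses.
    """
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        low7 = n & 0x7F          # take the lowest 7 bits
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a value produced by encode_varint."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")


if __name__ == "__main__":
    for value in (0, 127, 128, 300, 2**31 - 1):
        encoded = encode_varint(value)
        print(value, len(encoded), "byte(s)")
```

Small lengths, which are the common case for short qualifiers, take a single byte instead of a fixed four.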

*DIFF*:
http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/io/encoding/DiffKeyDeltaEncoder.html

"Compress using:
- store size of common prefix
- save column family once, it is same within HFile
- use integer compression for key, value and prefix (7-bit encoding)
- use bits to avoid duplication key length, value length and type if it
  same as previous
- store in 3 bits length of timestamp field
- allow diff in timestamp instead of actual value

Format:
- 1 byte: flag
- 1-5 bytes: key length (only if FLAG_SAME_KEY_LENGTH is not set in flag)
- 1-5 bytes: value length (only if FLAG_SAME_VALUE_LENGTH is not set in
  flag)
- 1-5 bytes: prefix length
- ... bytes: rest of the row (if prefix length is small enough)
- ... bytes: qualifier (or suffix depending on prefix length)
- 1-8 bytes: timestamp or diff
- 1 byte: type (only if FLAG_SAME_TYPE is not set in the flag)
- ... bytes: value"
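
Both encoders start from the same idea, "store size of common prefix". A rough Python sketch of that idea follows, assuming keys are already sorted and treating each key as an opaque byte string. Real HBase keys have more structure (row, family, qualifier, timestamp, type), and the key strings and function names here are hypothetical.

```python
def prefix_encode(sorted_keys: list[bytes]) -> list[tuple[int, bytes]]:
    """Replace each key with (length of prefix shared with previous key, remaining suffix)."""
    encoded = []
    previous = b""
    for key in sorted_keys:
        shared = 0
        limit = min(len(previous), len(key))
        while shared < limit and previous[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))
        previous = key
    return encoded


def prefix_decode(encoded: list[tuple[int, bytes]]) -> list[bytes]:
    """Rebuild the original keys from (shared prefix length, suffix) pairs."""
    keys = []
    previous = b""
    for shared, suffix in encoded:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys


if __name__ == "__main__":
    # Hypothetical keys: many qualifiers under the same row and family.
    keys = [b"row1/cf:q0001", b"row1/cf:q0002", b"row2/cf:q0001"]
    for shared, suffix in prefix_encode(keys):
        print(shared, suffix)
```

With thousands of qualifiers per row, most of each key repeats the previous one, so only a short suffix has to be stored.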

I was reading the FAQs and there is nothing related to this topic. It
would be nice to include it in the documentation.

Lars, what do you think? It would be nice if you could write a detailed
blog post about this topic.





2013/4/3 Jean-Marc Spaggiari <je...@spaggiari.org>

> I read the JIRA already but it was not clear to me. However Cloudera's
> link is very clear. Thanks for that. Any idea what's the difference
> between DIFF and FAST_DIFF?
>
> 2013/4/3 Marcos Luis Ortiz Valmaseda <ma...@gmail.com>:
> > You can read this JIra issue for this too:
> > https://issues.apache.org/jira/browse/HBASE-4218
> >
> >
> >
> > 2013/4/3 Marcos Luis Ortiz Valmaseda <ma...@gmail.com>
> >>
> >> Regards, Jean-Marc.
> >> The best resource that I found for this is a great blog post called
> Apache
> >> HBase I/O - HFile  from Matteo Bertozzi in Cloudera´s blog. Here´s the
> link:
> >> http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
> >>
> >>
> >>
> >>
> >> 2013/4/3 Jean-Marc Spaggiari <je...@spaggiari.org>
> >>>
> >>> Is there any documentation anywhere regarding the differences between
> >>> PREFIX, DIFF and FAST_DIFF?
> >>>
> >>> 2013/4/3 prakash kadel <pr...@gmail.com>:
> >>> > thank you very much.
> >>> > i will try with snappy compression with data_block_encoding
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > On Wed, Apr 3, 2013 at 11:21 PM, Kevin O'dell
> >>> > <ke...@cloudera.com>wrote:
> >>> >
> >>> >> Prakash,
> >>> >>
> >>> >>   Yes, I would recommend Snappy Compression.
> >>> >>
> >>> >> On Wed, Apr 3, 2013 at 10:18 AM, Prakash Kadel
> >>> >> <pr...@gmail.com>
> >>> >> wrote:
> >>> >> > Thanks,
> >>> >> >     is there any specific compression that is recommended of the
> use
> >>> >> case i have?
> >>> >> >    Since my values are all null will compression help?
> >>> >> >
> >>> >> >  I am thinking of using prefix data_block_encoding..
> >>> >> > Sincerely,
> >>> >> > Prakash Kadel
> >>> >> >
> >>> >> >
> >>> >> > On Apr 3, 2013, at 10:55 PM, Ted Yu wrote:
> >>> >> >
> >>> >> >> You should use data block encoding (in 0.94.x releases only). It
> is
> >>> >> helpful
> >>> >> >> for reads.
> >>> >> >>
> >>> >> >> You can also enable compression.
> >>> >> >>
> >>> >> >> Cheers
> >>> >> >>
> >>> >> >>
> >>> >> >> On Wed, Apr 3, 2013 at 6:42 AM, Prakash Kadel
> >>> >> >> <prakash.kadel@gmail.com
> >>> >> >wrote:
> >>> >> >>
> >>> >> >>> Hello,
> >>> >> >>>    I have a question.
> >>> >> >>>    I have a table where i store data in the column
> qualifiers(the
> >>> >> values
> >>> >> >>> itself are null).
> >>> >> >>>    I just have 1 column family.
> >>> >> >>>   The number of columns per row is variable (1~ few thousands)
> >>> >> >>>
> >>> >> >>> Currently i don't use compression or the data_block_encoding.
> >>> >> >>>
> >>> >> >>> Should i?
> >>> >> >>> I want to have faster reads.
> >>> >> >>>
> >>> >> >>> Please suggest.
> >>> >> >>>
> >>> >> >>>
> >>> >> >>> Sincerely,
> >>> >> >>> Prakash Kadel
> >>> >> >
> >>> >>
> >>> >>
> >>> >>
> >>> >> --
> >>> >> Kevin O'Dell
> >>> >> Systems Engineer, Cloudera
> >>> >>
> >>
> >>
> >>
> >>
> >> --
> >> Marcos Ortiz Valmaseda,
> >> Data-Driven Product Manager at PDVSA
> >> Blog: http://dataddict.wordpress.com/
> >> LinkedIn: http://www.linkedin.com/in/marcosluis2186
> >> Twitter: @marcosluis2186
> >
> >
> >
> >
> > --
> > Marcos Ortiz Valmaseda,
> > Data-Driven Product Manager at PDVSA
> > Blog: http://dataddict.wordpress.com/
> > LinkedIn: http://www.linkedin.com/in/marcosluis2186
> > Twitter: @marcosluis2186
>




Re: should i use compression?

Posted by Marcos Luis Ortiz Valmaseda <ma...@gmail.com>.
You can also read this JIRA issue on the topic:
https://issues.apache.org/jira/browse/HBASE-4218




Re: should i use compression?

Posted by Marcos Luis Ortiz Valmaseda <ma...@gmail.com>.
Regards, Jean-Marc.
The best resource that I found for this is a great blog post called "Apache
HBase I/O - HFile" by Matteo Bertozzi on Cloudera's blog. Here's the link:
http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/





Re: should i use compression?

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Is there any documentation anywhere regarding the differences between
PREFIX, DIFF and FAST_DIFF?


Re: should i use compression?

Posted by prakash kadel <pr...@gmail.com>.
Thank you very much.
I will try Snappy compression with data_block_encoding.





Re: should i use compression?

Posted by Kevin O'dell <ke...@cloudera.com>.
Prakash,

  Yes, I would recommend Snappy Compression.




-- 
Kevin O'Dell
Systems Engineer, Cloudera

Re: should i use compression?

Posted by Ted Yu <yu...@gmail.com>.
Another commonly used encoding is FAST_DIFF

Cheers



Re: should i use compression?

Posted by Prakash Kadel <pr...@gmail.com>.
Thanks,
    Is there any specific compression that is recommended for the use case I have?
    Since my values are all null, will compression help?

I am thinking of using PREFIX data_block_encoding.
Sincerely,
Prakash Kadel
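
On the question of whether compression helps when all values are null: the keys themselves (row, family, qualifier) are still written for every cell, and they tend to be repetitive. The Python sketch below gives a rough feel for that. It is only an approximation: it uses zlib from the standard library as a stand-in because Snappy is not in the standard library, the key layout is hypothetical, and actual ratios on real HFiles will differ.

```python
import zlib


def build_keys(rows: int = 100, qualifiers_per_row: int = 50) -> bytes:
    """Concatenate hypothetical row/family/qualifier keys; every value is empty."""
    keys = []
    for r in range(rows):
        for q in range(qualifiers_per_row):
            keys.append(f"user{r:05d}/cf:event_{q:04d}".encode("ascii"))
    return b"\n".join(keys)


raw = build_keys()
packed = zlib.compress(raw)  # zlib is a stand-in here, not Snappy

if __name__ == "__main__":
    print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes")
```

Even with empty values, the repeated key material gives a general-purpose compressor plenty to work with. Measuring on a copy of the real table is the only reliable way to know the effect on read latency.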




Re: should i use compression?

Posted by Ted Yu <yu...@gmail.com>.
You should use data block encoding (in 0.94.x releases only). It is helpful
for reads.

You can also enable compression.

Cheers

