Posted to dev@carbondata.apache.org by shardul singh <sh...@gmail.com> on 2018/10/12 10:49:55 UTC

[1.5.2] Gzip Compression Support

Hi community,
Currently carbon supports the SNAPPY and ZSTD codecs. I propose adding Gzip
to the compression codecs offered by carbon.
Some benefits of having a Gzip compression codec are:

   1. Gzip offers reduced file size compared to other codecs like Snappy,
   but at the cost of processing speed.
   2. Gzip is suitable for users who have cold data, i.e. data which is
   stored permanently and will be queried rarely.
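
For illustration, a minimal byte[]-in/byte[]-out Gzip helper based on
java.util.zip might look like the sketch below (helper and class names are
mine for illustration, not the actual carbon Compressor interface):

```
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative helper only; the real codec would plug into carbon's
// compressor abstraction rather than live in a standalone class.
public class GzipExample {

  static byte[] compressByte(byte[] input) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
      gzip.write(input);
    }
    return bos.toByteArray();
  }

  static byte[] unCompressByte(byte[] compressed) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPInputStream gzip =
        new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      byte[] buffer = new byte[8192];
      int len;
      while ((len = gzip.read(buffer)) != -1) {
        bos.write(buffer, 0, len);
      }
    }
    return bos.toByteArray();
  }
}
```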

I have created the jira issue for the same.
https://issues.apache.org/jira/browse/CARBONDATA-3005 and will add the
design document there.
Any suggestions from the community are welcome.

Regards,
Shardul

Re: [1.5.2] Gzip Compression Support

Posted by manhua <ke...@qq.com>.
Regarding column compression in general, I have run into a problem with
onHeap/offHeap compression in carbon.
Snappy behaves differently from zstd, and I wonder whether the same problem
exists for gzip.
Also, how can we unify the processing of the different compressors?

## Problem ##
Recently I got stuck on a problem while looking at zstd unsafe compression
in carbon.

Since the zstd-jni 1.3.6-1 release supports
Zstd.compressUnsafe(outputAddress, outputSize, inputAddress, inputSize,
COMPRESS_LEVEL), we can enable zstd unsafe compression in
UnsafeFixLengthColumnPage.java.

However, the query result is wrong for columns that use
UnsafeFixLengthColumnPage.


## Analysis ##
I found that the root cause is LITTLE_ENDIAN vs. BIG_ENDIAN byte order.

For onHeap/safe loading, the zstd compressor in carbon always uses byte[],
converting from/to the different datatypes.   --- This case is fine.
For offHeap/unsafe loading, UnsafeFixLengthColumnPage calls
CarbonUnsafe.getUnsafe() to put values into memory (e.g. putShort, putInt,
putDouble...) and then does the raw compress.   --- The key point here is
that unsafe.putXXX depends on the native byte order.

Take a simplified example for zstd in carbon:
Input: int[] {1, 2}
onheap:  convert to byte[] (big-endian) 00010002 -> compressByte
offheap: putInt (native order, little-endian here) 10002000 -> rawCompress

decompression: unCompressByte -> convert to int[]

onheap/offheap only affects the compress process; carbon uses the same code
to decompress, so for the above example the decompressed result is
different.
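
To make the byte-order divergence concrete, here is a small standalone check
using plain ByteBuffer (not carbon's unsafe code; the exact byte layouts in
carbon may differ from this simplification):

```
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class EndianCheck {
  public static void main(String[] args) {
    int[] input = {1, 2};

    // Safe/onheap style: serialize with Java's default big-endian order.
    ByteBuffer big = ByteBuffer.allocate(8);             // BIG_ENDIAN by default
    for (int v : input) big.putInt(v);

    // Unsafe/offheap style: unsafe.putInt writes in the native order.
    ByteBuffer little = ByteBuffer.allocate(8).order(ByteOrder.nativeOrder());
    for (int v : input) little.putInt(v);

    System.out.println(Arrays.toString(big.array()));    // [0, 0, 0, 1, 0, 0, 0, 2]
    System.out.println(Arrays.toString(little.array())); // [1, 0, 0, 0, 2, 0, 0, 0] on little-endian

    // Decoding native-order bytes with the big-endian convention gives wrong
    // values, which is what happens when decompression assumes one fixed layout.
    ByteBuffer decode = ByteBuffer.wrap(little.array()); // BIG_ENDIAN by default
    System.out.println(decode.getInt(0));                // 16777216, not 1
  }
}
```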

## What about Snappy ##
So why can snappy deal with the unsafe case correctly?
I am not familiar with the JNI coding, but after a glance at it, I think the
difference is that snappy offers APIs for specific datatypes, like
compress(int[] var0) and uncompressIntArray(byte[] var0), and its
implementation uses GetDirectBufferAddress (is this endian related?).

Applying the same example above to snappy in carbon:
Input: int[] {1, 2}
onheap:  compressInt 10002000
offheap: putInt 10002000

decompression: uncompressIntArray

## simple code to check snappy ## 
```
import java.nio.ByteBuffer;

import org.xerial.snappy.Snappy;

public class SnappyCheck {
  public static void main(String[] args) throws Exception {
    int[] inData = new int[1];
    inData[0] = 1;
    // compress(int[]) works on the array contents directly, so the byte
    // layout follows the native order (little-endian here)
    byte[] safe_out = Snappy.compress(inData);

    // uncompress to byte[], then read with ByteBuffer's default big-endian order
    byte[] check1 = Snappy.uncompress(safe_out);
    System.out.println(ByteBuffer.wrap(check1).getInt(0));  // 16777216

    // uncompressIntArray restores the original int values
    int[] check2 = Snappy.uncompressIntArray(safe_out);
    System.out.println(check2[0]);  // 1
  }
}
```

## Note ##
My native byte order is LITTLE_ENDIAN, while the Java default (e.g. for
ByteBuffer) is BIG_ENDIAN.
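
A quick standalone check of both orders (plain JDK, nothing carbon-specific):

```
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class OrderCheck {
  public static void main(String[] args) {
    System.out.println(ByteOrder.nativeOrder());         // LITTLE_ENDIAN on x86
    System.out.println(ByteBuffer.allocate(4).order());  // BIG_ENDIAN (Java default)
  }
}
```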





-----
Regards 
Manhua
--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [1.5.2] Gzip Compression Support

Posted by Jacky Li <ja...@qq.com>.
Comment inline

> On Oct 18, 2018, at 1:49 PM, shardul singh <sh...@gmail.com> wrote:
> 
> Hi,
> 1. No, it doesn't support uncompressShort/Int; a short/int array needs to be
> typecast to a byte array and then passed for compression. For uncompress we
> get the result as a byte array that needs to be typecast back to a short/int
> array depending on the requirement.

In PR2728, xuchuanyin modified the compress/uncompress interface to keep only
compressByte, and modified the ColumnPage to use a byte array instead of
primitive data arrays. If this helps you simplify the Gzip PR, we should work
on PR2728 and merge it. What do you think?
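
For illustration, a byte[]-only compressor interface in that spirit might look
like the following (names are illustrative, not necessarily the actual code in
PR2728):

```
// Illustrative sketch only; the real interface in PR2728 may differ.
public interface ByteCompressor {
  String getName();
  byte[] compressByte(byte[] unCompInput);
  byte[] unCompressByte(byte[] compInput);
}
```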

> 2. No, it doesn't need the uncompressed size.
> 3. Yes, a data copy is required during uncompression to avoid the compressed
> data getting modified. It is also required if the offset of the data is not 0.

Please check whether Gzip offers an uncompression method that accepts a
ByteBuffer; maybe we can move the position of the ByteBuffer so that Gzip can
uncompress starting from the position we give? I remember ZSTD supports
something like this.
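
For reference, java.util.zip's GZIPInputStream works on an InputStream rather
than a ByteBuffer, but for a heap-backed buffer the position can still be
honored by wrapping the backing array, roughly like this (a sketch, not carbon
code):

```
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.zip.GZIPInputStream;

public class GzipOffsetExample {
  // Uncompress gzip data starting from the buffer's current position.
  // Works for heap buffers only; direct buffers have no backing array.
  static byte[] unCompressFrom(ByteBuffer compressed) throws IOException {
    ByteArrayInputStream in = new ByteArrayInputStream(
        compressed.array(),
        compressed.arrayOffset() + compressed.position(),
        compressed.remaining());
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPInputStream gzip = new GZIPInputStream(in)) {
      byte[] buffer = new byte[8192];
      int len;
      while ((len = gzip.read(buffer)) != -1) {
        out.write(buffer, 0, len);
      }
    }
    return out.toByteArray();
  }
}
```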

> 
> Regards,
> Shardul
> 
> On Thu, Oct 18, 2018 at 9:09 AM Jacky Li <ja...@qq.com> wrote:
> 
>> +1
>> 
>> I have some questions:
>> 1. Other than uncompressByteArray, does Gzip offer uncompressShortArray or
>> uncompressIntArray?
>> 2. Does Gzip need the uncompressed size to allocate the target array before
>> uncompressing?
>> 3. Does your solution require data copy?
>> 
>> Regards,
>> Jacky
>> 
>>> On Oct 12, 2018, at 6:49 PM, shardul singh <sh...@gmail.com> wrote:
>>> 
>>> Hi community,
>>> Currently carbon supports the SNAPPY and ZSTD codecs. I propose adding Gzip
>>> to the compression codecs offered by carbon.
>>> Some benefits of having a Gzip compression codec are:
>>> 
>>>  1. Gzip offers reduced file size compared to other codecs like Snappy,
>>>  but at the cost of processing speed.
>>>  2. Gzip is suitable for users who have cold data, i.e. data which is
>>>  stored permanently and will be queried rarely.
>>> 
>>> I have created the jira issue for the same.
>>> https://issues.apache.org/jira/browse/CARBONDATA-3005 and will add the
>>> design document there.
>>> Any suggestions from the community are welcome.
>>> 
>>> Regards,
>>> Shardul
>>> 
>> 
>> 
>> 
>> 
> 


Re: [1.5.2] Gzip Compression Support

Posted by shardul singh <sh...@gmail.com>.
Hi,
1. No, it doesn't support uncompressShort/Int; a short/int array needs to be
typecast to a byte array and then passed for compression (see the sketch
below). For uncompress we get the result as a byte array that needs to be
typecast back to a short/int array depending on the requirement.
2. No, it doesn't need the uncompressed size.
3. Yes, a data copy is required during uncompression to avoid the compressed
data getting modified. It is also required if the offset of the data is not 0.
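
To illustrate the typecast step, the sketch below uses one explicit byte order
on both sides so the round trip stays consistent (illustrative code, not the
actual implementation):

```
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class IntByteConversion {
  // Convert int[] to byte[] with an explicit, fixed byte order before
  // handing the bytes to the byte[]-based Gzip compress.
  static byte[] toBytes(int[] values) {
    ByteBuffer buf = ByteBuffer.allocate(values.length * 4)
        .order(ByteOrder.LITTLE_ENDIAN);
    for (int v : values) buf.putInt(v);
    return buf.array();
  }

  // Convert the uncompressed byte[] back to int[] using the same byte order.
  static int[] toInts(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    int[] values = new int[bytes.length / 4];
    for (int i = 0; i < values.length; i++) {
      values[i] = buf.getInt();
    }
    return values;
  }
}
```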

Regards,
Shardul

On Thu, Oct 18, 2018 at 9:09 AM Jacky Li <ja...@qq.com> wrote:

> +1
>
> I have some questions:
> 1. Other than uncompressByteArray, does Gzip offer uncompressShortArray or
> uncompressIntArray?
> 2. Does Gzip need the uncompressed size to allocate the target array before
> uncompressing?
> 3. Does your solution require data copy?
>
> Regards,
> Jacky
>
> > On Oct 12, 2018, at 6:49 PM, shardul singh <sh...@gmail.com> wrote:
> >
> > Hi community,
> > Currently carbon supports the SNAPPY and ZSTD codecs. I propose adding Gzip
> > to the compression codecs offered by carbon.
> > Some benefits of having a Gzip compression codec are:
> >
> >   1. Gzip offers reduced file size compared to other codecs like Snappy,
> >   but at the cost of processing speed.
> >   2. Gzip is suitable for users who have cold data, i.e. data which is
> >   stored permanently and will be queried rarely.
> >
> > I have created the jira issue for the same.
> > https://issues.apache.org/jira/browse/CARBONDATA-3005 and will add the
> > design document there.
> > Any suggestions from the community are welcome.
> >
> > Regards,
> > Shardul
> >
>
>
>
>

Re: [1.5.2] Gzip Compression Support

Posted by Jacky Li <ja...@qq.com>.
+1

I have some questions:
1. Other than uncompressByteArray, does Gzip offer uncompressShortArray or uncompressIntArray?
2. Does Gzip need the uncompressed size to allocate the target array before uncompressing?
3. Does your solution require data copy?

Regards,
Jacky

> On Oct 12, 2018, at 6:49 PM, shardul singh <sh...@gmail.com> wrote:
> 
> Hi community,
> Currently carbon supports the SNAPPY and ZSTD codecs. I propose adding Gzip
> to the compression codecs offered by carbon.
> Some benefits of having a Gzip compression codec are:
> 
>   1. Gzip offers reduced file size compared to other codecs like Snappy,
>   but at the cost of processing speed.
>   2. Gzip is suitable for users who have cold data, i.e. data which is
>   stored permanently and will be queried rarely.
> 
> I have created the jira issue for the same.
> https://issues.apache.org/jira/browse/CARBONDATA-3005 and will add the
> design document there.
> Any suggestions from the community are welcome.
> 
> Regards,
> Shardul
>