Posted to dev@carbondata.apache.org by Jacky Li <ja...@qq.com> on 2020/02/07 02:40:00 UTC

Re: Discussion: change default compressor to ZSTD

Hi Ajantha,


Yes, the decoder will use the compressorName stored in ChunkCompressionMeta from the file header,
but I think it is better to put it in the file name so that users can see the compressor from the shell without launching an engine to read it.


In Spark, the file name written for Parquet/ORC is: part-00115-e2758995-4b10-4bd2-bf15-b4c176e587fe-c000.snappy.orc


In PR3606, I will handle the compatibility.


Regards,
Jacky
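
A minimal sketch of the file-naming idea in this reply, written in Python for illustration only (CarbonData itself is Java). The function name `compressor_from_file_name`, the set of known compressor names, and the fallback to snappy for names without a compressor token are assumptions made for this example; they are not taken from PR3606.

```python
# Hypothetical set of compressor names; the thread only mentions snappy and zstd.
KNOWN_COMPRESSORS = {"snappy", "zstd"}

# Assumption: files written before the rename carry no compressor token in
# their name, and snappy is the default compressor at the time of this thread.
LEGACY_DEFAULT = "snappy"


def compressor_from_file_name(file_name: str,
                              legacy_default: str = LEGACY_DEFAULT) -> str:
    """Return the compressor encoded in a *.carbondata file name.

    Falls back to `legacy_default` when the name has no compressor token,
    which is how old-style names stay readable in this sketch.
    """
    suffix = ".carbondata"
    if not file_name.endswith(suffix):
        raise ValueError(f"not a carbondata file: {file_name}")
    stem = file_name[: -len(suffix)]
    # The compressor, if present, is the last dot-separated token of the stem.
    _, dot, candidate = stem.rpartition(".")
    if dot and candidate in KNOWN_COMPRESSORS:
        return candidate
    return legacy_default


if __name__ == "__main__":
    for name in (
        "part-0-0_batchno0-0-0-1580982686749.carbondata",
        "part-0-0_batchno0-0-0-1580982686749.snappy.carbondata",
        "part-0-0_batchno0-0-0-1580982686749.zstd.carbondata",
    ):
        print(name, "->", compressor_from_file_name(name))
```

In this sketch the file name is only a hint for people browsing the store from a shell. As stated in the reply, the decoder still relies on the compressorName recorded in ChunkCompressionMeta, so an old-style file written with a non-default compressor would be reported here as the legacy default.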


------------------ Original Message ------------------
From: "Ajantha Bhat" <ajanthabhat@gmail.com>;
Sent: Thursday, February 6, 2020, 11:51 PM
To: "dev" <dev@carbondata.apache.org>;

Subject: Re: Discussion: change default compressor to ZSTD



Hi,

33% is a huge reduction in store size. If there is a negligible difference in
load and query time, we should definitely go for it.

And does the user really need to know which compression is used? A change
in the file name may require compatibility handling.
The thrift *FileHeader, ChunkCompressionMeta* already stores the compressor
name, so query-time decoding can be based on this.

Thanks,
Ajantha


On Thu, Feb 6, 2020 at 4:27 PM Jacky Li <jacky.likun@qq.com> wrote:

> Hi,
>
>
> I compared the snappy and zstd compressors using TPCH for carbondata.
>
>
> For the TPCH lineitem table:
>
>               carbon-zstd   carbon-snappy
> loading (s)   53            51
> size          795MB         1.2GB
>
> TPCH-query:
>
>        carbon-zstd   carbon-snappy
> Q1     4.289         8.29
> Q2     12.609        12.986
> Q3     14.902        14.458
> Q4     6.276         5.954
> Q5     23.147        21.946
> Q6     1.12          0.945
> Q7     23.017        28.007
> Q8     14.554        15.077
> Q9     28.472        27.473
> Q10    24.067        24.682
> Q11    3.321         3.79
> Q12    5.311         5.185
> Q13    14.08         11.84
> Q14    2.262         2.087
> Q15    5.496         4.772
> Q16    29.919        29.833
> Q17    7.018         7.057
> Q18    17.367        17.795
> Q19    2.931         2.865
> Q20    11.347        10.937
> Q21    26.416        28.414
> Q22    5.923         6.311
> sum    283.844       290.704
>
>
> As you can see, after using zstd, the table size is reduced by 33% compared
> to snappy, and the difference in data loading and query time is negligible.
> So I suggest changing the default compressor in carbondata from snappy to zstd.
>
>
> To change the default compressor, we need to:
> 1. Append the compressor name to the carbondata file name, so that users
> can tell from the file name which compressor is used.
> For example, the file name will change from
>     part-0-0_batchno0-0-0-1580982686749.carbondata
> to  part-0-0_batchno0-0-0-1580982686749.snappy.carbondata
> or  part-0-0_batchno0-0-0-1580982686749.zstd.carbondata
>
>
> 2. Change the compressor constant in the CarbonCommonConstants.java file to
> use zstd as the default compressor.
>
>
> What do you think?
>
>
> Regards,
> Jacky
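
The percentages quoted in this message can be re-derived from the table with a few lines of arithmetic. The sketch below is illustrative only: the helper `percent_reduction` is a name chosen for this example, and reading "1.2GB" as 1200MB is an assumption (the message does not say whether decimal or binary units were used).

```python
def percent_reduction(new: float, old: float) -> float:
    """Percentage by which `new` is smaller than `old`."""
    return (old - new) / old * 100.0


# Figures taken from the lineitem table and the query-time sums in the message.
zstd_size_mb = 795.0
snappy_size_mb = 1.2 * 1000  # assumption: 1.2GB read as 1200MB

zstd_query_sum = 283.844
snappy_query_sum = 290.704

zstd_load_s = 53.0
snappy_load_s = 51.0

size_cut = percent_reduction(zstd_size_mb, snappy_size_mb)
query_cut = percent_reduction(zstd_query_sum, snappy_query_sum)
# A negative reduction means zstd was slower; flip the sign to report the increase.
load_increase = -percent_reduction(zstd_load_s, snappy_load_s)

print(f"store size reduction with zstd: {size_cut:.2f}%")    # 33.75%
print(f"total query time reduction:     {query_cut:.2f}%")   # 2.36%
print(f"load time increase with zstd:   {load_increase:.2f}%")  # 3.92%
```

Under these assumptions the store shrinks by roughly a third, which agrees with the 33% stated in the message, while the summed query time and the load time differ by only a few percent on this single-table run.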

Re: Discussion: change default compressor to ZSTD

Posted by Jacky Li <ja...@qq.com>.

OK, thanks for the test.
Then for PR3606, I will only add the compressor name to the file name and will not change the default compressor to ZSTD.

Regards,
Jacky

> On Feb 20, 2020, at 12:52 PM, Ajantha Bhat <aj...@gmail.com> wrote:
> 
> Hi Jacky and Ravindra,
> 
> We have tested ZSTD vs snappy again with the latest code on a 3-node Spark
> 2.3 cluster on HDFS with TPCH 500 GB data.
> Below is the summary:
> 
> *1.  The ZSTD store is 28.8% smaller than snappy*
> *2.  Overall query time is degraded by 18.35% with ZSTD compared to snappy*
> *3.  Load time with ZSTD has a negligible degradation of 0.7% compared to
> snappy*
> 
> Based on this, I guess we cannot use ZSTD as the default due to the large
> degradation in query time.
> 
> Thanks,
> Ajantha
> 
> 
> 
> 

Re: Discussion: change default compressor to ZSTD

Posted by Ajantha Bhat <aj...@gmail.com>.

Hi Jacky and Ravindra,

We have tested ZSTD vs snappy again with the latest code on a 3-node Spark
2.3 cluster on HDFS with TPCH 500 GB data.
Below is the summary:

*1.  The ZSTD store is 28.8% smaller than snappy*
*2.  Overall query time is degraded by 18.35% with ZSTD compared to snappy*
*3.  Load time with ZSTD has a negligible degradation of 0.7% compared to
snappy*

Based on this, I guess we cannot use ZSTD as the default due to the large
degradation in query time.

Thanks,
Ajantha




On Fri, Feb 7, 2020 at 4:54 PM Ravindra Pesala <ra...@gmail.com>
wrote:

> Hi Jacky,
>
> According to the original PR
> https://github.com/apache/carbondata/pull/2628 , query performance
> decreased by 20% ~ 50% compared to snappy, so I am concerned about the
> performance. It would be better to produce a proper TPCH performance report
> on the regular cluster, as we do for every version, and decide based on that.
>
> Regards,
> Ravindra.
>
> --
> Thanks & Regards,
> Ravi
>

Re: Discussion: change default compressor to ZSTD

Posted by Ravindra Pesala <ra...@gmail.com>.

Hi Jacky,

According to the original PR
https://github.com/apache/carbondata/pull/2628 , query performance
decreased by 20% ~ 50% compared to snappy, so I am concerned about the
performance. It would be better to produce a proper TPCH performance report
on the regular cluster, as we do for every version, and decide based on that.

Regards,
Ravindra.


-- 
Thanks & Regards,
Ravi