Posted to dev@carbondata.apache.org by Ajantha Bhat <aj...@gmail.com> on 2020/03/25 05:51:29 UTC

Disable Adaptive encoding for Double and Float by default

Hi all,

I have profiled the insert-into flow using JMC with the latest code [with
the new optimized insert flow].

It seems that for a *2.5 GB* carbon-to-carbon insert, the double and float
stats collector has allocated *68.36 GB* [*25%* of the TLAB (thread-local
allocation buffer)].

[image: Screenshot from 2020-03-25 11-18-04.png]
*The problem is that for every double and float value in every row, we
call* *PrimitivePageStatsCollector.getDecimalCount()*, *which creates new
objects every time.*

So, I want to disable adaptive encoding for float and double by default.
*I will make this configurable.*
If a user has a well-sorted double or float column and wants to apply
adaptive encoding on it, they can enable it to reduce store size.
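To illustrate the allocation pattern, here is a minimal sketch of a
string-based decimal count. The method name follows the mail, but the body
is an assumption based on the "convert to string and use substring()"
description later in this thread, not the actual CarbonData implementation:

```java
// Sketch of a string-based decimal count, per the description in this
// thread; an assumption, not the exact CarbonData code.
public class DecimalCountViaString {

    // Called once per double value per row: each call allocates a new
    // String, which is what fills the TLAB during insert.
    public static int getDecimalCount(double value) {
        String strValue = String.valueOf(Math.abs(value)); // new String per call
        int dotIndex = strValue.indexOf('.');
        // Note: Double.toString always emits at least one fractional
        // digit (e.g. "3.0"), so whole numbers report a count of 1 here.
        return dotIndex < 0 ? 0 : strValue.length() - dotIndex - 1;
    }

    public static void main(String[] args) {
        System.out.println(getDecimalCount(12.345)); // 3
        System.out.println(getDecimalCount(2.5));    // 1
    }
}
```

Because this runs per value per row, the per-call String allocation is what
a TLAB profile attributes to the stats collector.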

Thanks,
Ajantha

Re: Disable Adaptive encoding for Double and Float by default

Posted by David CaiQiang <da...@gmail.com>.
I agree with Ravindra, and I can try to fix it (I have mentioned this in a
PR review comment).



-----
Best Regards
David Cai
--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: Disable Adaptive encoding for Double and Float by default

Posted by Ajantha Bhat <aj...@gmail.com>.
Hi,

*I was able to decrease the TLAB memory usage from 68 GB to 29.94 GB for
the same TPCH data* *without disabling adaptive encoding*.

*There is also about a 5% improvement in insert performance*. Please check
the PR.

https://github.com/apache/carbondata/pull/3682

Before the change:
[image: Screenshot from 2020-03-26 16-45-12]
<https://user-images.githubusercontent.com/5889404/77640947-380c0e80-6f81-11ea-97ff-f1b8942d99c6.png>

After the change:

[image: Screenshot from 2020-03-26 16-51-31]
<https://user-images.githubusercontent.com/5889404/77641533-34c55280-6f82-11ea-8a60-bfb6c8d8f52a.png>
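For contrast, one allocation-free direction is to count decimal digits
arithmetically, stopping at the precision cap beyond which adaptive
encoding is skipped anyway. This is a sketch only — an assumption for
illustration, not necessarily the change made in the PR — and the
`MAX_DECIMAL_COUNT` constant is hypothetical:

```java
// Sketch of an allocation-free decimal count; an assumption for
// illustration, not the actual change in the PR.
public class DecimalCountNoAlloc {

    // Per the thread, adaptive encoding is skipped beyond 5 decimal
    // digits, so counting can stop at the cap (hypothetical constant).
    static final int MAX_DECIMAL_COUNT = 5;

    // No String is created: multiply by 10 until the scaled value is
    // integral, or give up once the cap is exceeded.
    // Caveat: double rounding can make this differ from the string-based
    // count for some values; that is acceptable here because values over
    // the cap fall back to plain encoding anyway.
    public static int getDecimalCount(double value) {
        double abs = Math.abs(value);
        int count = 0;
        while (count <= MAX_DECIMAL_COUNT && Math.floor(abs) != abs) {
            abs *= 10;
            count++;
        }
        return count; // a result above MAX_DECIMAL_COUNT means "over the cap"
    }

    public static void main(String[] args) {
        System.out.println(getDecimalCount(2.5));  // 1
        System.out.println(getDecimalCount(0.25)); // 2
        System.out.println(getDecimalCount(3.0));  // 0
    }
}
```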

Thanks,

Ajantha


On Wed, Mar 25, 2020 at 2:51 PM Ravindra Pesala <ra...@gmail.com>
wrote:

> Hi Anantha,
>
> I think it is better to fix the problem instead of disabling the things. It
> is already observed that store size increases proportionally. If my data
> has more columns then it will be exponential.  Store size directly impacts
> the query performance in object store world. It is better to find a way to
> fix it rather than removing things.
>
> Regards,
> Ravindra.
>
> On Wed, 25 Mar 2020 at 5:04 PM, Ajantha Bhat <aj...@gmail.com>
> wrote:
>
> > Hi Ravi, please find the performance readings below.
> >
> > On TPCH 10GB data, carbon to carbon insert in on HDFS standalone cluster:
> >
> >
> > *By disabling adaptive encoding for float and double.*
> > insert is *more than 10% faster* [before 139 seconds, after this it is
> > 114 seconds] and
> > *saves 25% memory in TLAB*store size *has increased by 10% *[before 2.3
> > GB, after this it is 2.55 GB]
> >
> > Also we have below check. If data is more than 5 decimal precision. we
> > don't apply adaptive encoding for double/float.
> > So, I am not sure how much it is useful for real-world double precision
> > data.
> >
> > [image: Screenshot from 2020-03-25 14-27-07.png]
> >
> >
> > *Bottleneck is finding that decimal points from every float and double
> > value [*PrimitivePageStatsCollector.getDecimalCount(double)*] *
> > *where we convert to string and use substring().*
> >
> > so I want to disable adaptive encoding for double and float by default.
> >
> > Thanks,
> > Ajantha
> >
> > On Wed, Mar 25, 2020 at 11:37 AM Ravindra Pesala <ra...@gmail.com>
> > wrote:
> >
> >> Hi ,
> >>
> >> It increases the store size.  Can you give me performance figures with
> and
> >> without these changes.  And also provide how much store size impact if
> we
> >> disable it.
> >>
> >>
> >> Regards,
> >> Ravindra.
> >>
> >> On Wed, 25 Mar 2020 at 1:51 PM, Ajantha Bhat <aj...@gmail.com>
> >> wrote:
> >>
> >> > Hi all,
> >> >
> >> > I have done insert into flow profiling using JMC with the latest code
> >> > [with new optimized insert flow]
> >> >
> >> > It seems for *2.5GB* carbon to carbon insert, double and float stats
> >> > collector has used *68.36 GB* [*25%* of TLAB (Thread local allocation
> >> > buffer)]
> >> >
> >> > [image: Screenshot from 2020-03-25 11-18-04.png]
> >> > *The problem is for every value of double and float in every row, we
> >> call *
> >> > *PrimitivePageStatsCollector.getDecimalCount()**Which makes new
> objects
> >> > every time.*
> >> >
> >> > So, I want to disable Adaptive encoding for float and double by
> default.
> >> > *I will make this configurable.*
> >
> >
> >> > If some user has a well-sorted double or float column and wants to
> apply
> >> > adaptive encoding on that, they can enable it to reduce store size.
> >> >
> >> > Thanks,
> >> > Ajantha
> >> >
> >> --
> >> Thanks & Regards,
> >> Ravi
> >>
> > --
> Thanks & Regards,
> Ravi
>

Re: Disable Adaptive encoding for Double and Float by default

Posted by Ravindra Pesala <ra...@gmail.com>.
Hi Ajantha,

I think it is better to fix the problem instead of disabling things. It
is already observed that the store size increases proportionally. If my
data has more columns, then the increase will be exponential. Store size
directly impacts query performance in the object-store world. It is better
to find a way to fix it rather than removing things.

Regards,
Ravindra.

On Wed, 25 Mar 2020 at 5:04 PM, Ajantha Bhat <aj...@gmail.com> wrote:

> Hi Ravi, please find the performance readings below.
>
> On TPCH 10GB data, carbon to carbon insert in on HDFS standalone cluster:
>
>
> *By disabling adaptive encoding for float and double.*
> insert is *more than 10% faster* [before 139 seconds, after this it is
> 114 seconds] and
> *saves 25% memory in TLAB*store size *has increased by 10% *[before 2.3
> GB, after this it is 2.55 GB]
>
> Also we have below check. If data is more than 5 decimal precision. we
> don't apply adaptive encoding for double/float.
> So, I am not sure how much it is useful for real-world double precision
> data.
>
> [image: Screenshot from 2020-03-25 14-27-07.png]
>
>
> *Bottleneck is finding that decimal points from every float and double
> value [*PrimitivePageStatsCollector.getDecimalCount(double)*] *
> *where we convert to string and use substring().*
>
> so I want to disable adaptive encoding for double and float by default.
>
> Thanks,
> Ajantha
>
> On Wed, Mar 25, 2020 at 11:37 AM Ravindra Pesala <ra...@gmail.com>
> wrote:
>
>> Hi ,
>>
>> It increases the store size.  Can you give me performance figures with and
>> without these changes.  And also provide how much store size impact if we
>> disable it.
>>
>>
>> Regards,
>> Ravindra.
>>
>> On Wed, 25 Mar 2020 at 1:51 PM, Ajantha Bhat <aj...@gmail.com>
>> wrote:
>>
>> > Hi all,
>> >
>> > I have done insert into flow profiling using JMC with the latest code
>> > [with new optimized insert flow]
>> >
>> > It seems for *2.5GB* carbon to carbon insert, double and float stats
>> > collector has used *68.36 GB* [*25%* of TLAB (Thread local allocation
>> > buffer)]
>> >
>> > [image: Screenshot from 2020-03-25 11-18-04.png]
>> > *The problem is for every value of double and float in every row, we
>> call *
>> > *PrimitivePageStatsCollector.getDecimalCount()**Which makes new objects
>> > every time.*
>> >
>> > So, I want to disable Adaptive encoding for float and double by default.
>> > *I will make this configurable.*
>
>
>> > If some user has a well-sorted double or float column and wants to apply
>> > adaptive encoding on that, they can enable it to reduce store size.
>> >
>> > Thanks,
>> > Ajantha
>> >
>> --
>> Thanks & Regards,
>> Ravi
>>
> --
Thanks & Regards,
Ravi

Re: Disable Adaptive encoding for Double and Float by default

Posted by Ajantha Bhat <aj...@gmail.com>.
Hi Ravi, please find the performance readings below.

On TPCH 10 GB data, carbon-to-carbon insert on an HDFS standalone cluster:


*By disabling adaptive encoding for float and double:*
insert is *more than 10% faster* [139 seconds before, 114 seconds after]
and *saves 25% of TLAB memory*;
store size *has increased by 10%* [2.3 GB before, 2.55 GB after].

Also, we have the below check: if data has more than 5 digits of decimal
precision, we don't apply adaptive encoding for double/float.
So, I am not sure how useful it is for real-world double-precision data.

[image: Screenshot from 2020-03-25 14-27-07.png]


*The bottleneck is finding the decimal count for every float and double
value [*PrimitivePageStatsCollector.getDecimalCount(double)*],*
*where we convert to a string and use substring().*

So, I want to disable adaptive encoding for double and float by default.

Thanks,
Ajantha

On Wed, Mar 25, 2020 at 11:37 AM Ravindra Pesala <ra...@gmail.com>
wrote:

> Hi ,
>
> It increases the store size.  Can you give me performance figures with and
> without these changes.  And also provide how much store size impact if we
> disable it.
>
>
> Regards,
> Ravindra.
>
> On Wed, 25 Mar 2020 at 1:51 PM, Ajantha Bhat <aj...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > I have done insert into flow profiling using JMC with the latest code
> > [with new optimized insert flow]
> >
> > It seems for *2.5GB* carbon to carbon insert, double and float stats
> > collector has used *68.36 GB* [*25%* of TLAB (Thread local allocation
> > buffer)]
> >
> > [image: Screenshot from 2020-03-25 11-18-04.png]
> > *The problem is for every value of double and float in every row, we
> call *
> > *PrimitivePageStatsCollector.getDecimalCount()**Which makes new objects
> > every time.*
> >
> > So, I want to disable Adaptive encoding for float and double by default.
> > *I will make this configurable.*
> > If some user has a well-sorted double or float column and wants to apply
> > adaptive encoding on that, they can enable it to reduce store size.
> >
> > Thanks,
> > Ajantha
> >
> --
> Thanks & Regards,
> Ravi
>

Re: Disable Adaptive encoding for Double and Float by default

Posted by Ravindra Pesala <ra...@gmail.com>.
Hi,

It increases the store size. Can you give me performance figures with and
without these changes? Also, please provide the store-size impact if we
disable it.


Regards,
Ravindra.

On Wed, 25 Mar 2020 at 1:51 PM, Ajantha Bhat <aj...@gmail.com> wrote:

> Hi all,
>
> I have done insert into flow profiling using JMC with the latest code
> [with new optimized insert flow]
>
> It seems for *2.5GB* carbon to carbon insert, double and float stats
> collector has used *68.36 GB* [*25%* of TLAB (Thread local allocation
> buffer)]
>
> [image: Screenshot from 2020-03-25 11-18-04.png]
> *The problem is for every value of double and float in every row, we call *
> *PrimitivePageStatsCollector.getDecimalCount()**Which makes new objects
> every time.*
>
> So, I want to disable Adaptive encoding for float and double by default.
> *I will make this configurable.*
> If some user has a well-sorted double or float column and wants to apply
> adaptive encoding on that, they can enable it to reduce store size.
>
> Thanks,
> Ajantha
>
-- 
Thanks & Regards,
Ravi