Posted to dev@carbondata.apache.org by Yong Zhang <ja...@hotmail.com> on 2017/03/08 18:31:26 UTC

Question related to lazy decoding optimization

Hi,


I watched the "Apache CarbonData: An Indexed Columnar File Format for Interactive Query" session by Jacky Li/Jihong Ma from Spark Summit 2017. The video is here: https://www.youtube.com/watch?v=lhsAg2H_GXc.




Starting at 23:10, the speaker talks about the lazy decoding optimization, and the example given in the talk is the following:

"select c3, sum(c2) from t1 group by c3". He explains that c3 can be aggregated directly on the encoded value (an integer, say, if a String column c3 is dictionary-encoded as an int). I assume this is in fact done inside the Spark execution engine itself, as the speaker described.


But I am really not sure I understand how this is possible, especially in Spark. If CarbonData were the storage format for a framework running on a single box, I could imagine this and understand the value it brings. But for a distributed execution engine like Spark, the data will come from multiple hosts, and Spark has to deserialize the data for grouping/aggregating (c3 in this case). Even if Spark somehow delegates this to the underlying storage engine, how will CarbonData make sure that all values are encoded consistently across the whole table? Won't it just encode consistently per file? Encoding globally seems too expensive, but without it I don't see how this lazy decoding can work.
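
And this is the conflict I am worried about if each file built its own dictionary (again only a made-up sketch):

// Two files encode the same values with different, file-local keys
val dictFileA = Map(1 -> "shanghai", 2 -> "beijing")   // written by host A
val dictFileB = Map(1 -> "beijing",  2 -> "shanghai")  // written by host B

// Partial aggregates keyed by the encoded value, one per file/host
val partialA = Map(1 -> 10L, 2 -> 5L)   // really means shanghai -> 10, beijing -> 5
val partialB = Map(1 -> 3L,  2 -> 8L)   // really means beijing  -> 3, shanghai -> 8

// Merging by the encoded key without decoding silently mixes different c3 values
val wrong = (partialA.toSeq ++ partialB.toSeq)
  .groupBy(_._1).map { case (key, vs) => key -> vs.map(_._2).sum }
// wrong: Map(1 -> 13, 2 -> 13) -- key 1 combined shanghai (host A) with beijing (host B)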


I have just started researching this project, so maybe there is something under the hood that I don't understand.


Thanks


Yong

Re: Question related to lazy decoding optimization

Posted by Jacky Li <ja...@qq.com>.
Hi Yong Zhang,

Welcome to CarbonData. Yes, as Ravindra mentioned, CarbonData currently supports lazy decoding by generating a global dictionary while loading. The global dictionary is incrementally appended across multiple data loads.

To avoid a costly two-scan approach, in this version the community has added a single-pass loading option that generates the global dictionary while the CarbonData files are being written.
SQL syntax:
LOAD DATA LOCAL INPATH '$testData' INTO TABLE table1 OPTIONS ('SINGLE_PASS'='TRUE')

It launches a Netty server in the driver that acts as a dictionary server, and all executors communicate with it to generate the dictionary.
We encourage using the global dictionary for low-cardinality columns and enabling this option starting from the second data load: the first data load generates most of the dictionary, so from the second load onwards only a few new dictionary keys need to be generated.
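
A rough sketch of how the executor side can work with such a dictionary server (the API below is made up for illustration, not the actual CarbonData classes):

import scala.collection.concurrent.TrieMap

// Hypothetical interface to the driver-side dictionary server
trait DictionaryService {
  def getOrAssignKey(column: String, value: String): Int
}

// Executor-side encoder: it checks a local cache first and only asks the
// driver-side server for values it has not seen before. From the second
// data load onwards most values already have keys, so very few round trips
// and very few new dictionary keys are needed.
class ExecutorDictionaryClient(server: DictionaryService) {
  private val cache = TrieMap.empty[(String, String), Int]

  def encode(column: String, value: String): Int =
    cache.getOrElseUpdate((column, value), server.getOrAssignKey(column, value))
}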

Regards,
Jacky





Re: Question related to lazy decoding optimization

Posted by Ravindra Pesala <ra...@gmail.com>.
Hi Yong Zhang,

Thank you for analyzing CarbonData.
Yes, lazy decoding is only possible if the dictionaries are global.
At data loading time CarbonData generates the global dictionary values.
There are two ways to generate the global dictionary values:
1. Launch a job that reads all the input data, finds the distinct values of each column and assigns dictionary values to them. Then the actual loading job starts; it simply encodes the data with the already generated dictionary values and writes it out in the CarbonData format (see the sketch after this list).
2. Launch a dictionary server/client to generate the global dictionary during the load job itself. The load consults the dictionary server to get the global dictionary values for the fields.
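
A rough sketch of approach 1 with plain Spark APIs (illustration only, not the actual CarbonData load code; the paths and the CSV layout are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("two-pass-dictionary-sketch").getOrCreate()
val input = spark.sparkContext
  .textFile("hdfs:///path/to/input")          // hypothetical input location
  .map(_.split(","))                          // columns: c1, c2, c3

// Pass 1: find the distinct c3 values and assign each one a global surrogate key
val dictionary: Map[String, Int] =
  input.map(_(2)).distinct().zipWithIndex().mapValues(_.toInt + 1).collect().toMap

// Pass 2: encode c3 with the already generated keys while writing the data out
val dictBroadcast = spark.sparkContext.broadcast(dictionary)
val encoded = input.map(cols => (cols(0), cols(1).toLong, dictBroadcast.value(cols(2))))
encoded.saveAsObjectFile("hdfs:///path/to/output")   // stand-in for the CarbonData writer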

Yes, compared to a local dictionary this is a little more expensive, but with this approach we get better compression and better performance through lazy decoding.



Regards,
Ravindra.




-- 
Thanks & Regards,
Ravi