Posted to dev@carbondata.apache.org by Jacky Li <ja...@qq.com> on 2019/11/28 16:32:57 UTC

Propose feature change in CarbonData 2.0

Hi Community,

As we are moving to CarbonData 2.0, in order to keep the project moving
forward fast and stable, it is necessary to do some refactoring and clean up
obsolete features before introducing new features.

To do that, I propose making the following features obsolete and unsupported
from 2.0 onward. In my opinion, these features are seldom used.

1. Global dictionary
Since Spark 2.x, aggregation has become much faster thanks to Project
Tungsten, so Global Dictionary is not very useful anymore, yet it makes data
loading slow and needs a very complex SQL plan transformation.

2. Bucket
The Bucket feature of carbon is intended to improve join performance, but the
actual improvement is very limited.

3. Carbon custom partition
Since we now have Hive standard partition, the old custom partition is not
very useful.

4. BATCH_SORT
I have not seen anyone use this feature

5. Page level inverse index
This is arguable. I understand that in a very specific scenario (when there
are many columns in an IN filter) it has a benefit, but it slows down data
loading and makes the encoding code very complex.

5. old preaggregate and time series datamap implementation
As we have introduced MV, these two features can be dropped. And we can
follow standard SQL and have a new syntax to create an MV: CREATE
MATERIALIZED VIEW

6. Lucene datamap
This feature is not well implemented, as it reads too much of the index into
memory, thus creating memory problems in most cases.

7. STORED BY 
We should follow either Hive syntax (STORED AS) or SparkSQL syntax (USING).


And there is some internal refactoring we can do:
1. Unify dimension and measure

2. Keep the column order the same as schema order

3. Spark integration refactoring based on Spark extension interface

4. Store optimization PR2729


The aim of this proposal is to make the CarbonData code cleaner and reduce
the community's maintenance effort.
What do you think of it?


Regards,
Jacky





--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: Propose feature change in CarbonData 2.0

Posted by xuchuanyin <xu...@apache.org>.
Glad to see you making this proposal! The features you mentioned are really
not popular; even heavy users have neither tried them nor know how to use
them.

For 1/2/3/4/5.1/5.2/7, we can remove these features along with their code.
But if we consider compatibility, the query processing will still be complex.
How can we solve this problem?

For 6, we may need to optimize it. If the problem lies in reading indices
into memory, we can find another way to fix it, such as slicing the index or
some other approach.

As for the refactoring points, I'm not sure about the 1st and 2nd points.
As far as I know, in data loading we group dimensions and measures while
writing sort_temp_files; this can enhance the loading performance since it
reduces the file size.




Re: Propose feature change in CarbonData 2.0

Posted by ravipesala <ra...@gmail.com>.
Hi,

Thank you for proposing. Please check my comments below.

1. Global dictionary: It was one of the prime features when the project was
initially released to Apache. Even though Spark has introduced Tungsten, it
still has its benefits for compression, filtering and aggregation queries.
After the introduction of the local dictionary, this was partially addressed
for compression and filtering (though it cannot match the performance of a
global dictionary). The only major drawback here is the data load performance.
In some cases, like a MOLAP cube (build once), it might still be useful.
Vote: 0

2. Bucket: It is a very useful feature if we use it. If we are planning to
remove it, we had better find an alternative to this feature first. Since
this feature is available in spark+parquet, it would be helpful for users who
want to migrate to carbon. As far as I know, this feature was never
productized and is still experimental. So if we are planning to keep it, we
had better productize it. Vote : -1

3. Carbon custom partition: Vote : +1

4. Batch Sort : Vote : +1

5. Page level inverse index : Storing these indexes makes the store size
bigger. It is really helpful in the case of multiple IN filters, but the
benefit is overshadowed by the IO and CPU cost due to its size. Vote : +1

5.  old preaggregate and time series datamap implementation : Vote : +1 
(remove pre-aggregate)

6. Lucene DataMap: It is a helpful feature but I guess it had performance
issues due to bad integration. It would be better if we can fix these issues
instead of removing it. Moreover, it is a separate module so there would not
be any code maintenance problem. Vote : -1

7. STORED BY : Vote : +1

Refactoring points:
1 & 2 : I think at this point in time it would be a massive refactoring with
very little outcome. So better not to do it. Vote : -1

3 & 4 : Vote : +1



Regards,
Ravindra.




Re: Propose feature change in CarbonData 2.0

Posted by David CaiQiang <da...@gmail.com>.
+1



-----
Best Regards
David Cai

Re: Propose feature change in CarbonData 2.0

Posted by Akash Nilugal <ak...@gmail.com>.
Hi,

1. Global Dict - 0
2. Bucket and 3. Custom Partition +1
4. Batch sort  +1
5 page level 0
6. preaggregate and old timeseries + 1
7. stored by +1

Store optimization +1

I also suggest the refactoring below:
[DISCUSSION] Segment file improvement for Update and delete case.
You can find the problem statement in the discussion thread below.
https://lists.apache.org/list.html?dev@carbondata.apache.org:2019-09

Regards,
Akash


Re: Propose feature change in CarbonData 2.0

Posted by 恩爸 <44...@qq.com>.
Hi,
  Thank you for proposing. My votes are below:


  1, 3, 4, 5.1, 5.2, 7:  +1
  2:   0
  6:  -1, but it should be optimized.


  And there is some internal refactoring we can do:
  1. Unify dimension and measure   +1

  2. Keep the column order the same as schema order   0

  3. Spark integration refactoring based on Spark extension interface   +1

  4. Store optimization PR2729   +1

  In my opinion, we can also do some further refactoring:
  1. Many places use string[] to store data during data loading; these can be
  replaced with InternalRow objects to save memory.
  2. Remove the 'streaming' property and eliminate the difference between
  streaming and batch tables, so that users can insert data into a table in
  either batch or streaming mode.







Re: Propose feature change in CarbonData 2.0

Posted by Jacky Li <ja...@qq.com>.
Hi Benoit,

Thanks for pointing this out.

Yes, it will be like the carbon 1 preaggregate datamap. The MV implementation
in CarbonData will check whether the MV is an aggregation on a single table;
if yes, it will be "always in sync" (a load to the MV table will be triggered
automatically after loading the main table).

But if the MV involves a multi-table join, it needs to be rebuilt manually.
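
A minimal sketch of this rule, for illustration only. It is not the actual
CarbonData code: MVDefinition, refresh_mode and the field names are
hypothetical, and treating a single-table MV without aggregation as "manual"
is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MVDefinition:
    """Hypothetical description of an MV (not a CarbonData class)."""
    name: str
    source_tables: List[str]  # tables referenced by the MV query
    has_aggregation: bool     # True if the MV query aggregates


def refresh_mode(mv: MVDefinition) -> str:
    """Return "auto" for an aggregation on a single table, else "manual"."""
    single_table = len(set(mv.source_tables)) == 1
    if single_table and mv.has_aggregation:
        # "always in sync": loading the main table also triggers a load
        # to the MV table
        return "auto"
    # e.g. a multi-table join: the MV has to be rebuilt manually
    # (assumption: any other case is also treated as manual here)
    return "manual"
```

For example, an MV aggregating a single sales table would come out as "auto",
while an MV joining sales with customer would come out as "manual".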

Regards,
Jacky




Re: Propose feature change in CarbonData 2.0

Posted by Benoit Rousseau <b....@brci.fr>.
Hi,

Regarding 5:
In carbon 2.0, will MVs be "always in sync" like the carbon 1 pre aggregate datamap, or will they require an action to be put back online at each update?

Vertica, Clickhouse, Vector and some other first-class OLAP engines offer "always in sync" pre aggregate views, which are very convenient.

Thanks,
Benoit 


> On 29 Nov 2019, at 13:19, xubo245 <60...@qq.com> wrote:
>
> In my opinion, carbon 2.0 is the right time to clean up some unused features
> to make the code cleaner and reduce maintenance effort.
>
> +1: agree, -1: disagree, 0: other.
>
> +1: 1, 2, 3, 5, 5, 7
> 0: 4
> -1: 6, but it should be optimized.





Re: Propose feature change in CarbonData 2.0

Posted by xubo245 <60...@qq.com>.
In my opinion, carbon 2.0 is the right time to clean up some unused features
to make the code cleaner and reduce maintenance effort.

+1: agree, -1: disagree, 0: other.

+1: 1, 2, 3, 5, 5, 7
0: 4
-1: 6, but it should be optimized.





Re: Propose feature change in CarbonData 2.0

Posted by Kumar Vishal <ku...@gmail.com>.
Please find my comments inline.
Bucketing +1
Carbon custom partition +1
BATCH_SORT +1
old preaggregate and time series datamap implementation +1
STORED BY +1

Global dictionary:
Data loading with global dictionary is slow, but aggregation, filtering and
compression are better than with any other type, whether storing raw values
or using a local dictionary. So it might be a useful feature.
Vote: 0

Page Level Inverted Index: -1
If users know the columns on which they are going to use an IN filter, it is
very useful.

Lucene datamap: Performance is bad because of some code/design issues which
can be fixed. -1

And there is some internal refactoring we can do:
1. Unify dimension and measure:
     It may improve IO performance but the effort is high. 0

3. Spark integration refactoring based on Spark extension interface +1

4. Store optimization PR2729 +1

-Regards
Kumar Vishal

On Thu, Dec 5, 2019 at 3:28 PM Jacky Li <ja...@qq.com> wrote:

> Hi,
>
> Thanks for all your input, the voting summary is as below:
>
> 1. Global dictionary
> No -1
>
> 2. Bucket
> Two -1
>
> 3. Carbon custom partition
> No -1
>
> 4. BATCH_SORT
> No -1
>
> 5. Page level inverse index
> One -1
>
> 5. old preaggregate and time series datamap implementation
> No -1
>
> 6. Lucene datamap
> Five -1
>
> 7. STORED BY
> No -1
>
> So, I have created an umbrella JIRA (CARBONDATA-3603) for these items.
> Please feel free to respond if anyone is interested in working on them.
>
> Regards,
> Jacky

Re: Propose feature change in CarbonData 2.0

Posted by Jacky Li <ja...@qq.com>.
Hi,

Thanks for all your input, the voting summary is as below:

1. Global dictionary 
No -1

2. Bucket 
Two -1

3. Carbon custom partition 
No -1

4. BATCH_SORT 
No -1

5. Page level inverse index 
One -1

5. old preaggregate and time series datamap implementation 
No -1

6. Lucene datamap 
Five -1

7. STORED BY 
No -1

So, I have created an umbrella JIRA (CARBONDATA-3603) for these items.
Please feel free to respond if anyone is interested in working on them.

Regards,
Jacky







