Posted to dev@carbondata.apache.org by Indhumathi <in...@gmail.com> on 2020/01/14 08:38:41 UTC

[Discussion] Support SegmentLevel MinMax for better Pruning and less driver memory usage

Hello all,

In cloud scenarios, the index may be too big to store in the Spark driver,
since the VM may not have that much memory.
Currently in Carbon, we load all indexes into the cache on the first query.
Since the Carbon LRU cache does not support time-based expiration, indexes
are removed from the cache by the least-recently-used mechanism when the
Carbon LRU cache is full.
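
To make the eviction behaviour concrete, here is a minimal sketch of a size-bounded LRU cache with no time-based expiration. It is not Carbon's cache implementation; the class name LruIndexCache, its methods, and the byte sizes are all hypothetical and exist only for illustration.

```python
from collections import OrderedDict


class LruIndexCache:
    """Hypothetical size-bounded LRU cache: entries leave only to make room."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # segment id -> (index, size in bytes)

    def get(self, segment_id):
        if segment_id not in self._entries:
            return None
        self._entries.move_to_end(segment_id)  # mark as most recently used
        return self._entries[segment_id][0]

    def put(self, segment_id, index, size_bytes):
        # Replacing an existing entry first releases its old size.
        if segment_id in self._entries:
            self.used_bytes -= self._entries.pop(segment_id)[1]
        # Evict least recently used entries until the new one fits.
        while self._entries and self.used_bytes + size_bytes > self.capacity_bytes:
            _, (_, evicted_size) = self._entries.popitem(last=False)
            self.used_bytes -= evicted_size
        self._entries[segment_id] = (index, size_bytes)
        self.used_bytes += size_bytes

    def cached_segments(self):
        return list(self._entries)
```

With such a cache, an index that is never queried again still occupies memory until enough new entries arrive to push it out, which is why loading fewer indexes in the first place helps.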

In some scenarios, where a user's table has many segments and the user often
queries only a few of them, we do not need to load all indexes into the
cache. For filter queries, if we prune and load only the matched segments
into the cache, the driver's memory will be saved.

For this purpose, I am planning to add block minmax to the segment metadata
file, prune segments based on the segment files, and load the index only for
matched segments. As part of this, I will add a configurable carbon property
'*carbon.load.all.index.to.cache*' to allow the user to load all indexes into
the cache if needed. By default, the value will be true.
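
As a rough illustration of the pruning idea (a sketch only, not CarbonData code), the check below keeps a segment when the filter range can overlap the segment's recorded min/max. The names SegmentMinMax, may_contain and segments_to_load are hypothetical; the load_all_index_to_cache flag stands in for the proposed property and defaults to true as described above.

```python
from dataclasses import dataclass


@dataclass
class SegmentMinMax:
    segment_id: str
    min_values: dict  # column name -> minimum value in the segment
    max_values: dict  # column name -> maximum value in the segment


def may_contain(segment, column, lower, upper):
    """True if the range [lower, upper] can overlap the segment's values."""
    if column not in segment.min_values or column not in segment.max_values:
        return True  # no statistics recorded: the segment cannot be ruled out
    return (segment.min_values[column] <= upper
            and segment.max_values[column] >= lower)


def segments_to_load(segments, column, lower, upper,
                     load_all_index_to_cache=True):
    """Return the ids of segments whose indexes should be loaded."""
    if load_all_index_to_cache:
        return [s.segment_id for s in segments]  # current behaviour
    return [s.segment_id for s in segments
            if may_contain(s, column, lower, upper)]
```

An equality filter such as age = 25 is the special case lower == upper == 25. Segments without statistics are kept, so pruning can only skip segments that provably hold no matching rows.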

Currently, for each load, we write a segment metadata file, which holds
the information about the index file.
During query, we read each segment file to get the index file info and
then load all datamaps for the segment.
MinMax data will be encoded and stored in the segment file.

Any suggestions/inputs from the community are appreciated.

Thanks
Indhumathi



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [Discussion] Support SegmentLevel MinMax for better Pruning and less driver memory usage

Posted by Ajantha Bhat <aj...@gmail.com>.

+1,

Can you explain more about how you are encoding and storing min max in the
segment file? As minmax values represent user data, we cannot store them as
plain values. Storing encrypted min max will add the overhead of encrypting
and decrypting.
I suggest we convert the segment file to a thrift file to solve this. Other
suggestions are welcome.

Thanks,
Ajantha

Re: [Discussion] Support SegmentLevel MinMax for better Pruning and less driver memory usage

Posted by David CaiQiang <da...@gmail.com>.

+1



-----
Best Regards
David Cai

Re: [Discussion] Support SegmentLevel MinMax for better Pruning and less driver memory usage

Posted by Indhumathi <in...@gmail.com>.

1. Can you tell me how you are going to read the min max? I mean, are you
going to store the segment-level min max for all columns, or, since you
said block level, are you going to store it for every carbondata file?
If it is block level, the segment file size might increase when there are
many files. Can you please explain more about this?

>> Yes. I am planning to store MinMax for all columns in the segment file. I
agree that the segment file may grow when there are many files. To solve
this, I think we could store MinMax only for sort columns. We can add a
table property to control it.
What do you think?


2. How are you going to get the min max in the driver? It's obvious that you
are not planning to read the file.

>> While writing the index file, we will get the MinMax info for each block
and store it in a SegmentMinMax object. When the InsertionTaskCompletion
listener is called, we will add this SegmentMinMax info to an accumulator.
Later, while writing the segment file, we will get the MinMax info from the
accumulator, serialize it, and store it in the segment file. During query, we
will read the MinMax from the segment file, cache it, and use it for
segment-level pruning.
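
One possible reading of the accumulation step, sketched under assumptions: per-block ranges are folded into a single range per column before the segment file is written. The function name merge_min_max and the dict-of-tuples layout are hypothetical, and the sketch leaves out the serialization format, which is still open in this discussion.

```python
def merge_min_max(block_stats):
    """Fold per-block (min, max) pairs into a single range per column.

    block_stats: iterable of dicts mapping column name -> (min, max).
    """
    merged = {}
    for stats in block_stats:
        for column, (lo, hi) in stats.items():
            if column in merged:
                cur_lo, cur_hi = merged[column]
                # Widen the running range so it covers this block too.
                merged[column] = (min(cur_lo, lo), max(cur_hi, hi))
            else:
                merged[column] = (lo, hi)
    return merged
```

Because taking the min and max is associative and commutative, the blocks can be merged in any order, which is the property an accumulator-style collection relies on.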




Re: [Discussion] Support SegmentLevel MinMax for better Pruning and less driver memory usage

Posted by akashrn5 <ak...@gmail.com>.

Hi Indhumathi,

+1. It solves many memory problems and improves first-time filter queries.
I have some questions.

1. Can you tell me how you are going to read the min max? I mean, are you
going to store the segment-level min max for all columns, or, since you
said block level, are you going to store it for every carbondata file?
If it is block level, the segment file size might increase when there are
many files. Can you please explain more about this?
2. How are you going to get the min max in the driver? It's obvious that you
are not planning to read the file.

Regards,
Akash




Re: [Discussion] Support SegmentLevel MinMax for better Pruning and less driver memory usage

Posted by Jacky Li <ja...@qq.com>.

+1 
This can reduce the memory footprint in the Spark driver; it is great for ultra big data.

Regards,
Jacky
