Posted to dev@carbondata.apache.org by Jacky Li <ja...@qq.com> on 2016/10/01 05:31:05 UTC

Abstracting CarbonData's Index Interface

Hi community,

    Currently CarbonData has built-in index support, which is one of its key
strengths. Using the index, CarbonData can answer filter queries very fast by
pruning at the block and blocklet level. However, the index tree also consumes
memory and impacts first-query time, because the index has to be loaded from
the file footers into memory. Moreover, in a multi-tenant environment,
multiple applications may access the data files simultaneously, which further
exacerbates this resource consumption.
    So, I want to propose and discuss a solution with you to solve this
problem and to create an interface abstraction for CarbonData's future
evolution.
    I think the final result of this work should achieve at least two
goals:
    
Goal 1: Users can choose where to store the index data; it can be stored in
the processing framework's memory space (e.g. in Spark driver memory) or in
another service outside of the processing framework (e.g. an independent
database service).

Goal 2: Developers can add more indices of their choice to CarbonData files.
Besides the B+ tree on the multi-dimensional key that CarbonData currently
supports, developers are free to add other indexing technologies to make
certain workloads faster. These new indices should be added in a pluggable
way.

     In order to achieve these goals, an abstraction needs to be created for
the CarbonData project, including:

- Segment: each segment represents one load of data and is tied to the
indices created with that load.

- Index: an index is created when its segment is created, and is leveraged
when CarbonInputFormat's getSplit is called, to filter out the required
blocks or even blocklets.

- CarbonInputFormat: there may be any number of indices created for the data
files; when querying them, the InputFormat should know how to access these
indices, and initialize or load them if required.

    Obviously, this work should be separated into different tasks and
implemented gradually. But first of all, let's discuss the goals and the
proposed approach. What is your idea?
 
Regards,
Jacky





--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Abstracting-CarbonData-s-Index-Interface-tp1587.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.

RE: Abstracting CarbonData's Index Interface

Posted by Jihong Ma <Ji...@huawei.com>.
We are all on the same page w.r.t. the InputFormat interface performing predicate/column-projection pushdown with the help of the index at the FILE LEVEL, specifically for achieving Goal 2.

For achieving Goal 1, realizing query predicate/column pruning at the TABLE LEVEL by leveraging an index (of any form, no matter where it is stored) should happen at the query planning phase and be exposed to the optimizer, so it can pick the right plan based on cost (whether to leverage an index, and which index to pick).

Jenny

-----Original Message-----
From: Jacky Li [mailto:jacky.likun@qq.com] 
Sent: Monday, October 03, 2016 9:15 PM
To: dev@carbondata.incubator.apache.org
Subject: Re: Abstracting CarbonData's Index Interface


> On Oct 4, 2016, at 8:01 AM, Jihong Ma <Ji...@huawei.com> wrote:
> 
> It is a great idea to open the door for a more flexible/scalable way of accessing indexes to help with query processing. If our goals are as follows:
> 
>> Goal 1: User can choose the place to store Index data, it can be stored in processing framework's memory space (like in spark driver memory) or in 
> another service outside of the processing framework (like using a independent database service)
> 
>  The current approach of leveraging the index for filter predicate/column-projection pushdown is invisible to the optimizer. Since an index can live outside of the current processing framework and be physically separated from the base data, I am inclined to make the index scan visible to the optimizer and allow it to weigh in and decide which index to pick, considering we would also like to extend support beyond the B+ tree as described in Goal 2.
> 

I think the bottom line is that the CarbonData index should be easy to integrate into different processing frameworks, and I think InputFormat is the best integration point as of now. Integrating at this level enables every processing framework to leverage CarbonData's index capability without any plugin code in the upper framework.

The next level of integration is at the optimizer level. In that case, exposing the index as an API to the optimizer is meaningful only if the optimizer is capable of evaluating the cost of using different indices. I think we can always do that later, once we have added more indices to CarbonData and come up with a neat API for the optimizer to use.

> 
>> Goal 2: Developer can add more index of his choice to CarbonData files.
> Besides B+ tree on multi-dimensional key which current CarbonData supports, developers are free to add other indexing technology to make certain workload faster. These new indices should be added in a pluggable way.
> 
>  Great to make it easily extensible; we can define a clean abstract API to unify the interface (build/insert/delete/scan, ...). Refer to PostgreSQL on this matter (thanks Qingqing for sharing the link):
>  https://www.postgresql.org/docs/8.4/static/indexam.html
> 

I think you are right, we should have a clean Index API; currently I am thinking of read and load APIs only: https://github.com/apache/incubator-carbondata/pull/208/files#diff-ec23f9a4b91d70309522eef78689be1c

Traditional databases like PostgreSQL have to deal with transactions, so their index APIs must consider transactions as well. In CarbonData, I am not sure how far we can go and how much mutability we can add to the Index API. Currently its model uses the Segment concept to do incremental loads, so an Index is immutable. If data is append-only, I think read and load of the index are enough. But if we introduce updates into CarbonData, then the index should become updatable as well; I think it is better to consider this together with the data update feature, if there is one in the future.

> 
> Jenny
> 
> -----Original Message-----
> From: Jacky Li [mailto:jacky.likun@qq.com] 
> Sent: Sunday, October 02, 2016 10:25 PM
> To: dev@carbondata.incubator.apache.org
> Subject: Re: Abstracting CarbonData's Index Interface
> 
> After a second thought regarding the index part, another option is to have a very simple Segment definition which can only list all the files it has, or a listFile method taking the QueryModel as input; implementations of Segment can be IndexSegment, MultiIndexSegment or StreamingSegment (no index). In the future, developers are free to create a MultiIndexSegment that selects indices internally. Is this option better?
> 
> Regards,
> Jacky
> 
>> On Oct 3, 2016, at 11:00 AM, Jacky Li <ja...@qq.com> wrote:
>> 
>> I am currently thinking of these abstractions:
>> 
>> - A SegmentManager is the global manager of all segments for one table. It can be used to get all segments and to manage segments during loading and compaction.
>> - A CarbonInputFormat takes a table path as input, which means it represents the whole table containing all segments. When getSplit is called, it gets all segments by calling the SegmentManager interface.
>> - Each Segment contains a list of Indexes and an IndexSelector. While CarbonData currently only has the MDK index, developers can create multiple indices for each segment in the future.
>> - An Index is an interface for filtering on blocks/blocklets, and provides this functionality only. Implementations should hide all complexity, such as deciding where to store the index.
>> - An IndexSelector is an interface for choosing which index to use based on the query predicates. The default implementation chooses the first index. An implementation can also decide not to use any index at all.
>> - A Distributor is used to map the filtered blocks/blocklets to InputSplits. Implementations can take the number of nodes and the parallelism into consideration, and can also decide to distribute tasks based on blocks or on blocklets.
>> 
>> So the main concepts are SegmentManager, Segment, Index, IndexSelector, InputFormat/OutputFormat, and Distributor.
>> 
>> There will be a default implementation of CarbonInputFormat whose getSplit will do the following: 
>> 1. get all segments by calling the SegmentManager
>> 2. for each segment, choose the index to use via the IndexSelector
>> 3. invoke the selected Index to filter out blocks/blocklets (since these are two concepts, maybe a parent class needs to be created to encapsulate them)
>> 4. distribute the filtered blocks/blocklets to InputSplits via the Distributor.
>> 
>> Regarding the input to the Index.filter interface, I have not decided whether to use the existing QueryModel or to create a new, cleaner QueryModel interface. If a new QueryModel is desired, it should contain only the filter predicates and projected columns, making it much simpler than the current QueryModel. But I see the current QueryModel is also used in compaction, so I think it is better to do this cleanup later?
>> 
>> 
>> Does this look fine to you? Any suggestion is welcome.
>> 
>> Regards,
>> Jacky
>> 
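The four-step getSplit flow in the quoted message above could be sketched as follows. This is an illustrative sketch with simplified interfaces; the real CarbonInputFormat signatures differ:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the default getSplit() flow described above:
// SegmentManager -> IndexSelector -> Index.filter -> Distributor.
public class GetSplitSketch {

    interface Index { List<String> filter(String predicate); }

    interface Segment { List<Index> indices(); }

    interface SegmentManager { List<Segment> allSegments(); }

    // Chooses which of a segment's indices to use for a given predicate.
    interface IndexSelector { Index select(Segment segment, String predicate); }

    // Maps filtered blocks/blocklets to input splits.
    interface Distributor { List<String> distribute(List<String> blocks); }

    static List<String> getSplits(SegmentManager mgr, IndexSelector selector,
                                  Distributor distributor, String predicate) {
        List<String> filtered = new ArrayList<>();
        for (Segment segment : mgr.allSegments()) {            // 1. get all segments
            Index index = selector.select(segment, predicate); // 2. choose an index
            filtered.addAll(index.filter(predicate));          // 3. filter blocks/blocklets
        }
        return distributor.distribute(filtered);               // 4. map to splits
    }

    public static void main(String[] args) {
        Index mdk = p -> List.of("blocklet-3");                // toy index result
        Segment seg = () -> List.of(mdk);
        SegmentManager mgr = () -> List.of(seg);
        IndexSelector first = (s, p) -> s.indices().get(0);    // default: first index
        Distributor oneSplitPerBlock = blocks -> blocks;
        System.out.println(getSplits(mgr, first, oneSplitPerBlock, "col1 = 5"));
    }
}
```

The default IndexSelector here simply picks the first index, matching the default behavior described in the message; a smarter selector could inspect the predicate.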
>> 
>>> On Oct 3, 2016, at 2:18 AM, Venkata Gollamudi <g....@gmail.com> wrote:
>>> 
>>> Yes Jacky, the interfaces need to be revisited.
>>> For Goal 1 and Goal 2: abstraction is required for both the Index and the Index store.
>>> Multi-column (composite) indexes also need to be considered.
>>> 
>>> Regards,
>>> Ramana


Re: Abstracting CarbonData's Index Interface

Posted by Qingqing Zhou <zh...@gmail.com>.
On Mon, Oct 3, 2016 at 9:14 PM, Jacky Li <ja...@qq.com> wrote:
> If data is append only, I think read and load of index is enough. I
> think you are right, we should have a clean Index API, currently I am
> thinking of read and load API only:

I am not a big fan of external indexes at this stage. Compared with
"database"-style support for external indexes, which involves complex,
high-performance transaction requirements, we can relax things a bit. But
still, even for loading, basic transactional guarantees like atomicity are
needed. For now, we can focus more on the internal index interface, so that
CarbonData kernel developers can introduce new indices more easily.

Regards,
Qingqing
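For the atomicity Qingqing mentions, full transactions are not required for index loading; a common pattern is to write the index to a temporary file and atomically rename it into place, so readers see either the complete index or none at all. A minimal sketch (my own illustration, not CarbonData code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch: publish an index file atomically so concurrent readers never
// observe a half-written index. Illustrative only, not CarbonData code.
public class AtomicIndexWriter {

    static void writeIndexAtomically(Path target, byte[] indexBytes) throws IOException {
        // Write to a temp file in the same directory (same filesystem),
        // then rename; ATOMIC_MOVE makes the publish step all-or-nothing.
        Path tmp = Files.createTempFile(target.getParent(), "index-", ".tmp");
        Files.write(tmp, indexBytes);
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("segments");
        Path index = dir.resolve("segment-0.carbonindex");    // hypothetical name
        writeIndexAtomically(index, new byte[] {1, 2, 3});
        System.out.println(Files.size(index)); // prints 3
    }
}
```

The temp file must live on the same filesystem as the target, otherwise the rename cannot be atomic and `AtomicMoveNotSupportedException` is thrown.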

Re: Abstracting CarbonData's Index Interface

Posted by Jacky Li <ja...@qq.com>.
> On Oct 4, 2016, at 8:01 AM, Jihong Ma <Ji...@huawei.com> wrote:
> 
> It is a great idea to open the door for a more flexible/scalable way of accessing indexes to help with query processing. If our goals are as follows:
> 
>> Goal 1: User can choose the place to store Index data, it can be stored in processing framework's memory space (like in spark driver memory) or in 
> another service outside of the processing framework (like using a independent database service)
> 
>  The current approach of leveraging the index for filter predicate/column-projection pushdown is invisible to the optimizer. Since an index can live outside of the current processing framework and be physically separated from the base data, I am inclined to make the index scan visible to the optimizer and allow it to weigh in and decide which index to pick, considering we would also like to extend support beyond the B+ tree as described in Goal 2.
> 

I think the bottom line is that the CarbonData index should be easy to integrate into different processing frameworks, and I think InputFormat is the best integration point as of now. Integrating at this level enables every processing framework to leverage CarbonData's index capability without any plugin code in the upper framework.

The next level of integration is at the optimizer level. In that case, exposing the index as an API to the optimizer is meaningful only if the optimizer is capable of evaluating the cost of using different indices. I think we can always do that later, once we have added more indices to CarbonData and come up with a neat API for the optimizer to use.

> 
>> Goal 2: Developer can add more index of his choice to CarbonData files.
> Besides B+ tree on multi-dimensional key which current CarbonData supports, developers are free to add other indexing technology to make certain workload faster. These new indices should be added in a pluggable way.
> 
>  Great to make it easily extensible; we can define a clean abstract API to unify the interface (build/insert/delete/scan, ...). Refer to PostgreSQL on this matter (thanks Qingqing for sharing the link):
>  https://www.postgresql.org/docs/8.4/static/indexam.html
> 

I think you are right, we should have a clean Index API; currently I am thinking of read and load APIs only: https://github.com/apache/incubator-carbondata/pull/208/files#diff-ec23f9a4b91d70309522eef78689be1c

Traditional databases like PostgreSQL have to deal with transactions, so their index APIs must consider transactions as well. In CarbonData, I am not sure how far we can go and how much mutability we can add to the Index API. Currently its model uses the Segment concept to do incremental loads, so an Index is immutable. If data is append-only, I think read and load of the index are enough. But if we introduce updates into CarbonData, then the index should become updatable as well; I think it is better to consider this together with the data update feature, if there is one in the future.
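Under the append-only Segment model described above, a read/load-only index API could be as small as the following. This is a hypothetical sketch for discussion; the actual interface proposed in the PR may differ:

```java
import java.util.List;

// Hypothetical read/load-only API for immutable, per-segment indices.
// Since a segment is never updated in place, no insert/delete/update
// methods are needed; a new load simply builds a new index.
public class ImmutableIndexSketch {

    interface Index {
        // Load the index (e.g. from file footers or an external service)
        // into a queryable form; called once per segment.
        void load();

        // Read path: prune to the blocks/blocklets matching the predicate.
        List<String> filter(String predicate);
    }

    // A toy in-memory index over fixed block names, to show the contract.
    static class ToyIndex implements Index {
        private List<String> blocks;

        public void load() { blocks = List.of("block-0", "block-1"); }

        public List<String> filter(String predicate) {
            // A real index would evaluate min/max or key ranges here;
            // the toy version returns everything.
            return blocks;
        }
    }

    public static void main(String[] args) {
        Index index = new ToyIndex();
        index.load();
        System.out.println(index.filter("col1 = 5")); // prints [block-0, block-1]
    }
}
```

If updates are added later, this contract would have to grow delete/insert methods, which is exactly the PostgreSQL-style access-method territory discussed above.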



RE: Abstracting CarbonData's Index Interface

Posted by Jihong Ma <Ji...@huawei.com>.
It is a great idea to open the door for a more flexible/scalable way of accessing indexes to help with query processing. If our goals are as follows:

> Goal 1: User can choose the place to store Index data, it can be stored in processing framework's memory space (like in spark driver memory) or in 
another service outside of the processing framework (like using a independent database service)

  The current approach of leveraging the index for filter predicate/column-projection pushdown is invisible to the optimizer. Since an index can live outside of the current processing framework and be physically separated from the base data, I am inclined to make the index scan visible to the optimizer and allow it to weigh in and decide which index to pick, considering we would also like to extend support beyond the B+ tree as described in Goal 2.
  

> Goal 2: Developer can add more index of his choice to CarbonData files.
Besides B+ tree on multi-dimensional key which current CarbonData supports, developers are free to add other indexing technology to make certain workload faster. These new indices should be added in a pluggable way.

  Great to make it easily extensible; we can define a clean abstract API to unify the interface (build/insert/delete/scan, ...). Refer to PostgreSQL on this matter (thanks Qingqing for sharing the link):
  https://www.postgresql.org/docs/8.4/static/indexam.html


Jenny



Re: Abstracting CarbonData's Index Interface

Posted by Jacky Li <ja...@qq.com>.
I have created a JIRA and a PR for this:

CARBONDATA-284 (https://issues.apache.org/jira/browse/CARBONDATA-284)
PR208 (https://github.com/apache/incubator-carbondata/pull/208)

Please review the interface.

Regards,
Jacky




--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Abstracting-CarbonData-s-Index-Interface-tp1587p1608.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.

Re: Abstracting CarbonData's Index Interface

Posted by Jacky Li <ja...@qq.com>.
Sure, I think what I am doing will not affect how the index is stored and loaded for the current in-memory B-tree approach; I am only adding the interface for it. You can go ahead and continue your part.

Regards,
Jacky

> On Oct 3, 2016, at 6:36 PM, Kumar Vishal [via Apache CarbonData Mailing List archive] <ml...@n5.nabble.com> wrote:
> 
> Hi Jacky, 
>             I am also changing the CarbonData file thrift structure to read 
> less, and only the required data, while loading the B-tree. The main changes 
> will be removing the data chunk from the blocklet info and keeping only the 
> offset of the data chunk, and removing from the CarbonData file some 
> redundant information, like segment info, which is already present in the 
> carbonindex file. Only the required and valid data chunks will be read 
> during scanning. 
> 
> -Regards 
> Kumar Vishal 
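The offset-based layout described above lets a reader seek directly to a data chunk instead of materializing it with the blocklet info. Illustratively (my own sketch; the real CarbonData thrift layout differs):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: store only the offset/length of a data chunk in the metadata and
// seek to it on demand, instead of embedding the chunk in the blocklet info.
// Illustrative only; the actual CarbonData thrift layout differs.
public class OffsetReadSketch {

    static byte[] readChunk(Path file, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);       // jump straight to the chunk
            byte[] buf = new byte[length];
            raf.readFully(buf);     // read only the required bytes
            return buf;
        }
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("part-0", ".carbondata");
        Files.write(f, new byte[] {9, 9, 1, 2, 3, 9});
        byte[] chunk = readChunk(f, 2, 3);  // chunk recorded at offset 2, length 3
        System.out.println(java.util.Arrays.toString(chunk)); // prints [1, 2, 3]
    }
}
```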
> 
> On Mon, Oct 3, 2016 at 1:08 PM, Jacky Li <[hidden email]> wrote: 
> 
> > Agreed. Shall I create a JIRA issue and PR for this abstraction? 
> > I think reviewing the interface code will be clearer. 
> > 
> > Regards, 
> > Jacky 
> > 
> > > On Oct 3, 2016, at 2:38 PM, Aniket Adnaik [via Apache CarbonData Mailing 
> > > List archive] <[hidden email]> wrote: 
> > > 
> > > I would agree with having a simple segment definition. A segment can use 
> > > metadata info that describes the segment, for example: segment type, index 
> > > availability, index type, index storage type (attached or 
> > > detached/secondary), etc. For a streaming ingest segment, it may also 
> > > contain min-max information for each blocklet, which can be used for 
> > > indexing. 
> > > So the implementation details of different segment types can be hidden 
> > > from the user. 
> > > We may have to think about partitioning support along with load segments 
> > > in the future. 
> > > 
> > > Best Regards, 
> > > Aniket 
> > > 
> > > 
> > > 
> > > On Sun, Oct 2, 2016 at 10:25 PM, Jacky Li <[hidden email] 
> > <x-msg://8/user/SendEmail.jtp?type=node&node=1598&i=0>> wrote: 
> > > 
> > > > After a second thought regarding the index part, another option is 
> > that to 
> > > > have a very simple Segment definition which can only list all files it 
> > has 
> > > > or listFile taking the QueryModel as input, implementation of Segment 
> > can 
> > > > be IndexSegment, MultiIndexSegment or StreamingSegment (no index). In 
> > > > future, developer is free to create MultiIndexSegment to select index 
> > > > internally. Is this option better? 
> > > > 
> > > > Regards, 
> > > > Jacky 
> > > > 
> > > > > 在 2016年10月3日,上午11:00,Jacky Li <[hidden email] 
> > <x-msg://8/user/SendEmail.jtp?type=node&node=1598&i=1>> 写道: 
> > > > > 
> > > > > I am currently thinking these abstractions: 
> > > > > 
> > > > > - A SegmentManager is the global manager of all segments for one 
> > table. 
> > > > It can be used to get all segments and manage the segment while 
> > loading and 
> > > > compaction. 
> > > > > - A CarbonInputFormat will take the input of table path, so means it 
> > > > represent the whole table contain all segments.  When getSplit is 
> > called, 
> > > > it will get all segments by calling SegmentManager interface. 
> > > > > - Each Segment contains a list of Index, and an IndexSelector. While 
> > > > currently CarbonData only has MDK index, developer can create multiple 
> > > > indices for each segment in the future. 
> > > > > - An Index is an interface to filtering on block/blocklet, and 
> > provide 
> > > > this functionality only.  Implementation should hide all complexity 
> > like 
> > > > deciding where to store the index. 
> > > > > - An IndexSelector is an interface to choose which index to use 
> > based on 
> > > > query predicates. Default implementation is to choose the first index. 
> > An 
> > > > implementation of IndexChooser can also decide not to use index at all. 
> > > > > - A Distributor is used to map the filtered block/blocklet to 
> > > > InputSplits. Implementation can take number of node, parallelism into 
> > > > consideration. It can also decide to distribute tasks based on block or 
> > > > blocklet. 
> > > > > 
> > > > > So the main concepts are SegmentManager, Segment, Index, 
> > IndexSelector, 
> > > > InputFormat/OutputFormat, Distributor. 
> > > > > 
> > > > > There will be a default implementation of CarbonInputFormat whose 
> > > > getSplit will do the following: 
> > > > > 1. gat all segments by calling SegmentManager 
> > > > > 2. for each segment, choose the index to use by IndexSelector 
> > > > > 3. invoke the selected Index to filter out block/blocklet (since 
> > these 
> > > > are two concept, maybe a parent class need to be created to encapsulate 
> > > > them) 
> > > > > 4. distribute the filtered block/blocklet to InputSplits by 
> > Distributor. 
> > > > > 
> > > > > Regarding the input to the Index.filter interface, I have not 
> > decided to 
> > > > use the existing QueryModel or create a new cleaner QueryModel 
> > interface. 
> > > > If new QueryModel is desired, it should only contain filter predicate 
> > and 
> > > > project columns, so it is much simpler than current QueryModel. But I 
> > see 
> > > > current QueryModel is used in Compaction also, so I think it is better 
> > to 
> > > > do this clean up later? 
> > > > > 
> > > > > 
> > > > > Does this look fine to you? Any suggestion is welcome. 
> > > > > 
> > > > > Regards, 
> > > > > Jacky 
> > > > > 
> > > > > 
> > > > >> 在 2016年10月3日,上午2:18,Venkata Gollamudi <[hidden email] 
> > <x-msg://8/user/SendEmail.jtp?type=node&node=1598&i=2>> 写道: 
> > > > >> 
> > > > >> Yes Jacky, interfaces needs to be revisited. 
> > > > >> For Goal 1 and Goal 2: abstraction required for both Index and Index 
> > > > store. 
> > > > >> Also multi-column index(composite index) needs to be considered. 
> > > > >> 
> > > > >> Regards, 
> > > > >> Ramana 
> > > > >> 
> > > > >> On Sat, Oct 1, 2016 at 11:01 AM, Jacky Li <[hidden email] 
> > <x-msg://8/user/SendEmail.jtp?type=node&node=1598&i=3>> wrote: 
> > > > >> 
> > > > >>> Hi community, 
> > > > >>> 
> > > > >>>   Currently CarbonData have builtin index support which is one of 
> > the 
> > > > key 
> > > > >>> strength of CarbonData. Using index, CarbonData can do very fast 
> > filter 
> > > > >>> query by filtering on block and blocklet level. However, it also 
> > > > introduces 
> > > > >>> memory consumption of the index tree and impact first query time 
> > > > because 
> > > > >>> the 
> > > > >>> process of loading of index from file footer into memory. On the 
> > other 
> > > > >>> side, 
> > > > >>> in a multi-tennant environment, multiple applications may access 
> > data 
> > > > files 
> > > > >>> simultaneously, which again exacerbate this resource consumption 
> > issue. 
> > > > >>>   So, I want to propose and discuss a solution with you to solve 
> > this 
> > > > >>> problem and make an abstraction of interface for CarbonData's 
> > future 
> > > > >>> evolvement. 
> > > > >>>   I am thinking the final result of this work should achieve at 
> > least 
> > > > two 
> > > > >>> goals: 
> > > > >>> 
> > > > >>> Goal 1: User can choose the place to store Index data, it can be 
> > > > stored in 
> > > > >>> processing framework's memory space (like in spark driver memory) 
> > or in 
> > > > >>> another service outside of the processing framework (like using a 
> > > > >>> independent database service) 
> > > > >>> 
> > > > >>> Goal 2: Developer can add more index of his choice to CarbonData 
> > files. 
> > > > >>> Besides B+ tree on multi-dimensional key which current CarbonData 
> > > > supports, 
> > > > >>> developers are free to add other indexing technology to make 
> > certain 
> > > > >>> workload faster. These new indices should be added in a pluggable 
> > way. 
> > > > >>> 
> > > > >>>    In order to achieve these goals, an abstraction need to be 
> > created 
> > > > for 
> > > > >>> CarbonData project, including: 
> > > > >>> 
> > > > >>> - Segment: each segment is presenting one load of data, and tie 
> > with 
> > > > some 
> > > > >>> indices created with this load 
> > > > >>> 
> > > > >>> - Index: index is created when this segment is created, and is 
> > > > leveraged 
> > > > >>> when CarbonInputFormat's getSplit is called, to filter out the 
> > required 
> > > > >>> blocks or even blocklets. 
> > > > >>> 
> > > > >>> - CarbonInputFormat: There maybe n number of indices created for 
> > data 
> > > > file, 
> > > > >>> when querying these data files, InputFormat should know how to 
> > access 
> > > > these 
> > > > >>> indices, and initialize or loading these index if required. 
> > > > >>> 
> > > > >>>   Obviously, this work should be separated into different tasks and 
> > > > >>> implemented gradually. But first of all, let's discuss on the goal 
> > and 
> > > > the 
> > > > >>> proposed approach. What is your idea? 
> > > > >>> 
> > > > >>> Regards, 
> > > > >>> Jacky 
> > > > >>> 
> > > > >>> 
> > > > >>> 
> > > > >>> 
> > > > >>> 
> > > > >>> -- 
> > > > >>> View this message in context: http://apache-carbondata- <http://apache-carbondata-/> < 
> > http://apache-carbondata-/ <http://apache-carbondata-/>> 
> > > > >>> mailing-list-archive.1130556.n5.nabble.com/Abstracting- < 
> > http://mailing-list-archive.1130556.n5.nabble.com/Abstracting- <http://mailing-list-archive.1130556.n5.nabble.com/Abstracting->> 
> > > > >>> CarbonData-s-Index-Interface-tp1587.html 
> > > > >>> Sent from the Apache CarbonData Mailing List archive mailing list 
> > > > archive 
> > > > >>> at Nabble.com <http://nabble.com/ <http://nabble.com/>>. 
> > > > >>> 
> > > > > 
> > > > 
> > > > 
> > > 
> > > 
> > > If you reply to this email, your message will be added to the discussion 
> > below: 
> > > http://apache-carbondata-mailing-list-archive.1130556 <http://apache-carbondata-mailing-list-archive.1130556/>. 
> > n5.nabble.com/Abstracting-CarbonData-s-Index-Interface-tp1587p1598.html < 
> > http://apache-carbondata-mailing-list-archive.1130556 <http://apache-carbondata-mailing-list-archive.1130556/>. 
> > n5.nabble.com/Abstracting-CarbonData-s-Index-Interface-tp1587p1598.html> 
> > > To unsubscribe from Abstracting CarbonData's Index Interface, click here 
> > <http://apache-carbondata-mailing-list-archive.1130556 <http://apache-carbondata-mailing-list-archive.1130556/>. 
> > n5.nabble.com/template/NamlServlet.jtp?macro= 
> > unsubscribe_by_code&node=1587&code=amFja3kubGlrdW5AcXEuY29tfDE1OD 
> > d8LTEyNTA5Nzc4Mjg=>. 
> > > NAML <http://apache-carbondata-mailing-list-archive.1130556 <http://apache-carbondata-mailing-list-archive.1130556/>. 
> > n5.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html% 
> > 21nabble%3Aemail.naml&base=nabble.naml.namespaces. 
> > BasicNamespace-nabble.view.web.template.NabbleNamespace- 
> > nabble.naml.namespaces.BasicNamespace-nabble.view. 
> > web.template.NabbleNamespace-nabble.naml.namespaces. 
> > BasicNamespace-nabble.view.web.template.NabbleNamespace- 
> > nabble.naml.namespaces.BasicNamespace-nabble.view. 
> > web.template.NabbleNamespace-nabble.naml.namespaces. 
> > BasicNamespace-nabble.view.web.template.NabbleNamespace- 
> > nabble.view.web.template.NodeNamespace&breadcrumbs= 
> > notify_subscribers%21nabble%3Aemail.naml-instant_emails% 
> > 21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml> 
> > 
> > 
> > 
> > 
> > -- 
> > View this message in context: http://apache-carbondata- <http://apache-carbondata-/>
> > mailing-list-archive.1130556.n5.nabble.com/Abstracting- 
> > CarbonData-s-Index-Interface-tp1587p1599.html 
> > Sent from the Apache CarbonData Mailing List archive mailing list archive 
> > at Nabble.com. 
> > 
> 
> 
> If you reply to this email, your message will be added to the discussion below:
> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Abstracting-CarbonData-s-Index-Interface-tp1587p1600.html <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Abstracting-CarbonData-s-Index-Interface-tp1587p1600.html>
> To unsubscribe from Abstracting CarbonData's Index Interface, click here <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=1587&code=amFja3kubGlrdW5AcXEuY29tfDE1ODd8LTEyNTA5Nzc4Mjg=>.
> NAML <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>




--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Abstracting-CarbonData-s-Index-Interface-tp1587p1601.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.

Re: Abstracting CarbonData's Index Interface

Posted by Kumar Vishal <ku...@gmail.com>.
Hi Jacky,
            I am also changing the CarbonData file thrift structure so that
only the required data is read while loading the B-tree. The main changes
are removing the data chunks from the blocklet info, keeping only the
offset of each data chunk, and removing from the CarbonData file redundant
information (such as the segment info) that is already present in the
carbonindex file. Only required and valid data chunks will be read during
scanning.

-Regards
Kumar Vishal

On Mon, Oct 3, 2016 at 1:08 PM, Jacky Li <ja...@qq.com> wrote:

> Agreed. Shall I create a JIRA issue and PR for this abstraction?
> I think reviewing on the interface code will be clearer.
>
> Regards,
> Jacky
>
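The footer restructuring Kumar Vishal describes — keeping only the position of each data chunk in the blocklet info instead of embedding the chunk itself — can be illustrated with a minimal sketch. All class, field, and method names below are hypothetical and are not CarbonData's actual thrift schema or classes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: blocklet metadata that stores only chunk offsets and
// lengths, so loading the B-tree does not deserialize the data chunks themselves.
public class BlockletInfoSketch {

    // Reference to a data chunk by file position, instead of the embedded chunk.
    static final class ChunkRef {
        final long offset;  // byte offset of the serialized data chunk in the file
        final int length;   // serialized length of the data chunk
        ChunkRef(long offset, int length) { this.offset = offset; this.length = length; }
    }

    private final List<ChunkRef> chunkRefs = new ArrayList<>();

    void addChunk(long offset, int length) { chunkRefs.add(new ChunkRef(offset, length)); }

    // Only the chunks a query actually needs are ever read from disk.
    long bytesToRead(int[] projectedChunks) {
        long total = 0;
        for (int i : projectedChunks) total += chunkRefs.get(i).length;
        return total;
    }

    public static void main(String[] args) {
        BlockletInfoSketch info = new BlockletInfoSketch();
        info.addChunk(0, 4096);      // chunk 0
        info.addChunk(4096, 8192);   // chunk 1
        info.addChunk(12288, 2048);  // chunk 2
        // A query needing only chunks 0 and 2 reads 4096 + 2048 bytes.
        System.out.println(info.bytesToRead(new int[]{0, 2}));  // prints 6144
    }
}
```

The point of the change is visible in `bytesToRead`: with only offsets in the metadata, the reader can seek directly to the valid chunks instead of parsing every embedded chunk during the B-tree load.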

Re: Abstracting CarbonData's Index Interface

Posted by Jacky Li <ja...@qq.com>.
Agreed. Shall I create a JIRA issue and PR for this abstraction?
I think reviewing the interface code will be clearer.

Regards,
Jacky

> On Oct 3, 2016, at 2:38 PM, Aniket Adnaik [via Apache CarbonData Mailing List archive] <ml...@n5.nabble.com> wrote:
> 
> I would agree with having a simple segment definition. A segment can use
> metadata that describes it - for example: segment type, index
> availability, index type, and index storage type (attached or
> detached/secondary). For a streaming ingest segment, it may also contain
> min-max information for each blocklet that can be used for indexing.
> So the implementation details of different segment types can be hidden
> from the user.
> We may have to think about partitioning support along with load segments
> in the future.
> 
> Best Regards, 
> Aniket 
> 
> 
> 
> On Sun, Oct 2, 2016 at 10:25 PM, Jacky Li <[hidden email]> wrote: 
> 
> > After a second thought regarding the index part, another option is to 
> > have a very simple Segment definition which can only list all the files 
> > it has, or a listFile taking the QueryModel as input; implementations of 
> > Segment can be IndexSegment, MultiIndexSegment or StreamingSegment (no 
> > index). In the future, developers are free to create a MultiIndexSegment 
> > to select an index internally. Is this option better? 
> > 
> > Regards, 
> > Jacky 
> > 
> > > On Oct 3, 2016, at 11:00 AM, Jacky Li <[hidden email]> wrote: 
> > > 
> > > I am currently thinking these abstractions: 
> > > 
> > > - A SegmentManager is the global manager of all segments for one table. 
> > > It can be used to get all segments and to manage segments during 
> > > loading and compaction. 
> > > - A CarbonInputFormat takes a table path as input, so it represents the 
> > > whole table containing all segments. When getSplit is called, it gets 
> > > all segments by calling the SegmentManager interface. 
> > > - Each Segment contains a list of Index, and an IndexSelector. While 
> > > CarbonData currently only has the MDK index, developers can create 
> > > multiple indices for each segment in the future. 
> > > - An Index is an interface for filtering on block/blocklet, and 
> > > provides this functionality only. Implementations should hide all 
> > > complexity, like deciding where to store the index. 
> > > - An IndexSelector is an interface to choose which index to use based 
> > > on query predicates. The default implementation chooses the first 
> > > index. An implementation can also decide not to use any index at all. 
> > > - A Distributor is used to map the filtered blocks/blocklets to 
> > > InputSplits. Implementations can take the number of nodes and the 
> > > parallelism into consideration, and can also decide to distribute 
> > > tasks based on blocks or blocklets. 
> > > 
> > > So the main concepts are SegmentManager, Segment, Index, IndexSelector, 
> > > InputFormat/OutputFormat, and Distributor. 
> > > 
> > > There will be a default implementation of CarbonInputFormat whose 
> > > getSplit will do the following: 
> > > 1. get all segments by calling the SegmentManager 
> > > 2. for each segment, choose the index to use via the IndexSelector 
> > > 3. invoke the selected Index to filter out blocks/blocklets (since 
> > > these are two concepts, maybe a parent class needs to be created to 
> > > encapsulate them) 
> > > 4. distribute the filtered blocks/blocklets to InputSplits via the 
> > > Distributor 
> > > 
> > > Regarding the input to the Index.filter interface, I have not decided 
> > > whether to use the existing QueryModel or to create a new, cleaner 
> > > QueryModel interface. If a new QueryModel is desired, it should contain 
> > > only the filter predicates and projected columns, so it would be much 
> > > simpler than the current QueryModel. But I see the current QueryModel 
> > > is also used in compaction, so I think it is better to do this cleanup 
> > > later. 
> > > 
> > > Does this look fine to you? Any suggestion is welcome. 
> > > 
> > > Regards, 
> > > Jacky 
> > > 
> > > 
> > >> On Oct 3, 2016, at 2:18 AM, Venkata Gollamudi <[hidden email]> wrote: 
> > >> 
> > >> Yes Jacky, interfaces need to be revisited. 
> > >> For Goals 1 and 2, abstraction is required for both the Index and the 
> > >> Index store. 
> > >> Also, multi-column (composite) indexes need to be considered. 
> > >> 
> > >> Regards, 
> > >> Ramana 
> > >> 
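The abstractions discussed above (SegmentManager, Segment, Index, IndexSelector, Distributor) and the default getSplit flow can be sketched as plain Java interfaces. Every name and signature below is an assumption made for illustration; this is not CarbonData's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed index abstractions.
public class IndexAbstractionSketch {

    public interface Block { }               // a filtered unit: block or blocklet
    public interface QueryModel { }          // filter predicates + projected columns

    public interface Index {
        List<Block> filter(QueryModel query);   // prune blocks/blocklets for a query
    }

    public interface IndexSelector {
        Index select(List<Index> indices, QueryModel query);
    }

    public interface Segment {               // one load of data, with its indices
        List<Index> getIndices();
        IndexSelector getIndexSelector();
    }

    public interface SegmentManager {        // global manager of a table's segments
        List<Segment> getAllSegments();
    }

    public interface Distributor {           // maps filtered blocks to InputSplits
        List<List<Block>> distribute(List<Block> blocks);
    }

    // Default getSplit flow: segments -> index selection -> filtering -> distribution.
    public static List<List<Block>> getSplits(SegmentManager manager,
                                              Distributor distributor,
                                              QueryModel query) {
        List<Block> filtered = new ArrayList<>();
        for (Segment segment : manager.getAllSegments()) {
            Index index = segment.getIndexSelector().select(segment.getIndices(), query);
            filtered.addAll(index.filter(query));
        }
        return distributor.distribute(filtered);
    }

    public static void main(String[] args) {
        Block b1 = new Block() { };
        Block b2 = new Block() { };
        Index stubIndex = query -> List.of(b1, b2);          // stub: "filters" to two blocks
        Segment segment = new Segment() {
            public List<Index> getIndices() { return List.of(stubIndex); }
            public IndexSelector getIndexSelector() { return (is, q) -> is.get(0); }
        };
        SegmentManager manager = () -> List.of(segment);
        Distributor distributor = blocks -> List.of(blocks); // one split with all blocks
        System.out.println(getSplits(manager, distributor, new QueryModel() { }).get(0).size());
    }
}
```

A StreamingSegment without an index could, under this sketch, simply return an Index whose filter yields all of its blocks, which keeps the getSplit flow uniform across segment types.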

Re: Abstracting CarbonData's Index Interface

Posted by Aniket Adnaik <an...@gmail.com>.
I would agree with having a simple segment definition. A segment can use
metadata that describes it - for example: segment type, index availability,
index type, and index storage type (attached or detached/secondary). For a
streaming ingest segment, it may also contain min-max information for each
blocklet that can be used for indexing.
So the implementation details of different segment types can be hidden from
the user.
We may have to think about partitioning support along with load segments in
the future.

Best Regards,
Aniket
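The segment metadata described above might look like the following sketch. The field names, enum values, and the min/max pruning helper are all assumptions for illustration, not actual CarbonData classes:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-segment metadata.
public class SegmentMetadataSketch {

    enum SegmentType { BATCH_LOAD, STREAMING_INGEST }
    enum IndexStorageType { ATTACHED, DETACHED }   // detached = secondary index

    static final class MinMax {
        final long min, max;
        MinMax(long min, long max) { this.min = min; this.max = max; }
        boolean mayContain(long value) { return value >= min && value <= max; }
    }

    final SegmentType segmentType;
    final boolean indexAvailable;
    final String indexType;                        // e.g. "btree" (example value only)
    final IndexStorageType indexStorageType;
    // For streaming-ingest segments: blockletId -> min/max of some column,
    // usable to prune blocklets even before a full index exists.
    final Map<Integer, MinMax> blockletMinMax = new HashMap<>();

    SegmentMetadataSketch(SegmentType segmentType, boolean indexAvailable,
                          String indexType, IndexStorageType indexStorageType) {
        this.segmentType = segmentType;
        this.indexAvailable = indexAvailable;
        this.indexType = indexType;
        this.indexStorageType = indexStorageType;
    }

    // A blocklet must be scanned unless its min/max range excludes the value.
    boolean blockletMayMatch(int blockletId, long value) {
        MinMax mm = blockletMinMax.get(blockletId);
        return mm == null || mm.mayContain(value); // no stats recorded => must scan
    }

    public static void main(String[] args) {
        SegmentMetadataSketch meta = new SegmentMetadataSketch(
                SegmentType.STREAMING_INGEST, false, "minmax", IndexStorageType.ATTACHED);
        meta.blockletMinMax.put(0, new MinMax(10, 20));
        System.out.println(meta.blockletMayMatch(0, 15)); // prints true
    }
}
```

With metadata like this, a query planner can inspect a segment without knowing its concrete type, which is exactly the encapsulation argued for above.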



On Sun, Oct 2, 2016 at 10:25 PM, Jacky Li <ja...@qq.com> wrote:

> After a second thought regarding the index part, another option is to
> have a very simple Segment definition which can only list all the files it
> has (or a listFiles method taking the QueryModel as input); implementations
> of Segment can be IndexSegment, MultiIndexSegment or StreamingSegment (no
> index). In the future, a developer is free to create a MultiIndexSegment to
> select an index internally. Is this option better?
>
> Regards,
> Jacky
>

Re: Abstracting CarbonData's Index Interface

Posted by Jacky Li <ja...@qq.com>.
After a second thought regarding the index part, another option is to have a very simple Segment definition which can only list all the files it has (or a listFiles method taking the QueryModel as input); implementations of Segment can be IndexSegment, MultiIndexSegment or StreamingSegment (no index). In the future, a developer is free to create a MultiIndexSegment to select an index internally. Is this option better?
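
A rough Java sketch of this simple-Segment option (interface and method
names are only illustrative guesses; only the IndexSegment and
StreamingSegment names come from the text above, and the index-based pruning
is stubbed out):

```java
import java.util.Arrays;
import java.util.List;

// Placeholder for the (possibly simplified) query model carrying predicates.
class QueryModel { }

// A deliberately minimal Segment: it can only list its files, optionally
// pruned by the query. Whether and how an index is used stays internal.
interface Segment {
    List<String> listFiles(QueryModel query);
}

// A segment without any index (e.g. streaming ingest) returns all its files.
class StreamingSegment implements Segment {
    private final List<String> files;
    StreamingSegment(List<String> files) { this.files = files; }
    public List<String> listFiles(QueryModel query) { return files; }
}

// An indexed segment would consult its index internally to prune files;
// here the pruning is stubbed out and all files are returned.
class IndexSegment implements Segment {
    private final List<String> files;
    IndexSegment(List<String> files) { this.files = files; }
    public List<String> listFiles(QueryModel query) {
        // Real code would evaluate the query predicates against the index.
        return files;
    }
}

public class SegmentDemo {
    public static void main(String[] args) {
        Segment s = new StreamingSegment(Arrays.asList("part-0", "part-1"));
        System.out.println(s.listFiles(new QueryModel()).size());
    }
}
```

The caller (e.g. an InputFormat) only ever sees listFiles, so a future
MultiIndexSegment can choose among its indices without any interface change.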

Regards,
Jacky



Re: Abstracting CarbonData's Index Interface

Posted by Jacky Li <ja...@qq.com>.
I am currently thinking these abstractions:

- A SegmentManager is the global manager of all segments for one table. It can be used to get all segments and to manage segments during loading and compaction.
- A CarbonInputFormat takes a table path as input, which means it represents the whole table containing all segments. When getSplit is called, it gets all segments by calling the SegmentManager interface.
- Each Segment contains a list of Indexes and an IndexSelector. While CarbonData currently only has the MDK index, developers can create multiple indices for each segment in the future.
- An Index is an interface for filtering on blocks/blocklets, and provides this functionality only. Implementations should hide all complexity, like deciding where to store the index.
- An IndexSelector is an interface to choose which index to use based on the query predicates. The default implementation is to choose the first index. An implementation of IndexSelector can also decide not to use any index at all.
- A Distributor is used to map the filtered blocks/blocklets to InputSplits. Implementations can take the number of nodes and the parallelism into consideration. A Distributor can also decide to distribute tasks based on blocks or on blocklets.

So the main concepts are SegmentManager, Segment, Index, IndexSelector, InputFormat/OutputFormat, Distributor.

There will be a default implementation of CarbonInputFormat whose getSplit will do the following:
1. get all segments by calling the SegmentManager
2. for each segment, choose the index to use via the IndexSelector
3. invoke the selected Index to filter out blocks/blocklets (since these are two concepts, maybe a parent class needs to be created to encapsulate them)
4. distribute the filtered blocks/blocklets to InputSplits via the Distributor.
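
The four steps above could be sketched in Java roughly as follows (all names
are hypothetical renderings of the proposed abstractions, not actual
CarbonData classes; the demo index and distributor are trivial stubs):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class QueryModel { }                       // filter predicates + projected columns

class Blocklet {                           // stand-in for the block/blocklet parent class
    final String id;
    Blocklet(String id) { this.id = id; }
}

interface Index {
    // Step 3: keep only the blocklets that may match the query.
    List<Blocklet> filter(QueryModel query, List<Blocklet> all);
}

interface IndexSelector {
    // Step 2: pick an index; may return null, meaning "use no index at all".
    Index select(List<Index> indices, QueryModel query);
}

interface Distributor {
    // Step 4: map the filtered blocklets to input splits.
    List<List<Blocklet>> distribute(List<Blocklet> filtered);
}

class Segment {
    final List<Index> indices;
    final IndexSelector selector;
    final List<Blocklet> blocklets;
    Segment(List<Index> indices, IndexSelector selector, List<Blocklet> blocklets) {
        this.indices = indices; this.selector = selector; this.blocklets = blocklets;
    }
}

interface SegmentManager {
    // Step 1: all segments of one table.
    List<Segment> getAllSegments();
}

public class GetSplitsSketch {
    static List<List<Blocklet>> getSplits(SegmentManager mgr, QueryModel q, Distributor d) {
        List<Blocklet> filtered = new ArrayList<>();
        for (Segment s : mgr.getAllSegments()) {
            Index idx = s.selector.select(s.indices, q);
            filtered.addAll(idx == null ? s.blocklets : idx.filter(q, s.blocklets));
        }
        return d.distribute(filtered);
    }

    public static void main(String[] args) {
        // One segment, one trivial index that prunes nothing.
        Index keepAll = (q, all) -> all;
        IndexSelector first = (indices, q) -> indices.isEmpty() ? null : indices.get(0);
        Segment seg = new Segment(Arrays.asList(keepAll), first,
                Arrays.asList(new Blocklet("b1"), new Blocklet("b2")));
        SegmentManager mgr = () -> Collections.singletonList(seg);
        Distributor oneSplitPerBlocklet = filtered -> {
            List<List<Blocklet>> splits = new ArrayList<>();
            for (Blocklet b : filtered) splits.add(Collections.singletonList(b));
            return splits;
        };
        System.out.println(getSplits(mgr, new QueryModel(), oneSplitPerBlocklet).size());
    }
}
```

Note how getSplits only touches the interfaces, so index storage, index
selection and split distribution can each be swapped independently.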

Regarding the input to the Index.filter interface, I have not decided whether to use the existing QueryModel or to create a new, cleaner QueryModel interface. If a new QueryModel is desired, it should only contain the filter predicate and the projected columns, so it would be much simpler than the current QueryModel. But I see the current QueryModel is also used in compaction, so I think it is better to do this cleanup later?


Does this look fine to you? Any suggestion is welcome.

Regards,
Jacky




Re: Abstracting CarbonData's Index Interface

Posted by Venkata Gollamudi <g....@gmail.com>.
Yes Jacky, the interfaces need to be revisited.
For Goal 1 and Goal 2: abstraction is required for both the Index and the Index store.
Also, multi-column (composite) indexes need to be considered.

Regards,
Ramana

On Sat, Oct 1, 2016 at 11:01 AM, Jacky Li <ja...@qq.com> wrote:


Re: Abstracting CarbonData's Index Interface

Posted by Qingqing Zhou <zh...@gmail.com>.
On Mon, Oct 3, 2016 at 8:52 PM, Jacky Li <ja...@qq.com> wrote:
> I think we can try to reuse everything except the index storage, like
> segment management and the query-logic processing after the InputSplits
> are gathered by calling the index interface. I think an index can be
> programmed at different levels; what I proposed here is still a
> block-level solution, so it can be processed at the InputFormat level.

I agree the scan provider is at the InputFormat level; this is the same
wherever you store the index. The discrepancy I see here is the
implementation of the index itself: if you use a "database service" to store
your index, you can simply invoke a "CREATE INDEX" statement to implement
indexing, but if you want to store the index in Carbon, you will need to
implement the B-tree yourself.

Agree "segment management" can be shared, as it is for the indexed data.
About "query logic processing": currently Carbon pushes certain SARGs down
to the storage level, so in the above picture these two implementations
won't be able to share this logic: the "database service" one will rely on
the query processor (currently Spark) to tell it how to use the index, while
the "carbon" one will handle it internally. To change this, we would have to
expose the "carbon" index to the query processor level.

Regards,
Qingqing

Re: Abstracting CarbonData's Index Interface

Posted by Jacky Li <ja...@qq.com>.
> On Oct 4, 2016, at 5:43 AM, Qingqing Zhou <zh...@gmail.com> wrote:
> 
> On Fri, Sep 30, 2016 at 10:31 PM, Jacky Li <ja...@qq.com> wrote:
>> However, it also introduces memory consumption of the index tree and
>> impact first query time because the process of loading of index from
>> file footer into memory. On the other side, in a multi-tennant
>> environment, multiple applications may access data files simultaneously,
>> which again exacerbate this resource consumption issue.
>> 
> Agree we shall at least not rely so much on driver memory for indexing.
> 
>> 
>> Goal 1: User can choose the place to store Index data, it can be stored
>> in processing framework's memory space (like in spark driver memory) or
>> in another service outside of the processing framework (like using a
>> independent database service)
>> 
> 
> How much of the code will be shared by the same index across different
> "places"? For example, for a B-tree index, if you do it inside Carbon, you
> are programming at the block level and you will worry about block
> [de]allocation, tree balance, etc. But if you rely on a database service,
> you are programming at the table level, i.e. with relational tables and
> indexes. Meanwhile, an index is essentially data redundancy, so updates
> need careful design if the index is outside of your control.
> 

I think we can try to reuse everything except the index storage, like segment management and the query-logic processing after the InputSplits are gathered by calling the index interface.
I think an index can be programmed at different levels; what I proposed here is still a block-level solution, so it can be processed at the InputFormat level. If you are looking for a table-level indexing solution, it means that you need to manipulate the query plan to do some kind of join of two tables, which means we need to add logic into the processing framework's optimizer - something I tend to avoid in the CarbonData project unless it has huge benefits. Because every optimizer has a different interface, there is no *standard* way to do it right now. Do you see any benefit of doing it at the table level?
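
To illustrate the "reuse everything except index storage" point: both
storage choices can sit behind the same block-level Index interface, so the
getSplit flow stays unchanged. A purely illustrative Java sketch (none of
these names come from actual CarbonData code, and the pruning logic is
stubbed out):

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class QueryModel { }

interface Index {
    List<String> filterBlocks(QueryModel query, List<String> allBlocks);
}

// Variant 1: index kept in the processing framework's memory
// (e.g. the Spark driver).
class InMemoryBTreeIndex implements Index {
    public List<String> filterBlocks(QueryModel query, List<String> allBlocks) {
        // A real implementation would walk a B-tree loaded from file footers;
        // here pruning is stubbed out and all blocks are kept.
        return allBlocks;
    }
}

// Variant 2: index lookups delegated to an external service. The lookup
// function stands in for, e.g., a query against a database-backed index
// table maintained via CREATE INDEX.
class ExternalServiceIndex implements Index {
    private final Function<QueryModel, List<String>> lookup;
    ExternalServiceIndex(Function<QueryModel, List<String>> lookup) {
        this.lookup = lookup;
    }
    public List<String> filterBlocks(QueryModel query, List<String> allBlocks) {
        return lookup.apply(query);
    }
}

public class IndexStorageSketch {
    public static void main(String[] args) {
        List<String> blocks = Arrays.asList("blk-0", "blk-1", "blk-2");
        Index local = new InMemoryBTreeIndex();
        Index remote = new ExternalServiceIndex(q -> Arrays.asList("blk-1"));
        // getSplit-style code only ever sees the Index interface.
        System.out.println(local.filterBlocks(new QueryModel(), blocks).size());
        System.out.println(remote.filterBlocks(new QueryModel(), blocks).size());
    }
}
```

Everything upstream of Index (segment management, split distribution) is
shared; only the implementation behind filterBlocks differs.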

> Regards,
> Qingqing
> 




Re: Abstracting CarbonData's Index Interface

Posted by Qingqing Zhou <zh...@gmail.com>.
On Fri, Sep 30, 2016 at 10:31 PM, Jacky Li <ja...@qq.com> wrote:
> However, it also introduces memory consumption of the index tree and
> impact first query time because the process of loading of index from
> file footer into memory. On the other side, in a multi-tennant
> environment, multiple applications may access data files simultaneously,
> which again exacerbate this resource consumption issue.
>
Agree we shall at least not rely so much on driver memory for indexing.

>
> Goal 1: User can choose the place to store Index data, it can be stored
> in processing framework's memory space (like in spark driver memory) or
> in another service outside of the processing framework (like using a
> independent database service)
>

How much of the code will be shared by the same index across different
"places"? For example, for a B-tree index, if you do it inside Carbon, you
are programming at the block level and you will worry about block
[de]allocation, tree balance, etc. But if you rely on a database service,
you are programming at the table level, i.e. with relational tables and
indexes. Meanwhile, an index is essentially data redundancy, so updates need
careful design if the index is outside of your control.

Regards,
Qingqing