Posted to dev@carbondata.apache.org by Indhumathi <in...@gmail.com> on 2020/07/30 07:19:01 UTC

[Discussion] SI support Complex Array Type

Hi community,

I am currently working on supporting SI (secondary index) on the complex
array type. To support it, we must decide how to store the Array type
in the SI table to get the best performance.

Solution 1:
Store Array as complex(ARRAY) type in secondary index table.

Cons:
Pruning large array data on both the SI table and the main table adds
overhead and might not yield much performance gain.

Solution 2:
Flatten the Array data and store it as its child data type in the secondary
index table, which can provide a benefit in some scenarios compared to
solution 1. (I have raised a PR with this solution.) As a first step, only
one level of Array will be supported.

With this solution, I have also added support for pruning SI at the rowId
level for complex types (keeping the position id down to rowId instead of
blockletId) for better performance.

Cons:
With this solution, a query with more than one array_contains filter
combined by expressions such as AND cannot be supported on SI, since the
array data is flattened in SI.
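
To make this limitation concrete, here is a minimal Python sketch of the
flattening idea. It is not CarbonData code: the function names
flatten_array_column and si_filter, and the (element, row_id) layout of the
SI rows, are assumptions made purely for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical layout: one SI row per array element, pointing back to a row id.
SiRow = Tuple[str, int]

def flatten_array_column(main_table: Dict[int, List[str]]) -> List[SiRow]:
    """Turn each array into one SI row per element: (element, row_id)."""
    return [(elem, row_id) for row_id, arr in main_table.items() for elem in arr]

def si_filter(si_rows: List[SiRow], predicate: Callable[[str], bool]) -> Set[int]:
    """Row ids of SI rows whose single element satisfies the predicate."""
    return {row_id for elem, row_id in si_rows if predicate(elem)}

# Toy main table: row_id -> array value (illustrative data only).
main_table = {0: ["a", "b"], 1: ["a"], 2: ["c"]}
si_rows = flatten_array_column(main_table)

# A single array_contains(col, 'a') maps directly onto the flattened rows.
one_filter = si_filter(si_rows, lambda e: e == "a")

# array_contains(col, 'a') AND array_contains(col, 'b'), evaluated as one
# row-wise filter: no flattened row holds both values, so nothing matches,
# even though row 0 of the main table contains both.
and_filter = si_filter(si_rows, lambda e: e == "a" and e == "b")
```

In this toy data, one_filter selects rows 0 and 1, while and_filter is empty.
That is why the combined filter cannot be pushed to the flattened SI as it
stands.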

Inputs and suggestions for a new solution, or changes to the above
solutions, are most welcome.

Regards,
Indhumathi
 



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [Discussion] SI support Complex Array Type

Posted by Kumar Vishal <ku...@gmail.com>.

+1 for solution 2

Regards
Kumar Vishal



Re: [Discussion] SI support Complex Array Type

Posted by Ravindra Pesala <ra...@gmail.com>.

Hi All,

+1 for solution 2. But don't store rowid, as it makes the storage very big
and the performance very slow. Let's go with the current SI model, which
stores position references only down to the blocklet level. Don't
complicate things by storing rowid.
Solution 1 makes the scan slower, as it needs to construct the complex value
for every row. So it is better to flatten the data to get better scan
performance and storage optimization.

Consider the following way.
*Array:* Flatten each row into multiple rows and store the position
reference down to the blocklet id.
*Struct:* It is up to the user to choose which element to index.
For example, with *emp:struct<name: String, address: String>*, the user can
create a separate SI on individual elements such as *emp.name* or
*emp.address*.
*Map:* Here too we can flatten the data as for Array. But the user
should choose whether the SI is on the map key or the map value. If both
are needed, the user can create separate SIs.
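
As a rough illustration of the three cases above, here is a small Python
sketch. It is not CarbonData code: the row layout, the helper names
si_on_array, si_on_struct_field and si_on_map, and the collapsing of
duplicate (value, blocklet id) pairs are all assumptions for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical main-table row: (blocklet_id, tags array, emp struct, props map)
Row = Tuple[int, List[str], Dict[str, str], Dict[str, str]]

rows: List[Row] = [
    (0, ["a", "b"], {"name": "x", "address": "p"}, {"k1": "v1"}),
    (0, ["a"], {"name": "y", "address": "p"}, {"k2": "v1"}),
    (1, ["c"], {"name": "x", "address": "q"}, {"k1": "v2"}),
]

def si_on_array(rows: List[Row]) -> List[Tuple[str, int]]:
    """Array: one (element, blocklet_id) entry per distinct pair."""
    return sorted({(elem, r[0]) for r in rows for elem in r[1]})

def si_on_struct_field(rows: List[Row], field: str) -> List[Tuple[str, int]]:
    """Struct: index only the element the user picked, e.g. emp.name."""
    return sorted({(r[2][field], r[0]) for r in rows})

def si_on_map(rows: List[Row], use_keys: bool) -> List[Tuple[str, int]]:
    """Map: flatten like Array, on either the keys or the values."""
    return sorted(
        {(item, r[0])
         for r in rows
         for item in (r[3].keys() if use_keys else r[3].values())}
    )

array_si = si_on_array(rows)
name_si = si_on_struct_field(rows, "name")
map_key_si = si_on_map(rows, use_keys=True)
map_value_si = si_on_map(rows, use_keys=False)
```

Under these assumptions, the value "a" occurs in two rows of blocklet 0 but
needs only one blocklet-level entry, whereas a rowid-level index would keep
one entry per row.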

Regards,
Ravindra.



-- 
Thanks & Regards,
Ravi

Re: [Discussion] SI support Complex Array Type

Posted by Ajantha Bhat <aj...@gmail.com>.

Hi David & Indhumathi,
Storing an Array of String as just a String column in SI by flattening [with
row-level position reference] can result in slow performance in the
following cases:
* Multiple array_contains() or multiple array[0] = 'x' filters.
* The join solution mentioned can result in multiple scans (one for every
complex filter condition), which can slow down SI performance.
* Row-level SI can slow down SI performance when the filter result is huge.
* To support multiple SIs on a single table, the complex SI would use
row-level position references and the primitive SI would use blocklet-level
position references. The join would need extra logic and time.
* Solution 2 cannot support SI on struct columns in the future, so it cannot
be a generic solution.

Considering the above points, *solution 2 is a very good solution if only
one filter exists* on the complex column, *but it is not a good solution for
all scenarios.*

*So I have to go with solution 1, or wait for other people's opinions
or new solutions.*

Thanks,
Ajantha


Re: [Discussion] SI support Complex Array Type

Posted by David CaiQiang <da...@gmail.com>.

+1 for solution2

Can we support more than one array_contains by using SI join (like SI on
primitive data type)?
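
A rough Python sketch of what such a join could look like, assuming a
blocklet-level SI on the flattened array values: look up each value
separately, intersect the blocklet ids, then re-apply the full filter on the
main table for the surviving blocklets. The names build_si,
candidate_blocklets and array_contains_all are hypothetical and not
CarbonData APIs.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical main table: blocklet_id -> [(row_id, array value), ...]
Blocklets = Dict[int, List[Tuple[int, List[str]]]]

def build_si(blocklets: Blocklets) -> Dict[str, Set[int]]:
    """Flattened SI: array element -> ids of blocklets that contain it."""
    si: Dict[str, Set[int]] = {}
    for blocklet_id, rows in blocklets.items():
        for _row_id, arr in rows:
            for elem in arr:
                si.setdefault(elem, set()).add(blocklet_id)
    return si

def candidate_blocklets(si: Dict[str, Set[int]], values: List[str]) -> Set[int]:
    """One SI lookup per array_contains value, joined by intersection."""
    if not values:
        raise ValueError("need at least one value")
    return set.intersection(*(si.get(v, set()) for v in values))

def array_contains_all(blocklets: Blocklets, si: Dict[str, Set[int]],
                       values: List[str]) -> List[int]:
    """Re-check the AND filter on the main table, only in candidate blocklets."""
    hits: List[int] = []
    for blocklet_id in sorted(candidate_blocklets(si, values)):
        for row_id, arr in blocklets[blocklet_id]:
            if all(v in arr for v in values):
                hits.append(row_id)
    return hits

blocklets: Blocklets = {
    0: [(0, ["a", "b"]), (1, ["a"])],
    1: [(2, ["a"]), (3, ["b"])],
    2: [(4, ["c"])],
}
si = build_si(blocklets)
candidates = candidate_blocklets(si, ["a", "b"])
matching_rows = array_contains_all(blocklets, si, ["a", "b"])
```

In this toy data, blocklet 1 survives the join because "a" and "b" occur in
different rows, so the main-table re-check is still needed. The sketch also
performs one SI lookup per filter value.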



-----
Best Regards
David Cai