Posted to user@cassandra.apache.org by Kant Kodali <ka...@peernova.com> on 2016/10/15 06:37:14 UTC

Is SASI index in Cassandra efficient for high cardinality columns?

I understand that secondary indexes in general are inefficient on
high-cardinality columns, but since SASI was built from scratch, I wonder
whether the same argument applies there. If not, why? I believe primary keys
in Cassandra are indexed, and since the primary key is supposed to be the
column with the highest cardinality, why not do the same for secondary indexes?

Re: Is SASI index in Cassandra efficient for high cardinality columns?

Posted by DuyHai Doan <do...@gmail.com>.
If you read my blog post on the 2nd index deep dive, you'll find all the
answers.
On 21 Oct 2016 at 10:20, "Kant Kodali" <ka...@peernova.com> wrote:

> Why can't a secondary index be broken down into token ranges like the
> primary index, at least for exact matches? That way we wouldn't need to
> scan the whole cluster for exact matches. I understand that for a substring
> search, a term of length n has on the order of n^2 substrings, which
> equates to as many hashes/tokens, and that can be a lot!

Re: Is SASI index in Cassandra efficient for high cardinality columns?

Posted by Kant Kodali <ka...@peernova.com>.
Why can't a secondary index be broken down into token ranges like the
primary index, at least for exact matches? That way we wouldn't need to
scan the whole cluster for exact matches. I understand that for a substring
search, a term of length n has on the order of n^2 substrings, which
equates to as many hashes/tokens, and that can be a lot!
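
To put a number on the substring concern, here is a small Python sketch (not
from the thread; the function names are made up for illustration). It counts
the non-empty contiguous substrings of a term, which is what a
one-token-per-substring scheme would have to hash. The count grows as
n(n+1)/2 for a term of length n.

```python
def substring_count(n: int) -> int:
    """Number of non-empty contiguous substrings of a string of length n."""
    return n * (n + 1) // 2


def distinct_substrings(term: str) -> set:
    """Enumerate the distinct contiguous substrings of a term."""
    return {
        term[i:j]
        for i in range(len(term))
        for j in range(i + 1, len(term) + 1)
    }


if __name__ == "__main__":
    term = "cassandra"
    # Upper bound from the closed form: 9 * 10 / 2 = 45.
    print(substring_count(len(term)))
    # Repeated letters make some substrings coincide, so this can be smaller.
    print(len(distinct_substrings(term)))
```

Even at this polynomial growth, every indexed term would fan out into many
tokens spread over the ring, so the cost is paid on every write.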

On Sat, Oct 15, 2016 at 4:35 AM, DuyHai Doan <do...@gmail.com> wrote:

> If each indexed value has very few matching rows, then querying using SASI
> (or any implementation of secondary index) may scan the whole cluster.
>
> This is because the index is "distributed", i.e. the indexed values stay
> on the same nodes as the base data. Even SASI, with its own data
> structure, will not help much here.
>
> One should understand that a 2nd index query has to deal with 2 layers:
>
> 1) The cluster layer, which is common to any implementation of 2nd index.
> Read my blog post here:
> http://www.planetcassandra.org/blog/cassandra-native-secondary-index-deep-dive/
>
> 2) The local read path, which depends on the implementation of the 2nd
> index. Some use the Lucene library, like the Stratio implementation; some
> roll their own data structures, like SASI.
>
> If you have a 1-to-1 relationship between the indexed value and the matching
> row (or 1-to-a-few), I would recommend using materialized views instead:
>
> http://www.slideshare.net/doanduyhai/sasi-cassandra-on-the-full-text-search-ride-voxxed-daybelgrade-2016/25
>
> Materialized views guarantee that for each searched indexed value, you only
> hit a single node (or N replicas, depending on the consistency level used).
>
> However, materialized views have their own drawbacks (weaker consistency
> guarantees), and you can't use range queries (<, >, ≤, ≥) or full-text
> search on the indexed value.

Re: Is SASI index in Cassandra efficient for high cardinality columns?

Posted by DuyHai Doan <do...@gmail.com>.
If each indexed value has very few matching rows, then querying using SASI
(or any implementation of secondary index) may scan the whole cluster.

This is because the index is "distributed", i.e. the indexed values stay on
the same nodes as the base data. Even SASI, with its own data structure,
will not help much here.

One should understand that a 2nd index query has to deal with 2 layers:

1) The cluster layer, which is common to any implementation of 2nd index.
Read my blog post here:
http://www.planetcassandra.org/blog/cassandra-native-secondary-index-deep-dive/

2) The local read path, which depends on the implementation of the 2nd
index. Some use the Lucene library, like the Stratio implementation; some
roll their own data structures, like SASI.

If you have a 1-to-1 relationship between the indexed value and the matching
row (or 1-to-a-few), I would recommend using materialized views instead:

http://www.slideshare.net/doanduyhai/sasi-cassandra-on-the-full-text-search-ride-voxxed-daybelgrade-2016/25

Materialized views guarantee that for each searched indexed value, you only
hit a single node (or N replicas, depending on the consistency level used).

However, materialized views have their own drawbacks (weaker consistency
guarantees), and you can't use range queries (<, >, ≤, ≥) or full-text
search on the indexed value.
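
The difference at the cluster layer can be illustrated with a toy model in
Python. This is only a sketch under loud assumptions: the 4-node cluster, the
MD5-modulo placement (a stand-in for the real partitioner and token ranges),
a replication factor of 1, and all class and function names are hypothetical.
A real coordinator also queries nodes in rounds and may stop early once
enough rows are found, so "ask every node" is the worst case, which is the
case a unique value tends to hit.

```python
import hashlib

NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster, RF = 1


def owner(partition_key: str) -> str:
    """Toy stand-in for a partitioner: hash the key and pick one node."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


class ToyCluster:
    def __init__(self):
        # Base table: each row lives on the node owning its partition key.
        self.base = {n: {} for n in NODES}
        # Local secondary index: kept on the SAME node as the base row.
        self.local_index = {n: {} for n in NODES}
        # View-like table: partitioned by the indexed value itself.
        self.view = {n: {} for n in NODES}

    def insert(self, user_id: str, email: str) -> None:
        node = owner(user_id)
        self.base[node][user_id] = email
        self.local_index[node].setdefault(email, set()).add(user_id)
        self.view[owner(email)].setdefault(email, set()).add(user_id)

    def query_by_index(self, email: str):
        """Any node may hold the match, so every node is contacted."""
        hits, contacted = set(), 0
        for node in NODES:
            contacted += 1
            hits |= self.local_index[node].get(email, set())
        return hits, contacted

    def query_by_view(self, email: str):
        """The value is the partition key, so one node is contacted."""
        node = owner(email)
        return self.view[node].get(email, set()), 1


if __name__ == "__main__":
    cluster = ToyCluster()
    for i in range(100):
        cluster.insert(f"u{i}", f"user{i}@example.com")
    print(cluster.query_by_index("user42@example.com"))  # contacts 4 nodes
    print(cluster.query_by_view("user42@example.com"))   # contacts 1 node
```

The local data structure (plain dicts here) plays no part in how many nodes
are contacted, which is why a better on-disk index alone does not change the
fan-out.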





On Sat, Oct 15, 2016 at 11:55 AM, Kant Kodali <ka...@peernova.com> wrote:

> Well, I went with the definition from Wikipedia, and that definition rules
> out #1, so it is #2, and it is just one matching row in my case.

Re: Is SASI index in Cassandra efficient for high cardinality columns?

Posted by Kant Kodali <ka...@peernova.com>.
Well, I went with the definition from Wikipedia, and that definition rules
out #1, so it is #2, and it is just one matching row in my case.



On Sat, Oct 15, 2016 at 2:40 AM, DuyHai Doan <do...@gmail.com> wrote:

> Define precisely what you mean by "high cardinality columns". Do you mean:
>
> 1) a single indexed value is present in a lot of rows
> 2) a single indexed value has only a few (if not just one) matching rows

Re: Is SASI index in Cassandra efficient for high cardinality columns?

Posted by DuyHai Doan <do...@gmail.com>.
Define precisely what you mean by "high cardinality columns". Do you mean:

1) a single indexed value is present in a lot of rows
2) a single indexed value has only a few (if not just one) matching rows
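
One way to tell the two cases apart on sample data is to look at rows per
distinct value rather than at the word "cardinality". A small Python sketch
follows; the function name and the sample columns are made up for
illustration.

```python
from collections import Counter


def index_profile(values):
    """Summarise how many rows each distinct indexed value matches."""
    counts = Counter(values)
    distinct = len(counts)
    return {
        "rows": len(values),
        "distinct_values": distinct,
        "avg_rows_per_value": len(values) / distinct if distinct else 0.0,
        "max_rows_per_value": max(counts.values(), default=0),
    }


if __name__ == "__main__":
    # Case 1: few distinct values, each present in many rows.
    countries = ["FR", "US", "FR", "US", "US", "DE"]
    # Case 2: every value matches exactly one row.
    emails = [f"user{i}@example.com" for i in range(6)]
    print(index_profile(countries))
    print(index_profile(emails))
```

An average close to 1 row per value corresponds to case 2.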


On Sat, Oct 15, 2016 at 8:37 AM, Kant Kodali <ka...@peernova.com> wrote:

> I understand that secondary indexes in general are inefficient on
> high-cardinality columns, but since SASI was built from scratch, I wonder
> whether the same argument applies there. If not, why? I believe primary keys
> in Cassandra are indexed, and since the primary key is supposed to be the
> column with the highest cardinality, why not do the same for secondary indexes?
>
