Posted to user@cassandra.apache.org by Atul Saroha <at...@snapdeal.com> on 2016/05/10 11:41:31 UTC

Low cardinality secondary index behaviour

I have a concern about using a secondary index on a field with low
cardinality. Let's say I have a few billion rows, each of which can be
classified into one of 1000 categories, on a 50-node cluster.

Now we want to fetch the data for a single category using a secondary index
on the category column. The query is paginated, with a fetch size of, say,
5000.

A query on a secondary index works as a scatter-gather operation driven by
the coordinator node. Would it lead to out-of-memory errors on the
coordinator, or to too many timeout errors?

How does pagination (token-level data fetch) behave in the scatter-gather
approach?

Secondly, what if we create an inverted table with the category as the
partition key? That will put a lot of data on a single node. It might then
lead to a hot-shard issue, and to slow fetches from that node, since a single
partition would hold millions of rows.

How should we tackle such a low-cardinality index in Cassandra?

Thanks
---------------------------------------------------------------------------------------------------------------------
Atul Saroha
*Lead Software Engineer*

Plot # 362, ASF Centre - Tower A, Udyog Vihar,
 Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA

Re: Low cardinality secondary index behaviour

Posted by DuyHai Doan <do...@gmail.com>.
Cassandra 3.0.6 does not have SASI. SASI is available only from C* 3.4
onwards, but I advise C* 3.5/3.6 because some critical bugs were fixed in
3.5.

On Wed, May 18, 2016 at 1:58 PM, Atul Saroha <at...@snapdeal.com>
wrote:

> Thanks Tyler,
>
> SPARSE SASI index solves my use case. Planning to upgrade Cassandra to
> 3.0.6 now.

Re: Low cardinality secondary index behaviour

Posted by Atul Saroha <at...@snapdeal.com>.
Thanks Tyler,

SPARSE SASI index solves my use case. Planning to upgrade Cassandra to
3.0.6 now.


Re: Low cardinality secondary index behaviour

Posted by Tyler Hobbs <ty...@datastax.com>.
On Tue, May 10, 2016 at 6:41 AM, Atul Saroha <at...@snapdeal.com>
wrote:

> I have a concern about using a secondary index on a field with low
> cardinality. Let's say I have a few billion rows, each of which can be
> classified into one of 1000 categories, on a 50-node cluster.
>
> Now we want to fetch the data for a single category using a secondary index
> on the category column. The query is paginated, with a fetch size of, say,
> 5000.
>
> A query on a secondary index works as a scatter-gather operation driven by
> the coordinator node. Would it lead to out-of-memory errors on the
> coordinator, or to too many timeout errors?
>

Paging will prevent the coordinator from using excessive memory.  With the
type of data that you described, timeouts shouldn't be a huge problem,
because it will only take a few token ranges (assuming you're using vnodes)
to get enough matching rows to hit the page size.
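
As a rough back-of-the-envelope check of the "few token ranges" point, the
sketch below estimates how many ranges have to be scanned to fill one page.
The 3 billion row count, the 256 vnodes per node, and the perfectly even
spread of rows over categories and tokens are all assumptions made for
illustration, and the function name is hypothetical.

```python
import math


def ranges_to_fill_page(total_rows, categories, nodes, vnodes_per_node, page_size):
    """Estimate how many token ranges must be scanned to fill one page."""
    # Rows matching a single category, assuming rows are spread evenly.
    matching_rows = total_rows / categories
    # With vnodes, the ring is split into roughly nodes * vnodes_per_node ranges.
    token_ranges = nodes * vnodes_per_node
    # Matching rows expected per range, assuming a uniform token distribution.
    matches_per_range = matching_rows / token_ranges
    return math.ceil(page_size / matches_per_range)


estimate = ranges_to_fill_page(
    total_rows=3_000_000_000,  # assumed value for "a few billion"
    categories=1000,
    nodes=50,
    vnodes_per_node=256,       # assumed vnode count per node
    page_size=5000,
)
print(estimate)
```

Under these assumed numbers, a page of 5000 rows needs about 22 of the
roughly 12,800 token ranges, which is why a single page should not have to
touch the whole ring.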


>
> How does pagination (token-level data fetch) behave in the scatter-gather
> approach?
>

Secondary index queries fetch token ranges in sequential order [1],
starting with the minimum token.  When you fetch a new page, it resumes
from the last token (and primary key) that it returned in the previous page.

[1] As an optimization, multiple token ranges will be fetched in parallel
based on estimates of how many token ranges it will take to fill the page.
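
The resume behaviour described above can be illustrated with a toy model.
This is only a sketch, not Cassandra's implementation: the ring is a plain
list of token ranges, each holding (token, key) pairs that already match the
index, and the parallel-fetch optimization from [1] is left out. The function
name and the data are hypothetical.

```python
from bisect import bisect_right


def fetch_page(ranges, page_size, resume_after=None):
    """Scan token ranges in order and return (rows, paging_state).

    ranges: token ranges in ring order; each is a sorted list of
        (token, key) tuples for rows that match the indexed value.
    resume_after: the (token, key) of the last row of the previous page,
        or None to start from the minimum token.
    """
    page = []
    for token_range in ranges:
        start = 0
        if resume_after is not None:
            # Skip every row at or before the last row already returned.
            start = bisect_right(token_range, resume_after)
        for row in token_range[start:]:
            page.append(row)
            if len(page) == page_size:
                # The last row doubles as the state for the next page.
                return page, row
    # Ring exhausted: no further pages.
    return page, None


ring = [
    [(-90, "a"), (-40, "b")],
    [(5, "c"), (5, "d"), (30, "e")],
    [(70, "f")],
]
first, state = fetch_page(ring, 3)
second, next_state = fetch_page(ring, 3, resume_after=state)
print(first, state)
print(second, next_state)
```

Note that the first page ends in the middle of a token range, and the second
page picks up at the next key with the same token, which is why the primary
key is part of the paging position and not just the token.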


>
> Secondly, what if we create an inverted table with the category as the
> partition key? That will put a lot of data on a single node. It might then
> lead to a hot-shard issue, and to slow fetches from that node, since a
> single partition would hold millions of rows.
>
> How should we tackle such a low-cardinality index in Cassandra?


The data distribution that you described sounds like a reasonable fit for
secondary indexes.  However, I would also take into account how frequently
you run this query and how fast you need it to be.  Even ignoring the
scatter-gather aspects of secondary index queries, they are still expensive
because they fetch many non-contiguous rows from an SSTable.  If you need
to run this query very frequently, that may add too much load to your
cluster, and some sort of inverted table approach may be more appropriate.
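
One possible shape for such an inverted table, offered only as a sketch and
not as something specified in this thread, is to make the partition key
(category, bucket) rather than category alone, so that a single category is
spread over several partitions and nodes. The bucket count, the use of CRC32,
and all names below are assumptions for illustration.

```python
import zlib

NUM_BUCKETS = 64  # hypothetical; tune to the expected partition size


def bucket_for(row_id, num_buckets=NUM_BUCKETS):
    """Map a row id to a stable bucket number in [0, num_buckets)."""
    # CRC32 is deterministic across processes, unlike Python's built-in hash().
    return zlib.crc32(row_id.encode("utf-8")) % num_buckets


def partition_key(category, row_id, num_buckets=NUM_BUCKETS):
    """Partition key used when writing a row to the inverted table."""
    return (category, bucket_for(row_id, num_buckets))


def partitions_for_category(category, num_buckets=NUM_BUCKETS):
    """All partition keys a reader must query to cover one category."""
    return [(category, bucket) for bucket in range(num_buckets)]


print(partition_key("books", "row-1"))
print(len(partitions_for_category("books")))
```

The trade-off is that a read for one category becomes num_buckets partition
reads instead of one, but each read is a contiguous slice of a much smaller
partition.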

-- 
Tyler Hobbs
DataStax <http://datastax.com/>