Posted to user@cassandra.apache.org by Jing Meng <se...@gmail.com> on 2018/10/22 13:01:51 UTC

Wondering how cql3 DISTINCT query is implemented

Hi, we built a simple system to migrate live Cassandra data to other
databases, mainly using these two queries:

1. SELECT DISTINCT TOKEN(partition_key) FROM table WHERE
TOKEN(partition_key) > current_offset AND TOKEN(partition_key) <=
upper_bound LIMIT token_fetch_size
2. Any cql query that retrieves all rows, given a set of tokens
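
To make the paging in step 1 concrete, here is a minimal Python sketch of
the loop, not our actual code. The names build_token_query, iter_token_pages
and make_fake_fetch are hypothetical, and fetch_tokens stands in for whatever
driver call executes the query; an in-memory fake is used here so the sketch
runs without a cluster.

```python
from bisect import bisect_right
from typing import Callable, Iterator, List


def build_token_query(table: str, partition_key: str) -> str:
    """Return the step-1 CQL text with placeholders for the three bind values."""
    tok = f"TOKEN({partition_key})"
    return (
        f"SELECT DISTINCT {tok} FROM {table} "
        f"WHERE {tok} > %s AND {tok} <= %s LIMIT %s"
    )


def iter_token_pages(
    fetch_tokens: Callable[[int, int, int], List[int]],
    current_offset: int,
    upper_bound: int,
    token_fetch_size: int,
) -> Iterator[List[int]]:
    """Yield pages of tokens in (current_offset, upper_bound].

    fetch_tokens(offset, upper, limit) is assumed to behave like the
    step-1 query: distinct tokens, ascending, at most `limit` of them.
    """
    while current_offset < upper_bound:
        tokens = fetch_tokens(current_offset, upper_bound, token_fetch_size)
        if not tokens:
            break
        yield tokens
        # The largest token seen becomes the exclusive lower bound next time.
        current_offset = max(tokens)
        if len(tokens) < token_fetch_size:
            break  # a short page means the range is exhausted


def make_fake_fetch(all_tokens: List[int]) -> Callable[[int, int, int], List[int]]:
    """In-memory stand-in for the database, for illustration only."""
    ordered = sorted(set(all_tokens))

    def fetch(offset: int, upper: int, limit: int) -> List[int]:
        start = bisect_right(ordered, offset)  # first token > offset
        end = bisect_right(ordered, upper)     # tokens <= upper
        return ordered[start:end][:limit]

    return fetch
```

Each page of tokens would then be handed to the step-2 query. Note that the
loop issues one step-1 query per page, so any per-partition cost inside that
query is paid token_fetch_size times per round trip.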

We observed that the "SELECT DISTINCT TOKEN" query takes much longer when
the table has wide partitions (200+ rows per partition on average); the
cost of the underlying operation does not appear to be linear.

Does the query scan every row of every partition it finds until
token_fetch_size is met? Or is the slowdown due to some low-level
operations that are inherently more expensive on wide partitions?

Any advice on this question, or pointers to the relevant code, would be
appreciated.