Posted to user@cassandra.apache.org by Redmumba <re...@gmail.com> on 2014/06/09 23:18:02 UTC

Cannot query secondary index

I have a table with a timestamp column on it; however, when I try to query
based on it, it fails saying that I must use ALLOW FILTERING--which, to me,
means it's not using the secondary index.  Table definition is (snipping out
irrelevant parts)...

> CREATE TABLE audit (
>     id bigint,
>     date timestamp,
> ...
>     PRIMARY KEY (id, date)
> );
> CREATE INDEX date_idx ON audit (date);
>

There are other fields, but they are not relevant to this example.  The
date is part of the primary key, and I have a secondary index on it.  When
I run a SELECT against it, I get an error:

> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01';
> Bad Request: Cannot execute this query as it might involve data filtering
> and thus may have unpredictable performance. If you want to execute this
> query despite the performance unpredictability, use ALLOW FILTERING
> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01'
> ALLOW FILTERING;
> Request did not complete within rpc_timeout.
>

How can I force it to use the index?  I've seen rebuild_index tasks
running, but can I verify the "health" of the index?

Re: Cannot query secondary index

Posted by Jonathan Lacefield <jl...@datastax.com>.
Hello,

  You are receiving this error because you are not passing in the Partition
Key as part of your query.  Cassandra is telling you that it doesn't know on
which node to find the data, and you haven't explicitly told it to search
across all your nodes for the data.  The ALLOW FILTERING clause bypasses the
need to pass in a partition key in your query.
http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html

  Big picture, for data modeling in Cassandra, it's advisable to model your
data based on the query access patterns and to duplicate data into tables
that represent your queries.  In this case, creating a table with a Partition
Key of date could benefit you.  Heavy use of ALLOW FILTERING could cause
performance issues within your cluster.
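
  For example, a minimal sketch of such a query table (the table name, the
day-bucket column, and the column layout here are illustrative, not a
prescription for your schema):

CREATE TABLE audit_by_day (
    day text,          -- e.g. '2014-04-30'; the day bucket is the partition key
    date timestamp,
    id bigint,
    PRIMARY KEY (day, date, id)
);

-- A date-range question then becomes a single-partition slice:
SELECT * FROM audit_by_day
 WHERE day = '2014-04-30' AND date < '2014-04-30 12:00:00';

Note that you would write each audit row to both tables at insert time.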

  Also, please be aware that Secondary Indexes are much different in
Cassandra-land compared to indexes in RDBMS-land.  They should be used only
when necessary, i.e., for an explicit use case.  Typically, modeling your data
so you can avoid Secondary Indexes will ensure a well-performing system and
queries.

  Here's a good intro to Cassandra data modeling:
https://www.youtube.com/watch?v=HdJlsOZVGwM

  Hope this helps.

Jonathan

Jonathan Lacefield
Solutions Architect, DataStax
(404) 822 3487
<http://www.linkedin.com/in/jlacefield>

<http://www.datastax.com/cassandrasummit14>




Re: Cannot query secondary index

Posted by Paulo Ricardo Motta Gomes <pa...@chaordicsystems.com>.
Our approach for this scenario is to run a Hadoop job that periodically
cleans old entries, but I admit it's far from ideal.  It would be nice to
have a more native way to perform these kinds of tasks.

There's a legend about a compaction strategy that keeps only the first N
entries of a partition, but I don't think it has been implemented yet; if I
remember correctly, there's a JIRA ticket about it.



-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br <http://www.chaordic.com.br/>*
+55 48 3232.3200

Re: Cannot query secondary index

Posted by Mohit Anchlia <mo...@gmail.com>.
Some other ways to track old records are:

1) Use external queues - one queue per week or month, for instance - and
pile up data on the queue cluster
2) Create one more table in C* to track the keys per week or month, which
you can scan to read the keys of the audit table.  Make sure you delete the
entire row after deleting the data, instead of individual columns, because
that is more efficient.
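
A rough sketch of option 2 (the table and column names are illustrative):

-- Track which audit partitions were written in each week/month bucket
CREATE TABLE audit_keys_by_week (
    week text,         -- e.g. '2014W24'
    id bigint,
    region ascii,
    PRIMARY KEY (week, id, region)
);

-- To expire a week: read its keys, delete each whole audit partition,
-- then remove the tracking row itself
SELECT id, region FROM audit_keys_by_week WHERE week = '2014W24';
DELETE FROM audit WHERE id = 1234567890 AND region = 'us-east-1';
DELETE FROM audit_keys_by_week WHERE week = '2014W24';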


Re: Cannot query secondary index

Posted by Jonathan Lacefield <jl...@datastax.com>.
Hello,

  What you are attempting to do reminds me of the old "sliding window"
partitioning trick in RDBMS systems.  You're right, there is no
system-provided tool that allows you to perform a similar operation.  You
could always leverage option 3, and then create a service that helps manage
the effort of the manual delete.  However, you would still have to insert
into this separate table for each indexed item.

  The cost of the "every once in a while" delete may be infrequent enough
for you to do what you were actually trying to do in the first place: use a
secondary index and query the table leveraging the ALLOW FILTERING clause.
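
  Concretely, that occasional cleanup could look like this (a sketch only;
the date literal and values are placeholders, and (id, region) is the audit
table's partition key as described elsewhere in this thread):

SELECT id, region FROM audit WHERE date < '2014-05-01' ALLOW FILTERING;
-- ...then, for each distinct (id, region) pair returned, drop the partition:
DELETE FROM audit WHERE id = 1234567890 AND region = 'us-east-1';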

  My recommendation would be to:

  1)  leverage TTLs
  2)  see what type of load your original plan would put on your system,
and if it's acceptable and/or you have a downtime window to execute this
costly operation, go that route when TTLs aren't keeping up with your load.

  You have such a special use case for this functionality that the little,
infrequent performance hit outweighs the complexity of implementing
options 1 and 3.

  Thanks,

Jonathan



Jonathan Lacefield
Solutions Architect, DataStax
(404) 822 3487
<http://www.linkedin.com/in/jlacefield>

<http://www.datastax.com/cassandrasummit14>




Re: Cannot query secondary index

Posted by Redmumba <re...@gmail.com>.
Honestly, this has been by far my single biggest obstacle with Cassandra
for time-based data--cleaning up the old data when the deletion criterion
(i.e., date) isn't the primary key.  I've asked about a few different
approaches, but I haven't really seen any feasible options that can be
implemented easily.  I've seen the following:

   1. Use date-based tables, then drop old tables, a la
   "audit_table_20140610", "audit_table_20140609", etc.
   But then I run into the issue of having to query every table--I would
   have to execute queries against every day to get the data, and then merge
   the data myself.  Unless there's something in the binary driver I'm
   missing, it doesn't sound like this would be practical.
   2. Use a TTL (see the sketch after this list).
   But then I have to basically decide on a value that works for everything
   and, if it ever turns out I overestimated, I'm basically SOL, because my
   cluster will be out of space.
   3. Maintain a separate index of days to keys, and use this index as the
   reference for which keys to delete.
   But then this requires maintaining another index and a relatively manual
   delete.
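
(For reference, option 2 boils down to a TTL on every write--a minimal
sketch, borrowing the audit schema I describe elsewhere in this thread,
with 30 days as a placeholder retention:

INSERT INTO audit (id, region, date, pid)
VALUES (1234567890, 'us-east-1', '2014-06-10', 42)
USING TTL 2592000;  -- 30 days, in seconds
)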

I can't help but feel that I am just way over-engineering this, or that I'm
missing something basic in my data model.  Except for the last approach, I
can't help but feel that I'm overlooking something obvious.

Andrew



Re: Cannot query secondary index

Posted by Redmumba <re...@gmail.com>.
Of course, Jonathan, I'll do my best!

It's an auditing table that, right now, uses a primary key consisting of a
composite partition key (the region and the object id) plus the date and the
process ID.  Each event in our system will create anywhere from 1-20 rows,
for example, and multiple parts of the system might be working on the same
"object ID".  So the CF is constantly being appended to, but reads are rare.

> CREATE TABLE audit (
>     id bigint,
>     region ascii,
>     date timestamp,
>     pid int,
>     PRIMARY KEY ((id, region), date, pid)
> );


Data is queried on a specific object ID and region.  Optionally, users can
restrict their query to a specific date range, which the above data model
provides.
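
In other words, reads look something like this (the values are placeholders):

SELECT * FROM audit
 WHERE id = 1234567890
   AND region = 'us-east-1'
   AND date >= '2014-04-01' AND date < '2014-05-01';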

However, we generate quite a bit of data, and we want a convenient way to
get rid of the oldest data.  Since our system scales with the time of year,
we might get 50GB a day during peak, and 5GB of data off peak.  We could
pick the safest number--let's say, 30 days--and set the TTL using that.
The problem there is that, for 90% of the year, we'll be using only a very
small percentage of our available space.

What I'd like to be able to do is drop old tables as needed--i.e., let's
say when we hit 80% load across the cluster (or some such metric that takes
the cluster-wide load into account), I want to drop the oldest day's
records until we're under 80%.  That way, we're always using the maximum
amount of space we can, without having to worry about getting to the point
where we run out of space cluster-wide.

My thoughts are--we could always make the date part of the primary key, but
then we'd either a) have to query the entire range of dates, or b) we'd
have to force a small date range when querying.  What are the penalties?
Do you have any other suggestions?



Re: Cannot query secondary index

Posted by Jonathan Lacefield <jl...@datastax.com>.
Hello,

  Will you please describe the use case and what you are trying to model?
What are some questions/queries that you would like to serve via
Cassandra?  This will help the community help you a little better.

Jonathan Lacefield
Solutions Architect, DataStax
(404) 822 3487
<http://www.linkedin.com/in/jlacefield>

<http://www.datastax.com/cassandrasummit14>




Re: Cannot query secondary index

Posted by Redmumba <re...@gmail.com>.
I've been trying to work around using "date-based tables" because I'd like
to avoid the overhead.  It seems, however, that this is just not going to
work.

So here's a question--for these date-based tables (i.e., a table per
day/week/month/whatever), how are they queried?  If I keep 60 days' worth of
auditing data, for example, I'd need to query all 60 tables--can I do that
smoothly?  Or do I have to have 60 different select statements?  Is there a
way for me to run the same query against all the tables?



Re: Cannot query secondary index

Posted by Redmumba <re...@gmail.com>.
Ah, so the secondary indices are really secondary against the primary key.
That makes sense.

I'm beginning to see why the whole "date-based table" approach is the only
one I've been able to find... thanks for the quick responses, guys!



Re: Cannot query secondary index

Posted by Michal Michalski <mi...@boxever.com>.
Secondary indexes internally are just CFs that map the indexed value to the
row key that the value belongs to, so you can only query these indexes
using "=", not ">", ">=", etc.

However, your query does not require an index *IF* you provide a row key -
you can use "<" or ">" on the date column like you did, as long as you refer
to a single row.  If you don't provide a row key, it's not going to work.
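
For example, with your original PRIMARY KEY (id, date), this works with no
index and no ALLOW FILTERING at all (the id value is a placeholder):

SELECT * FROM audit WHERE id = 1234567890 AND date < '2014-05-01';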

M.

Kind regards,
Michał Michalski,
michal.michalski@boxever.com

