Posted to user@cassandra.apache.org by Voytek Jarnot <vo...@gmail.com> on 2017/04/01 03:26:13 UTC

Tracing output regarding number of sstables hit and am I chasing my tail

I was about to optimize some queries based on tracing output, but then saw
CASSANDRA-13120 (https://issues.apache.org/jira/browse/CASSANDRA-13120) and
am now wondering whether there is anything to be gained.

We have a table with a (quite simplified/sanitized) structure as such:

created_week_year int,
created_week int,
created_date date,
lots of value columns,
primary key((created_week_year, created_week), created_date, two others)

So, fundamentally time-series data, partitioned by calendar week.

If, for example, a user executes a query covering a 30-day timespan, we
split and parallelize the query by partition, so for a query from
2017-03-01 to 2017-03-31, we'll execute multiple queries as such:

select * from tab where created_week_year=2017 and created_week=13 and
created_date <= '2017-03-31' and created_date >= '2017-03-01';

select * from tab where created_week_year=2017 and created_week=12 and
created_date <= '2017-03-31' and created_date >= '2017-03-01';

and so forth, one query for every week partition.
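
For illustration, the per-partition fan-out described above could be
sketched as follows. This is only a sketch: it assumes that
created_week_year and created_week follow ISO-8601 week numbering (the
post does not say which week convention is used), and week_partitions is
a hypothetical helper name, not something from our code base.

```python
from datetime import date, timedelta


def week_partitions(start, end):
    """Return the (week_year, week) pairs touched by the range [start, end].

    Assumption: weeks are ISO-8601 weeks (Monday through Sunday).
    """
    partitions = []
    # Rewind to the Monday of the week containing the start date.
    monday = start - timedelta(days=start.weekday())
    while monday <= end:
        iso_year, iso_week, _ = monday.isocalendar()
        partitions.append((iso_year, iso_week))
        monday += timedelta(days=7)
    return partitions


# One query would be issued per pair returned here.
for week_year, week in week_partitions(date(2017, 3, 1), date(2017, 3, 31)):
    print(week_year, week)
```

Under that ISO-week assumption, the range 2017-03-01 to 2017-03-31 touches
weeks 9 through 13 of 2017, so five queries would be issued.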

Notice that we do not bother to narrow the created_date params to match the
created_week begin/end dates - the only thing that changes from query to
query is the created_week_year and created_week.

Performance isn't great in the current setup, and I was thinking a valid
optimization would be to change things such that in addition to specifying
a unique created_week_year and created_week parameter, we also calculate
the start-of-week date and end-of-week date client side as such:

select * from tab where created_week_year=2017 and created_week=13 and
created_date <= '2017-03-31' and created_date >= '2017-03-27';
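
A minimal sketch of that client-side calculation, again assuming ISO-8601
weeks (Monday through Sunday); clamped_range is a hypothetical name used
only for this example:

```python
from datetime import date, timedelta


def clamped_range(week_year, week, query_start, query_end):
    """Intersect the user's date range with one week partition.

    Assumption: (week_year, week) is an ISO-8601 week-year and week number.
    """
    # January 4th always falls in ISO week 1, so its Monday starts week 1.
    jan4 = date(week_year, 1, 4)
    week1_monday = jan4 - timedelta(days=jan4.weekday())
    week_start = week1_monday + timedelta(weeks=week - 1)
    week_end = week_start + timedelta(days=6)
    # Narrow the requested range to the part that lies inside this week.
    return max(week_start, query_start), min(week_end, query_end)


lower, upper = clamped_range(2017, 13, date(2017, 3, 1), date(2017, 3, 31))
print(lower.isoformat(), upper.isoformat())
```

For week 13 of 2017 this gives 2017-03-27 and 2017-03-31, the same bounds
as in the narrowed query above.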

I did some tracing in cqlsh, and it does seem like this would help: fewer
sstables are merged when the date ranges are more tightly restricted.
However, CASSANDRA-13120 seems to indicate that this is simply inaccurate
tracing output, and that there are no gains to be had from this
optimization.

Really looking for confirmation one way or another on this. Re-calculating
the date range for each week partition is not a significant client-side
change, but if there is nothing to be gained, then we don't need to waste
our time doing it.

Thank you.