Posted to user@cassandra.apache.org by Alexandru Sicoe <ad...@gmail.com> on 2012/04/03 20:12:04 UTC

Re: composite query performance depends on component ordering

Hi Sylvain and Aaron,

Thanks for the comment, Sylvain; what you say makes sense. I have
microsecond-precision timestamps, and looking at some row printouts I see
that everything happens at a different timestamp, which means the comparator
almost never has to compare the second, ~100-byte component.
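
To make that concrete, the two orderings are built roughly like this with
Hector's Composite (an illustrative sketch only, not my ingest code; the
ColumnNames class and the method names are made up for the example):

import me.prettyprint.cassandra.serializers.LongSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.beans.Composite;

public class ColumnNames {

    // "Timestamp:ID": the Long timestamp is the first component. With
    // microsecond precision the timestamps are almost always distinct, so
    // the comparator rarely has to look at the ~100-byte ID component.
    static Composite timestampFirst(long timestampMicros, String sourceId) {
        Composite name = new Composite();
        name.addComponent(timestampMicros, LongSerializer.get());
        name.addComponent(sourceId, StringSerializer.get());
        return name;
    }

    // "ID:Timestamp": the ~100-byte ID string is the first component and is
    // compared for every column before the timestamp is even considered.
    static Composite idFirst(String sourceId, long timestampMicros) {
        Composite name = new Composite();
        name.addComponent(sourceId, StringSerializer.get());
        name.addComponent(timestampMicros, LongSerializer.get());
        return name;
    }
}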

As for the methodology, it's not very thorough. I used Cassandra 0.8.5.

What I did: I had acquired a large data set, about 300 hrs' worth of data,
in Schema 1 (details below), and I found that some queries were easily
hitting thousands of rows, giving me very poor performance. I converted this
data to Schema 2 (details below), grouping the data together in the same row
and increasing the time bucket of the row, with two versions of the column
names, "Timestamp:ID" and "ID:Timestamp". So I obtained a CF with 66 rows:
11 rows (each a 24 hr time bucket) per column-name version for each of the 3
types of data sources that are dominant in the rate of information they give
me.

These are the results I got using the CompositeQueryIterator (with a
modified maximum of 100,000 columns returned per slice) taken from the
composite query tutorial at
http://www.datastax.com/dev/blog/introduction-to-composite-columns-part-1
(the code is at https://github.com/zznate/cassandra-tutorial). So basically
I used null for start and end in order to read entire rows at a time, and I
timed my code. The actual values are doubles for all 3 types. The size is
the file size after dumping the results to a text file.
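
For reference, the read I timed has roughly this shape (a stripped-down
sketch of what the tutorial's CompositeQueryIterator ends up doing; the
keyspace, column family and row key parameters are placeholders, and the
real iterator pages through the row slice by slice rather than in one call):

import me.prettyprint.cassandra.serializers.CompositeSerializer;
import me.prettyprint.cassandra.serializers.DoubleSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.Composite;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.SliceQuery;

public class RowReadTimer {

    static long timeFullRowRead(Keyspace keyspace, String columnFamily, String rowKey) {
        SliceQuery<String, Composite, Double> query = HFactory.createSliceQuery(
                keyspace, StringSerializer.get(), new CompositeSerializer(), DoubleSerializer.get());
        query.setColumnFamily(columnFamily);
        query.setKey(rowKey);
        // null start/end = slice from the beginning to the end of the row;
        // 100000 is the (modified) max number of columns per slice.
        query.setRange(null, null, false, 100000);

        long start = System.currentTimeMillis();
        int columns = query.execute().get().getColumns().size();
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("read " + columns + " columns in " + elapsed + " ms");
        return elapsed;
    }
}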

OK, in my previous email I only looked at the rows with the maximum size,
which gave me a 20% difference. Averaged over all the rows it is a bit less.


   Type 1

                                  ID:Timestamp    Timestamp:ID
 No. cols returned  Size of file  ExecTime (sec)  ExecTime (sec)  ExecTime Diff %
 387174             25M           12.59           8.6             31.68
 1005113            66M           31.83           21.84           31.38
 579633             38M           18.07           12.46           31.03
 1217634            81M           33.77           24.65           26.99
 376303             24M           12.32           10.36           15.94
 2493007            169M          68.68           59.93           12.74
 6298275            428M          183.28          147.57          19.48
 2777962            189M          83.16           73.3            11.86
 6138047            416M          170.88          155.83          8.81
 3193450            216M          93.26           82.84           11.18
 2302928            155M          69.91           61.62           11.85

 Avg = 19.3 %
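
(In all three tables the Diff % column is the relative difference between
the two orderings, i.e. (ID:Timestamp time - Timestamp:ID time) /
ID:Timestamp time * 100; for the first row above, (12.59 - 8.6) / 12.59 *
100 ≈ 31.7, which matches the 31.68 up to rounding of the displayed times.)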


   Type 2

                                  ID:Timestamp    Timestamp:ID
 No. cols returned  Size of file  ExecTime (sec)  ExecTime (sec)  ExecTime Diff %
 350468             40M           12.92           13.12           -1.59
 1303797            148M          43.33           38.98           10.04
 697763             79M           26.78           22.05           17.66
 825414             94M           33.5            26.69           20.31
 55075              6.2M          2.97            2.13            28.15
 1873775            213M          72.37           51.12           29.37
 3982433            453M          147.04          110.71          24.71
 1546491            176M          54.86           42.13           23.21
 4117491            468M          143.1           114.62          19.9
 1747506            199M          63.23           63.05           0.28
 2720160            308M          96.06           82.47           14.14

 Avg = 16.9 %

   Type 3

                                  ID:Timestamp    Timestamp:ID
 No. cols returned  Size of file  ExecTime (sec)  ExecTime (sec)  ExecTime Diff %
 192667             7.2M          5.88            6.5             -10.49
 210593             7.9M          6.33            5.57            12.06
 144677             5.4M          3.78            3.74            1.22
 207706             7.7M          6.33            5.74            9.28
 235937             8.7M          6.34            6.11            3.64
 159985             6.0M          4.23            3.93            7.07
 134859             5.5M          3.91            3.38            13.46
 70545              2.9M          2.96            2.08            29.84
 98487              3.9M          4.04            2.62            35.22
 205979             8.2M          7.35            5.67            22.87
 166045             6.2M          5.12            3.99            22.1

 Avg = 13.3 %

Just so you can see why I did these tests:

Data set:
I have ~300,000 data sources. Each data source has several variables it can
output values for; there are ~12 variables per data source. This gives ~4
million independent time series (let's call them streams) that need to go
into Cassandra. The streams give me (timestamp, value) pairs at highly
different rates, depending on the data source they come from and on
operating conditions. This translates into very different row lengths if a
single time bucket size is used across all streams.

The data sources can be further grouped into types (several data sources can
share the same type). There are ~100 types.

Use case:
The system
- will serve a web dashboard.
- should allow queries at the highest granularity, for short periods of time
(up to 4-8 hrs), on any individual stream or any grouping of streams.
- should allow a way of obtaining on-demand (offline) analytics over long
periods of time (up to 1 year), and then (real-time) querying of the
analytics data.

Cassandra schemas used so far:
Schema 1: 1 row for each of the ~3 million streams. Each row is a 4 hr time
bucket.
Schema 2: 1 row for each of the ~100 types. Each row is a 24 hr time bucket.

Now I'm planning to use Schema 2 only, with an 8 hr time bucket, to strike a
better balance between rows that get very long and ones that don't.
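
For completeness, the two Schema 2 column families are defined roughly like
this (a from-memory sketch using Hector's DDL API; the keyspace and column
family names are placeholders, so don't take the exact calls as gospel):

import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.ddl.ColumnFamilyDefinition;
import me.prettyprint.hector.api.ddl.ComparatorType;
import me.prettyprint.hector.api.factory.HFactory;

public class SchemaSetup {

    static void createSchema2ColumnFamilies(Cluster cluster, String keyspaceName) {
        // "Timestamp:ID" version: Long timestamp component first, then the ID string.
        ColumnFamilyDefinition tsFirst = HFactory.createColumnFamilyDefinition(
                keyspaceName, "data_by_type_ts_id", ComparatorType.COMPOSITETYPE);
        tsFirst.setComparatorTypeAlias("(LongType, UTF8Type)");
        cluster.addColumnFamily(tsFirst);

        // "ID:Timestamp" version: ID string component first, then the Long timestamp.
        ColumnFamilyDefinition idFirst = HFactory.createColumnFamilyDefinition(
                keyspaceName, "data_by_type_id_ts", ComparatorType.COMPOSITETYPE);
        idFirst.setComparatorTypeAlias("(UTF8Type, LongType)");
        cluster.addColumnFamily(idFirst);
    }
}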

Cheers,
Alex


On Sat, Mar 31, 2012 at 9:35 PM, aaron morton <aa...@thelastpickle.com> wrote:

> Can you post the details of the queries you are running, including the
> methodology of the tests ?
>
> (Here is the methodology I used to time queries previously
> http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/)
>
> Cheers
>
>
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 31/03/2012, at 1:29 AM, Alexandru Sicoe wrote:
>
> Hi guys,
>  I am consistently seeing a 20% improvement in query retrieval times if I
> use the composite comparator "Timestamp:ID" instead of "ID:Timestamp" where
> Timestamp=Long and ID=~100 character strings. I am retrieving all columns
> (~1 million) from a single row. Why is this happening?
>
> Cheers,
> Alex
>
>
>