Posted to user@cassandra.apache.org by Stefan Fuchs <St...@willhaben.at> on 2017/04/19 12:33:08 UTC

Cassandra Client very CPU intensive

We are using Spark to aggregate our Cassandra data. We noticed that even quite simple jobs are far too slow and narrowed the problem down to fetching data from C*. It would not be surprising if most of the time were spent fetching the data. However, for a query (using sparkContext.cassandraTable... with a direct cache afterwards) that takes around a minute, Spark is only connected to Cassandra for 10 seconds.

To see whether this is a Spark-specific issue, we ran a similar query directly with cqlsh, which also spends only a fraction of its time waiting for the server:
time cqlsh --request-timeout 90 192.168.0.189 -e "PAGING OFF; CONSISTENCY LOCAL_ONE; TRACING ON; select ev_time, event_type_id from bds.ad_event_history where ev_time_slice = '2017-04-01 00:00:00+0000' and ev_time_slice_bucket = 0 limit 100000;" > /dev/null 2> /dev/null

real    0m10.060s
user    0m8.864s
sys    0m0.783s

Checking top in parallel shows that cqlsh occupies a whole core for a couple of seconds. Judging by the figures, in this case too only about a second is spent waiting for the server.
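
As a rough cross-check of the figures above, here is a minimal sketch of the arithmetic. It assumes cqlsh is effectively single-threaded for this query, so that user + sys approximates the CPU time of one core; the variable names are only illustrative.

```python
# Figures copied from the `time` output above, in seconds.
real = 10.060      # wall-clock time
user = 8.864       # CPU time spent in user space
sys_time = 0.783   # CPU time spent in the kernel

# Assumption: a single busy thread, so CPU time cannot exceed wall-clock time.
cpu = user + sys_time
# Time the process was not on a CPU: an upper bound on waiting for the server.
not_on_cpu = real - cpu

print(f"CPU time:   {cpu:.3f} s ({cpu / real:.0%} of wall clock)")
print(f"Not on CPU: {not_on_cpu:.3f} s")
```

Under that assumption the client is busy on the CPU for well over 90% of the run, and less than a second is left for waiting on the network or the server.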

So the Cassandra client seems to do quite a lot of work on the data it receives. Are there any tweaks? Or can someone at least explain what is going on there? Holding and processing 100k rows with 3 UUIDs, 5 ints and one timestamp should not take 10x as long as fetching the data over the network, should it?

Re: Cassandra Client very CPU intensive

Posted by Stefan Fuchs <St...@willhaben.at>.

Ah, by the way: we are using Cassandra 3.10 and the Spark Cassandra Connector version 2.0.0.