Posted to user@cassandra.apache.org by Bhuvan Rawal <bh...@gmail.com> on 2016/09/25 17:06:01 UTC

Iterating over a table with multiple producers [Python]

Hi,

It's a common occurrence that a full scan of a Cassandra table is required. One
of the most common requirements is to get the count of rows in a table. As
Cassandra doesn't store a row count anywhere (a node has no knowledge of
writes happening on other nodes), aggregating with count(*) effectively makes
every node ship its rows to the coordinator, which is not really recommended
and adds pressure on the coordinator's heap.

Another common use case is filtering by a column on which no secondary
index exists. It may not be efficient to create a secondary index for just a
one-off requirement; a single full scan can again resolve it.

I started with scans using a single producer, but it became pretty time
consuming as the table grew. Taking motivation from the Java driver test cases
(DataStax Java Driver
<https://github.com/datastax/java-driver/blob/3.x/driver-core/src/test/java/com/datastax/driver/core/TokenIntegrationTest.java#L94>)
I moved to a multi-producer scan over token ranges, sketched below.
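
For illustration only, here is a minimal sketch of the idea (not the exact
code from the gist linked further down): split the Murmur3 token ring into
equal slices and let a pool of worker processes scan each slice with a
token()-bounded query. The keyspace, table, partition key column (demo.users,
id), contact point and worker count are placeholders, not values from my setup.

from multiprocessing import Pool
from cassandra.cluster import Cluster

KEYSPACE = 'demo'       # hypothetical keyspace
TABLE = 'users'         # hypothetical table
PARTITION_KEY = 'id'    # hypothetical partition key column
WORKERS = 50
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1   # Murmur3Partitioner token ring

def split_ring(parts):
    # Split the full Murmur3 token ring into `parts` contiguous slices.
    step = (MAX_TOKEN - MIN_TOKEN) // parts
    edges = [MIN_TOKEN + i * step for i in range(parts)] + [MAX_TOKEN]
    return list(zip(edges[:-1], edges[1:]))

def scan_range(token_range):
    # Each worker opens its own connection: driver sessions are not safe
    # to share across forked processes.
    start, end = token_range
    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect(KEYSPACE)
    # Strict lower bound skips a row sitting exactly on the slice edge's
    # lowest token; negligible for a sketch.
    query = ("SELECT {pk} FROM {tbl} WHERE token({pk}) > %s AND token({pk}) <= %s"
             .format(pk=PARTITION_KEY, tbl=TABLE))
    # The driver pages through results transparently; here we just count rows,
    # but any per-row work (e.g. client-side filtering) fits in this loop.
    count = sum(1 for _ in session.execute(query, (start, end)))
    cluster.shutdown()
    return count

if __name__ == '__main__':
    ranges = split_ring(WORKERS)
    with Pool(WORKERS) as pool:
        total = sum(pool.map(scan_range, ranges))
    print('total rows scanned:', total)

Summing the per-range counts also gives a distributed alternative to a single
count(*), without funnelling every row through one coordinator.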

This approach gave pretty interesting results (of course depending on the
client machine and cluster size) and I thought of sharing it with @users. We
have achieved in excess of 1.5 million rows per second scanned using 50
workers on a 6-node cluster, which was pretty cool. It can be made faster with
larger clusters and a better client machine. This approach has been tested on
a 710 million row table, and the scan took 473 seconds without overwhelming
the Cassandra nodes.

This has been discussed in detail on my blog
<http://casualreflections.io/tech/cassandra/python/Multiprocess-Producer-Cassandra-Python>;
sample code is on GitHub
<https://gist.github.com/bhuvanrawal/93c5ae6cdd020de47e0981d36d2c0785>.
Feel free to reach out if I can help, or if there is a better way to do this.

Regards,
Bhuvan

# Note - 1. The paging state reinjection feature, new in the 3.7 driver, is
used in case of an exception, which makes failover pretty easy (see the
sketch below).
# 2. This may not beat Spark, but if you don't have Spark infrastructure set
up in locality with Cassandra, this can be a pretty good way to get things
done quickly.
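
As a rough illustration of note 1, here is a hedged sketch of re-injecting the
driver's paging_state after a transient error, so a worker can resume from the
last good page instead of restarting its whole token range. The keyspace,
table, column, token range and fetch size below are placeholders.

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('demo')            # hypothetical keyspace

# One example token range; in the full scan each worker gets its own slice.
query = "SELECT id FROM users WHERE token(id) > %s AND token(id) <= %s"
params = (0, 2 ** 60)
statement = SimpleStatement(query, fetch_size=5000)

paging_state = None
count = 0
while True:
    try:
        result = session.execute(statement, params, paging_state=paging_state)
    except Exception:
        # Transient failure: retry the same page by re-injecting the last
        # saved paging_state (bound the retries / back off in real code).
        continue
    count += len(result.current_rows)        # consume only the fetched page
    if result.paging_state is None:          # last page reached
        break
    paging_state = result.paging_state       # remember where the next page starts

print('rows in range:', count)
cluster.shutdown()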