Posted to user@cassandra.apache.org by Gaurav Bhatnagar <ga...@gmail.com> on 2015/02/24 18:50:46 UTC

how to scan all rows of cassandra using multiple threads

Hi,
     I have a Cassandra cluster of 3 nodes holding around 300 million rows
of items. The replication factor is 3, and reads and writes use QUORUM
consistency. I want to scan all rows of the database to generate the sum of
items having the value "available" in the column state and the value "batch1"
in the column batch. The row key for an item is a 15-digit random number.
    I want to do this processing in multiple threads: for instance, one
thread generates the sum for one portion of the data, another thread
generates the sum for a disjoint portion, and I then add up the totals
from these 2 threads to get the final sum.
    What are the possible ways to achieve this? Can I use the concept of
virtual nodes here? Each node owns a set of virtual nodes.
     Can I get the data owned by a particular node, generate the sums on
the different nodes by iterating over the data in their virtual nodes, and
then get the total by adding up the sums from all virtual nodes?

Regards,
Gaurav

Re: how to scan all rows of cassandra using multiple threads

Posted by Clint Kelly <cl...@gmail.com>.
Hi Gaurav,

I recommend you just run a MapReduce job for this computation.

Alternatively, you can look at the code for the C* MapReduce input format:

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/hadoop/cql3/CqlInputFormat.java

That should give you what you need to iterate over independent token ranges.

If you want, you can also just divide up the total token range for the
partitioner you are using into equal chunks and have each of your threads
execute a separate scan.
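
As a rough sketch of that last option, the snippet below splits the full
token range into equal chunks and builds one CQL query per chunk. It assumes
the cluster uses Murmur3Partitioner (tokens are signed 64-bit integers); the
table name `items` and partition key name `id` are hypothetical stand-ins,
and the helper names are made up for illustration.

```python
# Assumption: Murmur3Partitioner, whose tokens are signed 64-bit integers.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def split_token_range(num_splits, lo=MIN_TOKEN, hi=MAX_TOKEN):
    """Return num_splits contiguous (start, end) pairs covering lo..hi."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    total = hi - lo
    # Integer arithmetic keeps the boundaries exact for 64-bit values.
    bounds = [lo + (total * i) // num_splits for i in range(num_splits)]
    bounds.append(hi)
    return list(zip(bounds[:-1], bounds[1:]))


def range_query(start, end, first=False):
    """Build the scan for one chunk.

    "items" and "id" are hypothetical names. The start is exclusive except
    for the first chunk, so that adjacent chunks never overlap.
    """
    op = ">=" if first else ">"
    return (
        "SELECT state, batch FROM items "
        f"WHERE token(id) {op} {start} AND token(id) <= {end}"
    )


if __name__ == "__main__":
    for i, (start, end) in enumerate(split_token_range(4)):
        print(range_query(start, end, first=(i == 0)))
```

Each thread would then run one of these queries through whatever driver you
use, count the matching rows on the client side, and hand its partial result
back to be added up.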

Best regards,
Clint



Re: how to scan all rows of cassandra using multiple threads

Posted by mck <mc...@apache.org>.
> Can I get data owned by a particular node and this way generate sum
> on different nodes by iterating over data from virtual nodes and later
> generate total sum by doing sum of data from all virtual nodes.


You're pretty much describing a map/reduce job using CqlInputFormat.
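
To show the shape of that computation without Hadoop, here is a minimal
sketch in plain Python. It is not CqlInputFormat itself: the splits are
in-memory lists of dicts standing in for the rows each token-range scan
would return, the function names are hypothetical, and it assumes "sum of
items" means a count of matching rows.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matching(rows):
    """Map step: count the rows of one split that pass both filters."""
    return sum(
        1
        for row in rows
        if row["state"] == "available" and row["batch"] == "batch1"
    )


def parallel_count(splits, workers=2):
    """Run the map step on each split in a thread pool, then reduce by summing."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_matching, splits))


if __name__ == "__main__":
    # Stand-in data: two disjoint splits, as two threads would see them.
    splits = [
        [{"state": "available", "batch": "batch1"},
         {"state": "sold", "batch": "batch1"}],
        [{"state": "available", "batch": "batch1"},
         {"state": "available", "batch": "batch2"}],
    ]
    print(parallel_count(splits))
```

Because the splits are disjoint, the partial counts can simply be added; no
coordination between the threads is needed beyond collecting the results.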