Posted to user@cassandra.apache.org by Philip Nelson <ph...@yahoo.com> on 2012/08/03 20:18:06 UTC

How to process new rows in parallel?

Hello,

I am using a Column Family in Cassandra to store incoming messages, which arrive at a high rate (100s of thousands per second). I then have a process wake up periodically to work on those messages, and then delete them. I'd like to understand how I could have multiple processes running, each pulling off a bunch of messages in parallel. It would be nice to be able to add processes dynamically, and not have to explicitly assign message ranges to various processes.

Any suggestions on how to ensure that each process pulls off a different bunch of messages? Any recommended design patterns? I was going to look at qsandra too, for inspiration. Would this be worthwhile?

If this were a relational database, I would have the processes lock the table (or perhaps a row), set flags on a row indicating that it's being "processed", and then unlock. Processes would choose messages by SELECTing on unflagged messages. I'm not sure how this might map to Cassandra; I realise it may not. Even if I configure the cluster such that setting a flag on a row requires all nodes to be written, two processes could still race setting that flag, right?
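To illustrate what I mean (table and column names are made up, and I'm assuming something like PostgreSQL via psycopg2), a worker might claim a batch like this:

    # Sketch of the lock-flag-unlock idea above; the "messages" table
    # (id, payload, processed) and connection details are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=queue user=worker")

    def claim_batch(batch_size=100):
        # One transaction per claim: the row locks are held until commit.
        with conn, conn.cursor() as cur:
            cur.execute(
                "SELECT id, payload FROM messages"
                " WHERE processed = FALSE"
                " ORDER BY id LIMIT %s"
                " FOR UPDATE",          # row locks in place of a table lock
                (batch_size,),
            )
            rows = cur.fetchall()
            if rows:
                cur.execute(
                    "UPDATE messages SET processed = TRUE WHERE id = ANY(%s)",
                    ([r[0] for r in rows],),
                )
            return rows                 # commit on exit releases the locks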

I am open to the idea of storing the messages in wide rows, if that would help.
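For example (purely illustrative names, sketched with the DataStax Python driver), I picture one wide row per shard and time bucket:

    # Illustrative wide-row layout only: messages are clustered inside a
    # partition keyed by (shard, time bucket). All names are made up.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("queue_ks")
    session.execute("""
        CREATE TABLE IF NOT EXISTS inbox (
            shard      int,        -- which worker bucket a message lands in
            bucket     timestamp,  -- coarse time bucket, e.g. one per minute
            message_id timeuuid,   -- orders messages within the wide row
            payload    blob,
            PRIMARY KEY ((shard, bucket), message_id)
        )
    """)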

Thanks,

Philip

Re: How to process new rows in parallel?

Posted by Milind Parikh <mi...@gmail.com>.
Kafka is relatively stable and has an active, well-supported newsgroup as
well.

As Brian discussed, you would be inverting the store-then-process paradigm.
In your original approach you store the messages first and process them
after the fact; in the Kafka model, you would process the messages as they
come in.
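
For illustration (topic, group, and broker names are made up, and this uses the kafka-python client), process-as-they-arrive looks roughly like:

    # Rough sketch of processing messages as they arrive from Kafka.
    # Topic, group id and broker address are placeholders.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "incoming-messages",            # hypothetical topic
        group_id="message-processors",  # workers in one group split the partitions
        bootstrap_servers=["broker1:9092"],
    )

    for record in consumer:
        # Each worker receives a disjoint set of partitions, so adding or
        # removing processes simply rebalances the work.
        handle(record.value)            # handle() stands in for your processing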

Since you are thinking about parallelism anyway, I trust that your
processing paradigm is inherently parallelizable.

Regards
Milind




On Fri, Aug 3, 2012 at 12:22 PM, Philip Nelson <
philipomailbox-cass@yahoo.com> wrote:

> Brian -- thanks.
>
> > We were looking to do the same thing, but in the end decided
> > to go with Kafka.
> > Given your throughput requirements, Kafka might be a good
> > option for you as well.
>
> This might be off-topic, so I'll keep it short. Kafka is reasonably
> stable? Mature (I see it's in the Incubator)? Relative to Cassandra?
>
> Philip
>
>
>

Re: How to process new rows in parallel?

Posted by Philip Nelson <ph...@yahoo.com>.
Brian -- thanks.

> We were looking to do the same thing, but in the end decided
> to go with Kafka.
> Given your throughput requirements, Kafka might be a good
> option for you as well.

This might be off-topic, so I'll keep it short. Kafka is reasonably stable? Mature (I see it's in the Incubator)? Relative to Cassandra?

Philip



Re: How to process new rows in parallel?

Posted by Brian O'Neill <bo...@alumni.brown.edu>.
If you are deleting the messages after processing, it sounds like you
are using Cassandra as a work queue.

Here are some links for implementing a distributed queue in Cassandra:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Distributed-work-queues-td5226248.html
http://comments.gmane.org/gmane.comp.db.cassandra.user/16633

There is a placeholder on the use cases wiki for this, but no info:
http://wiki.apache.org/cassandra/UseCases#A_distributed_Priority_Job_Queue
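
One common approach is to shard the queue so that each worker only ever reads
its own shards. A rough sketch (illustrative names, DataStax Python driver,
assuming a table keyed by ((shard, bucket), message_id)):

    # Sketch only: each worker drains the shards assigned to it, so workers
    # never compete for the same rows. Table and function names are made up.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("queue_ks")
    select = session.prepare(
        "SELECT message_id, payload FROM inbox WHERE shard = ? AND bucket = ?")
    delete = session.prepare(
        "DELETE FROM inbox WHERE shard = ? AND bucket = ? AND message_id = ?")

    def drain(my_shards, bucket):
        # e.g. worker 0 of 4 owns shards 0, 4, 8, ...; you scale out by
        # reassigning shards rather than explicit message ranges.
        for shard in my_shards:
            for row in session.execute(select, (shard, bucket)):
                process(row.payload)    # process() stands in for the real work
                session.execute(delete, (shard, bucket, row.message_id))

The delete-heavy access pattern is the painful part in Cassandra (lots of
tombstones on hot rows), which is one reason a log-oriented system fits this
workload well.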

We were looking to do the same thing, but in the end decided to go with Kafka.
Given your throughput requirements, Kafka might be a good option for
you as well.

-brian


On Fri, Aug 3, 2012 at 2:18 PM, Philip Nelson
<ph...@yahoo.com> wrote:
> Hello,
>
> I am using a Column Family in Cassandra to store incoming messages, which arrive at a high rate (100s of thousands per second). I then have a process wake up periodically to work on those messages, and then delete them. I'd like to understand how I could have multiple processes running, each pulling off a bunch of messages in parallel. It would be nice to be able to add processes dynamically, and not have to explicitly assign message ranges to various processes.
>
> Any suggestions on how to ensure that each process pulls off a different bunch of messages? Any recommended design patterns? I was going to look at qsandra too, for inspiration. Would this be worthwhile?
>
> If this were a relational database, I would have the processes lock the table (or perhaps a row), set flags on a row indicating that it's being "processed", and then unlock. Processes would choose messages by SELECTing on unflagged messages. I'm not sure how this might map to Cassandra; I realise it may not. Even if I configure the cluster such that setting a flag on a row requires all nodes to be written, two processes could still race setting that flag, right?
>
> I am open to the idea of storing the messages in wide rows, if that would help.
>
> Thanks,
>
> Philip



-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/