You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by Kevin Burton <bu...@spinn3r.com> on 2014/06/07 00:27:13 UTC

Data model for streaming a large table in real time.

We have the requirement to have clients read from our tables while they're
being written.

Basically, any write that we make to cassandra needs to be sent out over
the Internet to our customers.

We also need them to resume so if they go offline, they can just pick up
where they left off.

They need to do this in parallel, so if we have 20 cassandra nodes, they
can have 20 readers each efficiently (and without coordination) reading
from our tables.

Here's how we're planning on doing it.

We're going to use the ByteOrderedPartitioner .

I'm writing with a primary key of the timestamp, however, in practice, this
would yield hotspots.

(I'm also aware that time isn't a very good pk in a distribute system as I
can easily have a collision so we're going to use a scheme similar to a
uuid to make it unique per writer).

One node would take all the load, followed by the next node, etc.

So my plan to stop this is to prefix a slice ID to the timestamp.  This way
each piece of content has a unique ID, but the prefix will place it on a
node.

The slide ID is just a byte… so this means there are 255 buckets in which I
can place data.

This means I can have clients each start with a slice, and a timestamp, and
page through the data with tokens.

This way I can have a client reading with 255 threads from 255 regions in
the cluster, in parallel, without any hot spots.

Thoughts on this strategy?

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.

Re: Data model for streaming a large table in real time.

Posted by DuyHai Doan <do...@gmail.com>.

 "One node would take all the load, followed by the next node" --> with
this design, you are not exploiting all the power of the cluster. If only
one node takes all the load at a time, what is the point having 20 or 10
nodes ?

 You'd better off using limited wide row with bucketing to achieve this.

 You can have a look at this past thread, it may give you some ideas:
https://www.mail-archive.com/user@cassandra.apache.org/msg35666.html




On Sat, Jun 7, 2014 at 12:27 AM, Kevin Burton <bu...@spinn3r.com> wrote:

> We have the requirement to have clients read from our tables while they're
> being written.
>
> Basically, any write that we make to cassandra needs to be sent out over
> the Internet to our customers.
>
> We also need them to resume so if they go offline, they can just pick up
> where they left off.
>
> They need to do this in parallel, so if we have 20 cassandra nodes, they
> can have 20 readers each efficiently (and without coordination) reading
> from our tables.
>
> Here's how we're planning on doing it.
>
> We're going to use the ByteOrderedPartitioner .
>
> I'm writing with a primary key of the timestamp, however, in practice,
> this would yield hotspots.
>
> (I'm also aware that time isn't a very good pk in a distribute system as I
> can easily have a collision so we're going to use a scheme similar to a
> uuid to make it unique per writer).
>
> One node would take all the load, followed by the next node, etc.
>
> So my plan to stop this is to prefix a slice ID to the timestamp.  This
> way each piece of content has a unique ID, but the prefix will place it on
> a node.
>
> The slide ID is just a byte… so this means there are 255 buckets in which
> I can place data.
>
> This means I can have clients each start with a slice, and a timestamp,
> and page through the data with tokens.
>
> This way I can have a client reading with 255 threads from 255 regions in
> the cluster, in parallel, without any hot spots.
>
> Thoughts on this strategy?
>
> --
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> Skype: *burtonator*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> <https://plus.google.com/102718274791889610666/posts>
> <http://spinn3r.com>
> War is peace. Freedom is slavery. Ignorance is strength. Corporations are
> people.
>
>

Re: Data model for streaming a large table in real time.

Posted by Colin Clark <co...@clark.ws>.

Write Consistency Level + Read Consistency Level > Replication Factor
ensure your reads will read consistently and having 3 nodes lets you
achieve redundancy in event of node failure.

So writing with CL of local quorum and reading with CL of local quorum
(2+2>3) with replication factor of 3 ensures reads and protection against
losing a node.

In event of losing a node, you can downgrade the CL automatically and then
also accept a little eventual consistency.


--
Colin
320-221-9531


On Jun 7, 2014, at 10:03 PM, James Campbell <ja...@breachintelligence.com>
wrote:

 This is a basic question, but having heard that advice before, I'm curious
about why the minimum recommended replication factor is three? Certainly
additional redundancy, and, I believe, a minimum threshold for paxos. Are
there other reasons?
On Jun 7, 2014 10:52 PM, Colin <co...@gmail.com> wrote:
 To have any redundancy in the system, start with at least 3 nodes and a
replication factor of 3.

 Try to have at least 8 cores, 32 gig ram, and separate disks for log and
data.

 Will you be replicating data across data centers?

-- 
Colin
320-221-9531


On Jun 7, 2014, at 9:40 PM, Kevin Burton <bu...@spinn3r.com> wrote:

  Oh.. To start with we're going to use from 2-10 nodes..

 I think we're going to take the original strategy and just to use 100
buckets .. 0-99… then the timestamp under that..  I think it should be fine
and won't require an ordered partitioner. :)

 Thanks!


On Sat, Jun 7, 2014 at 7:38 PM, Colin Clark <co...@clark.ws> wrote:

>  With 100 nodes, that ingestion rate is actually quite low and I don't
> think you'd need another column in the partition key.
>
>  You seem to be set in your current direction.  Let us know how it works
> out.
>
> --
> Colin
> 320-221-9531
>
>
> On Jun 7, 2014, at 9:18 PM, Kevin Burton <bu...@spinn3r.com> wrote:
>
>   What's 'source' ? You mean like the URL?
>
>  If source too random it's going to yield too many buckets.
>
>  Ingestion rates are fairly high but not insane.  About 4M inserts per
> hour.. from 5-10GB…
>
>
> On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark <co...@clark.ws> wrote:
>
>>  Not if you add another column to the partition key; source for example.
>>
>>
>>  I would really try to stay away from the ordered partitioner if at all
>> possible.
>>
>>  What ingestion rates are you expecting, in size and speed.
>>
>> --
>> Colin
>> 320-221-9531
>>
>>
>> On Jun 7, 2014, at 9:05 PM, Kevin Burton <bu...@spinn3r.com> wrote:
>>
>>
>>  Thanks for the feedback on this btw.. .it's helpful.  My notes below.
>>
>> On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark <co...@clark.ws> wrote:
>>
>>>  No, you're not-the partition key will get distributed across the
>>> cluster if you're using random or murmur.
>>>
>>
>>  Yes… I'm aware.  But in practice this is how it will work…
>>
>>  If we create bucket b0, that will get hashed to h0…
>>
>>  So say I have 50 machines performing writes, they are all on the same
>> time thanks to ntpd, so they all compute b0 for the current bucket based on
>> the time.
>>
>>  That gets hashed to h0…
>>
>>  If h0 is hosted on node0 … then all writes go to node zero for that 1
>> second interval.
>>
>>  So all my writes are bottlenecking on one node.  That node is
>> *changing* over time… but they're not being dispatched in parallel over N
>> nodes.  At most writes will only ever reach 1 node a time.
>>
>>
>>
>>>  You could also ensure that by adding another column, like source to
>>> ensure distribution. (Add the seconds to the partition key, not the
>>> clustering columns)
>>>
>>>  I can almost guarantee that if you put too much thought into working
>>> against what Cassandra offers out of the box, that it will bite you later.
>>>
>>>
>>  Sure.. I'm trying to avoid the 'bite you later' issues. More so because
>> I'm sure there are Cassandra gotchas to worry about.  Everything has them.
>>  Just trying to avoid the land mines :-P
>>
>>
>>>  In fact, the use case that you're describing may best be served by a
>>> queuing mechanism, and using Cassandra only for the underlying store.
>>>
>>
>>  Yes… that's what I'm doing.  We're using apollo to fan out the queue,
>> but the writes go back into cassandra and needs to be read out sequentially.
>>
>>
>>>
>>>  I used this exact same approach in a use case that involved writing
>>> over a million events/second to a cluster with no problems.  Initially, I
>>> thought ordered partitioner was the way to go too.  And I used separate
>>> processes to aggregate, conflate, and handle distribution to clients.
>>>
>>
>>
>>  Yes. I think using 100 buckets will work for now.  Plus I don't have to
>> change the partitioner on our existing cluster and I'm lazy :)
>>
>>
>>>
>>>  Just my two cents, but I also spend the majority of my days helping
>>> people utilize Cassandra correctly, and rescuing those that haven't.
>>>
>>>
>>  Definitely appreciate the feedback!  Thanks!
>>
>>  --
>>
>>  Founder/CEO Spinn3r.com
>>  Location: *San Francisco, CA*
>> Skype: *burtonator*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> <https://plus.google.com/102718274791889610666/posts>
>>  <http://spinn3r.com>
>> War is peace. Freedom is slavery. Ignorance is strength. Corporations are
>> people.
>>
>>
>
>
>  --
>
>  Founder/CEO Spinn3r.com
>  Location: *San Francisco, CA*
> Skype: *burtonator*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> <https://plus.google.com/102718274791889610666/posts>
>  <http://spinn3r.com>
> War is peace. Freedom is slavery. Ignorance is strength. Corporations are
> people.
>
>


 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
 <http://spinn3r.com>
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.

Re: Data model for streaming a large table in real time.

Posted by James Campbell <ja...@breachintelligence.com>.

This is a basic question, but having heard that advice before, I'm curious about why the minimum recommended replication factor is three? Certainly additional redundancy, and, I believe, a minimum threshold for paxos. Are there other reasons?

On Jun 7, 2014 10:52 PM, Colin <co...@gmail.com> wrote:
To have any redundancy in the system, start with at least 3 nodes and a replication factor of 3.

Try to have at least 8 cores, 32 gig ram, and separate disks for log and data.

Will you be replicating data across data centers?

--
Colin
320-221-9531

On Jun 7, 2014, at 9:40 PM, Kevin Burton <bu...@spinn3r.com>> wrote:

Oh.. To start with we're going to use from 2-10 nodes..

I think we're going to take the original strategy and just to use 100 buckets .. 0-99… then the timestamp under that.. I think it should be fine and won't require an ordered partitioner. :)

Thanks!

On Sat, Jun 7, 2014 at 7:38 PM, Colin Clark <co...@clark.ws>> wrote:
With 100 nodes, that ingestion rate is actually quite low and I don't think you'd need another column in the partition key.

You seem to be set in your current direction. Let us know how it works out.

--
Colin
320-221-9531<tel:320-221-9531>

On Jun 7, 2014, at 9:18 PM, Kevin Burton <bu...@spinn3r.com>> wrote:

What's 'source' ? You mean like the URL?

If source too random it's going to yield too many buckets.

Ingestion rates are fairly high but not insane. About 4M inserts per hour.. from 5-10GB…

On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark <co...@clark.ws>> wrote:
Not if you add another column to the partition key; source for example.

I would really try to stay away from the ordered partitioner if at all possible.

What ingestion rates are you expecting, in size and speed.

--
Colin
320-221-9531<tel:320-221-9531>

On Jun 7, 2014, at 9:05 PM, Kevin Burton <bu...@spinn3r.com>> wrote:

Thanks for the feedback on this btw.. .it's helpful. My notes below.

On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark <co...@clark.ws>> wrote:
No, you're not-the partition key will get distributed across the cluster if you're using random or murmur.

Yes… I'm aware. But in practice this is how it will work…

If we create bucket b0, that will get hashed to h0…

So say I have 50 machines performing writes, they are all on the same time thanks to ntpd, so they all compute b0 for the current bucket based on the time.

That gets hashed to h0…

If h0 is hosted on node0 … then all writes go to node zero for that 1 second interval.

So all my writes are bottlenecking on one node. That node is *changing* over time… but they're not being dispatched in parallel over N nodes. At most writes will only ever reach 1 node a time.

You could also ensure that by adding another column, like source to ensure distribution. (Add the seconds to the partition key, not the clustering columns)

I can almost guarantee that if you put too much thought into working against what Cassandra offers out of the box, that it will bite you later.

Sure.. I'm trying to avoid the 'bite you later' issues. More so because I'm sure there are Cassandra gotchas to worry about. Everything has them. Just trying to avoid the land mines :-P

In fact, the use case that you're describing may best be served by a queuing mechanism, and using Cassandra only for the underlying store.

Yes… that's what I'm doing. We're using apollo to fan out the queue, but the writes go back into cassandra and needs to be read out sequentially.

I used this exact same approach in a use case that involved writing over a million events/second to a cluster with no problems. Initially, I thought ordered partitioner was the way to go too. And I used separate processes to aggregate, conflate, and handle distribution to clients.

Yes. I think using 100 buckets will work for now. Plus I don't have to change the partitioner on our existing cluster and I'm lazy :)

Just my two cents, but I also spend the majority of my days helping people utilize Cassandra correctly, and rescuing those that haven't.

Definitely appreciate the feedback! Thanks!

Founder/CEO Spinn3r.com<http://Spinn3r.com>
Location: San Francisco, CA
Skype: burtonator
blog: http://burtonator.wordpress.com
… or check out my Google+ profile<https://plus.google.com/102718274791889610666/posts>
[http://spinn3r.com/images/spinn3r.jpg]<http://spinn3r.com>
War is peace. Freedom is slavery. Ignorance is strength. Corporations are people.