Posted to dev@cassandra.apache.org by William Katsak <wk...@cs.rutgers.edu> on 2013/04/09 23:16:35 UTC

Re: Streaming RowMutations (and possibly merging them)

Hello,

I apologize for my very vague email; I shouldn't have written it in such 
a hurry. I would like to clarify my use case and requirements so that 
someone can perhaps give me some advice.

I am building a research version of Cassandra in which a missed write is 
a normal case (e.g. out of n replicas, it would be a normal case for at 
least one of these to miss a write). I keep track of missed writes 
similar to how default Cassandra does for HintedHandoff (a column family 
in system that stores serialized RowMutations). Later, when the nodes 
that were missed are ready to receive writes again, the node caching the 
RowMutations sends them one at a time until they have all been delivered. 
This all happens in the context of a live, serving system.

My system works and does what it is supposed to; now I am trying to 
improve performance. I currently have two optimizations in mind, but am 
not sure how to approach them:

1) Minimize the transfer of excessive RowMutations by merging all 
RowMutations for the same key, and transmitting only one per key. In the 
event that a subset of keys is very popular, I can minimize how much I 
need to transfer to bring a node back up to date. I am thinking I can go 
inside the RowMutation and merge each ColumnFamily, then create a new 
RowMutation with the merged CFs. Is ColumnFamily.diff() the right way to 
merge an individual CF, or am I misunderstanding it?

2) Serialize a whole bunch of RowMutations into a chunk, stream the 
chunk to the appropriate node, deserialize them, and apply them 
individually. In this case, I would avoid having to wait for an ACK on 
each mutation, and could more efficiently send lots of data. Is this 
feasible with the existing streaming infrastructure, or would I have to 
implement a new facility?
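
To make the two ideas concrete, here is a minimal plain-Java sketch of 
both steps. It deliberately does not use Cassandra's classes: Mutation, 
Cell, coalesce, encodeChunk and decodeChunk are hypothetical stand-ins, 
and the wire format is invented for illustration. It assumes that the 
newest timestamp wins per column, and it does not model deletions or 
timestamp ties, which a real merge would have to handle. It also says 
nothing about whether ColumnFamily.diff() is the right call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HintBatchSketch {

    /** Hypothetical stand-in for one column write: a value and its write timestamp. */
    static final class Cell {
        final String value;
        final long timestamp;
        Cell(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    /** Hypothetical stand-in for a RowMutation: one row key plus the columns it writes. */
    static final class Mutation {
        final String key;
        final Map<String, Cell> columns = new TreeMap<String, Cell>();
        Mutation(String key) { this.key = key; }
        Mutation put(String column, String value, long timestamp) {
            columns.put(column, new Cell(value, timestamp));
            return this;
        }
    }

    /** Idea 1: one mutation per key; per column, the strictly newer timestamp wins. */
    static List<Mutation> coalesce(List<Mutation> saved) {
        Map<String, Mutation> byKey = new LinkedHashMap<String, Mutation>();
        for (Mutation m : saved) {
            Mutation merged = byKey.get(m.key);
            if (merged == null) {
                merged = new Mutation(m.key);
                byKey.put(m.key, merged);
            }
            for (Map.Entry<String, Cell> e : m.columns.entrySet()) {
                Cell current = merged.columns.get(e.getKey());
                if (current == null || e.getValue().timestamp > current.timestamp) {
                    merged.columns.put(e.getKey(), e.getValue());
                }
            }
        }
        return new ArrayList<Mutation>(byKey.values());
    }

    /** Idea 2: pack many mutations into one count-prefixed chunk (made-up format). */
    static byte[] encodeChunk(List<Mutation> mutations) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(mutations.size());
            for (Mutation m : mutations) {
                out.writeUTF(m.key); // writeUTF caps each string at 65535 encoded bytes
                out.writeInt(m.columns.size());
                for (Map.Entry<String, Cell> e : m.columns.entrySet()) {
                    out.writeUTF(e.getKey());
                    out.writeUTF(e.getValue().value);
                    out.writeLong(e.getValue().timestamp);
                }
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for in-memory streams
        }
    }

    /** Receiver side: unpack the chunk so each mutation can be applied individually. */
    static List<Mutation> decodeChunk(byte[] chunk) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(chunk));
            int count = in.readInt();
            List<Mutation> result = new ArrayList<Mutation>(count);
            for (int i = 0; i < count; i++) {
                Mutation m = new Mutation(in.readUTF());
                int columnCount = in.readInt();
                for (int c = 0; c < columnCount; c++) {
                    String name = in.readUTF();
                    String value = in.readUTF();
                    long timestamp = in.readLong();
                    m.put(name, value, timestamp);
                }
                result.add(m);
            }
            return result;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<Mutation> saved = new ArrayList<Mutation>();
        saved.add(new Mutation("user:1").put("name", "old", 10L));
        saved.add(new Mutation("user:2").put("name", "bob", 11L));
        saved.add(new Mutation("user:1").put("name", "new", 12L));
        byte[] chunk = encodeChunk(coalesce(saved));
        System.out.println(saved.size() + " saved, " + decodeChunk(chunk).size()
                + " sent in one chunk of " + chunk.length + " bytes");
    }
}
```

With this shape the sender would wait for a single ACK per chunk rather 
than one per mutation, and a hot key costs one merged mutation no matter 
how many writes it missed.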

Again, my codebase is on top of Cassandra 1.1.6. I would very much 
appreciate any insight anyone could give me.

Thanks very much,
Bill Katsak

On 04/08/2013 12:10 PM, William Katsak wrote:
> Hello,
>
> I am sorry to bother the list with this question, but I was wondering, 
> assuming I have many saved (small) mutations (of the type that hinted 
> handoff uses), is there any easy way to put these all together and 
> bulk transmit (stream) them to a destination node?
>
> My codebase is based on Cassandra 1.1.6.
>
> Thanks very much in advance,
> Bill Katsak
>
>
>


Re: Streaming RowMutations (and possibly merging them)

Posted by Jonathan Ellis <jb...@gmail.com>.
You can probably leverage the bulk writer API. Look at
SSTableSimpleUnsortedWriter, for example.


-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced