Posted to user@cassandra.apache.org by Ben Hood <0x...@gmail.com> on 2012/09/20 22:29:35 UTC

Using the commit log for external synchronization

Hi,

I'd like to incrementally synchronize data written to Cassandra into
an external store without having to maintain an index to do this, so I
was wondering whether anybody is using the commit log to establish
what updates have taken place since a given point in time?

Cheers,

Ben

Re: Using the commit log for external synchronization

Posted by Ben Hood <0x...@gmail.com>.
Rob,

On Sep 22, 2012, at 0:39, Rob Coli <rc...@palominodb.com> wrote:

> The above gets you most of the way there, but Aaron's point about the
> commitlog not reflecting whether the app met its CL remains true. The
> possibility that Cassandra might coalesce to a value that the
> application does not know was successfully written is one of its known
> edge cases...

Thanks for pointing out the possibility of using the replay facility. I'll take on board your observation that the CL is not guaranteed to give me the data I want (quite apart from the fact that I would be building a dependency on an internal API).

Cheers,

Ben

Re: Using the commit log for external synchronization

Posted by Rob Coli <rc...@palominodb.com>.
On Fri, Sep 21, 2012 at 4:31 AM, Ben Hood <0x...@gmail.com> wrote:
> So if I understand you correctly, one shouldn't code against what is
> essentially an internal artefact that could be subject to change as
> the Cassandra code base evolves and furthermore may not contain the
> information an application thinks it should contain.

Pretty much.

> So in summary, given that there is no out of the box way of saying to
> Cassandra "give me all mutations since timestamp X", I would either
> have to go for an event driven approach or reconsider the layout of
> the Cassandra store such that I could reconcile it in an efficient
> fashion.

With:

https://issues.apache.org/jira/browse/CASSANDRA-3690 - "Streaming
CommitLog backup"

you can stream your commitlog off-node as you write it. You can then
restore this commitlog and tell Cassandra to replay it "until" a
certain time by using "restore_point_in_time". But without:

https://issues.apache.org/jira/browse/CASSANDRA-4392 - "Create a tool
that will convert a commit log into a series of readable CQL
statements"

you are unable to skip bad transactions, so if you want to
roll forward but skip a TRUNCATE, you are out of luck.

The above gets you most of the way there, but Aaron's point about the
commitlog not reflecting whether the app met its CL remains true. The
possibility that Cassandra might coalesce to a value that the
application does not know was successfully written is one of its known
edge cases...
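
To make the "replay until a point in time, but skip a bad transaction" idea concrete, here is a minimal sketch. It does not use Cassandra's commit log format or replay code: the Mutation record and the replay_until function are hypothetical names invented for illustration, and the log is assumed to have already been decoded into such records.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Mutation:
    """Hypothetical decoded log record; not Cassandra's on-disk format."""
    timestamp: int   # write time, in any monotonically comparable unit
    operation: str   # e.g. "INSERT", "DELETE", "TRUNCATE"
    key: str


def replay_until(records: Iterable[Mutation],
                 point_in_time: int,
                 skip: Callable[[Mutation], bool] = lambda m: False,
                 ) -> Iterator[Mutation]:
    """Yield records written at or before point_in_time, omitting any
    record for which skip(record) is true."""
    for record in records:
        if record.timestamp > point_in_time:
            continue  # written after the restore point
        if skip(record):
            continue  # the "bad transaction" we want to roll past
        yield record


log = [
    Mutation(100, "INSERT", "a"),
    Mutation(200, "TRUNCATE", "*"),
    Mutation(300, "INSERT", "b"),
    Mutation(400, "INSERT", "c"),
]

# Roll forward to time 300 but leave out the TRUNCATE.
replayed = list(replay_until(log, 300, skip=lambda m: m.operation == "TRUNCATE"))
```

Note that a sketch like this still inherits the caveat above: a record being present in the log says nothing about whether the application's write met its consistency level.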

=Rob

-- 
=Robert Coli
AIM&GTALK - rcoli@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb

Re: Using the commit log for external synchronization

Posted by Ben Hood <0x...@gmail.com>.
Hi Aaron,

Thanks for your input.

On Fri, Sep 21, 2012 at 9:56 AM, aaron morton <aa...@thelastpickle.com> wrote:
> The commit log is essentially internal implementation. The total size of the
> commit log is restricted, and the multiple files used to represent segments
> are recycled. So once all the memtables have been flushed for a segment it may
> be overwritten.
>
> To archive the segments see the conf/commitlog_archiving.properties file.
>
> Large rows will bypass the commit log.
>
> A write committed to the commit log may still be considered a failure if CL
> nodes do not succeed.

So if I understand you correctly, one shouldn't code against what is
essentially an internal artefact that could be subject to change as
the Cassandra code base evolves and furthermore may not contain the
information an application thinks it should contain.

> IMHO it's a better design to multiplex the data stream at the application
> level.

That's a fair point, and I could multicast the data at that level. The
reason I was considering querying the commit log is that I would
prefer a state-based synchronization to an event-driven one (which is
what the app-layer multicast and the AOP solution Brian suggested
would be). I'd rather learn from Cassandra what Cassandra thinks it
has than trust an event stream that can only infer what Cassandra
should theoretically hold. The use case I am looking at should be
reconcilable, so I'm trying to avoid trusting that all of the events
were sent correctly, arrived correctly and were written to the target
storage without any bugs. I also want to detect the case where
portions of the data written to the target system get accidentally
updated or nuked via a back door.

So in summary, given that there is no out of the box way of saying to
Cassandra "give me all mutations since timestamp X", I would either
have to go for an event driven approach or reconsider the layout of
the Cassandra store such that I could reconcile it in an efficient
fashion.
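
As a rough illustration of what state-based reconciliation means here, the sketch below compares what each store actually holds rather than trusting a stream of events. Both stores are modelled as plain dictionaries, and the reconcile function is a hypothetical helper, not part of any Cassandra client. A real layout would presumably partition the comparison (for example by time bucket) so that only a small slice needs to be read on each pass.

```python
import hashlib
from typing import Dict, List, Tuple


def digest(value: str) -> str:
    """Fingerprint a value so stores can be compared without shipping it."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def reconcile(source: Dict[str, str],
              target: Dict[str, str]) -> Tuple[List[str], List[str], List[str]]:
    """Return (missing, changed, extra) keys of target relative to source."""
    missing = sorted(k for k in source if k not in target)
    extra = sorted(k for k in target if k not in source)
    changed = sorted(
        k for k in source
        if k in target and digest(source[k]) != digest(target[k])
    )
    return missing, changed, extra


source = {"k1": "v1", "k2": "v2", "k3": "v3"}
# The target lost k3, had k2 altered via a "back door", and gained k9.
target = {"k1": "v1", "k2": "tampered", "k9": "stray"}

missing, changed, extra = reconcile(source, target)
```

Unlike an event stream, this catches out-of-band changes to the target, at the cost of having to read both sides.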

Thanks for your help,

Cheers,

Ben

Re: Using the commit log for external synchronization

Posted by Ben Hood <0x...@gmail.com>.
Brian,

On Sep 22, 2012, at 1:46, "Brian O'Neill" <bo...@alumni.brown.edu> wrote:

>> IMHO it's a better design to multiplex the data stream at the application
>> level.
> +1, agreed.
> 
> That is where we ended up. (and Storm is proving to be a solid
> framework for that)

Thanks for the heads up, I'll check it out.

Cheers,

Ben

Re: Using the commit log for external synchronization

Posted by Brian O'Neill <bo...@alumni.brown.edu>.
> IMHO it's a better design to multiplex the data stream at the application
> level.
+1, agreed.

That is where we ended up. (and Storm is proving to be a solid
framework for that)

-brian


-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
Apache Cassandra MVP
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42

Re: Using the commit log for external synchronization

Posted by aaron morton <aa...@thelastpickle.com>.
The commit log is essentially an internal implementation detail. The total size of the commit log is restricted, and the multiple files used to represent segments are recycled. So once all the memtables have been flushed for a segment, it may be overwritten.

To archive the segments see the conf/commitlog_archiving.properties file. 

Large rows will bypass the commit log. 

A write committed to the commit log may still be considered a failure if the number of nodes required by the consistency level (CL) do not succeed.

IMHO it's a better design to multiplex the data stream at the application level.   
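
A minimal sketch of multiplexing at the application level might look like the following. The MultiplexingWriter class and the sinks are hypothetical stand-ins (a dictionary for the primary store, a list for the external one); no Cassandra or messaging client API is used or implied.

```python
from typing import Callable, Dict, List, Tuple

# A sink is anything that accepts a key and a value.
Sink = Callable[[str, str], None]


class MultiplexingWriter:
    """Send each write to a primary sink first, then to every secondary."""

    def __init__(self, primary: Sink, secondaries: List[Sink]) -> None:
        self._primary = primary
        self._secondaries = list(secondaries)

    def write(self, key: str, value: str) -> None:
        # If the primary raises, the secondaries never see the write.
        self._primary(key, value)
        for sink in self._secondaries:
            sink(key, value)


primary: Dict[str, str] = {}          # stand-in for the main store
audit: List[Tuple[str, str]] = []     # stand-in for the external store

writer = MultiplexingWriter(
    primary.__setitem__,
    [lambda k, v: audit.append((k, v))],
)
writer.write("user:1", "Ben")
```

If a secondary fails after the primary has succeeded, the stores diverge, so this approach needs its own retry or reconciliation story.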
 
Hope that helps.

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 21/09/2012, at 11:51 AM, Brian O'Neill <bo...@alumni.brown.edu> wrote:

> 
> Along those lines...
> 
> We sought to use triggers for external synchronization.   If you read through this issue:
> https://issues.apache.org/jira/browse/CASSANDRA-1311
> 
> You'll see the idea of leveraging a commit log for synchronization, via triggers.
> 
> We went ahead and implemented this concept in:
> https://github.com/hmsonline/cassandra-triggers
> 
> With that, via AOP, you get handed the mutation as things change.  We used it for synchronizing SOLR.  
> 
> fwiw,
> -brian
> 
> 
> 
> On Sep 20, 2012, at 7:18 PM, Michael Kjellman wrote:
> 
>> +1. Would be a pretty cool feature
>> 
>> Right now I write once to cassandra and once to kafka.
>> 
>> On 9/20/12 4:13 PM, "Data Craftsman 木匠" <da...@gmail.com>
>> wrote:
>> 
>>> This will be a good new feature. I guess the development team don't
>>> have time on this yet.  ;)
>>> 
>>> 
>>> On Thu, Sep 20, 2012 at 1:29 PM, Ben Hood <0x...@gmail.com> wrote:
>>>> Hi,
>>>> 
>>>> I'd like to incrementally synchronize data written to Cassandra into
>>>> an external store without having to maintain an index to do this, so I
>>>> was wondering whether anybody is using the commit log to establish
>>>> what updates have taken place since a given point in time?
>>>> 
>>>> Cheers,
>>>> 
>>>> Ben
>>> 
>>> 
>>> 
>>> -- 
>>> Thanks,
>>> 
>>> Charlie (@mujiang) 木匠
>>> =======
>>> Data Architect Developer 汉唐 田园牧歌DBA
>>> http://mujiang.blogspot.com
>> 
>> 
>> 'Like' us on Facebook for exclusive content and other resources on all Barracuda Networks solutions.
>> Visit http://barracudanetworks.com/facebook
>> 
>> 
> 
> -- 
> Brian ONeill
> Lead Architect, Health Market Science (http://healthmarketscience.com)
> mobile:215.588.6024
> blog: http://weblogs.java.net/blog/boneill42/
> blog: http://brianoneill.blogspot.com/
> 


Re: Using the commit log for external synchronization

Posted by Brian O'Neill <bo...@alumni.brown.edu>.
Along those lines...

We sought to use triggers for external synchronization.   If you read through this issue:
https://issues.apache.org/jira/browse/CASSANDRA-1311

You'll see the idea of leveraging a commit log for synchronization, via triggers.

We went ahead and implemented this concept in:
https://github.com/hmsonline/cassandra-triggers

With that, via AOP, you are handed each mutation as it happens. We used it to keep Solr synchronized.
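
The general shape of "being handed the mutation as things change" can be sketched with a decorator that wraps the write path. This is only an analogy for the interception idea: it is not the cassandra-triggers API, and notify_after, put and the listener are hypothetical names.

```python
import functools
from typing import Callable, Dict, List, Tuple

Listener = Callable[[str, str], None]


def notify_after(listeners: List[Listener]):
    """Wrap a write function so each listener sees the write after it succeeds."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(store: Dict[str, str], key: str, value: str):
            result = func(store, key, value)   # perform the real write first
            for listener in listeners:
                listener(key, value)           # then hand the mutation on
            return result
        return wrapper
    return decorate


index_updates: List[Tuple[str, str]] = []  # stand-in for a search index feed


@notify_after([lambda k, v: index_updates.append((k, v))])
def put(store: Dict[str, str], key: str, value: str) -> None:
    store[key] = value


store: Dict[str, str] = {}
put(store, "user:1", "Ben")
```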

fwiw,
-brian



-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Re: Using the commit log for external synchronization

Posted by Michael Kjellman <mk...@barracuda.com>.
+1. Would be a pretty cool feature.

Right now I write once to Cassandra and once to Kafka.


Re: Using the commit log for external synchronization

Posted by Data Craftsman 木匠 <da...@gmail.com>.
This would be a good new feature. I guess the development team hasn't
had time for this yet. ;)


-- 
Thanks,

Charlie (@mujiang) 木匠
=======
Data Architect Developer 汉唐 田园牧歌DBA
http://mujiang.blogspot.com