Posted to commits@cassandra.apache.org by "Joshua McKenzie (JIRA)" <ji...@apache.org> on 2015/09/18 19:36:06 UTC

[jira] [Commented] (CASSANDRA-8844) Change Data Capture (CDC)

    [ https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876020#comment-14876020 ] 

Joshua McKenzie commented on CASSANDRA-8844:
--------------------------------------------

Update on current status: the DDL/CQL changes are in, as is the schema work. I nested the set of DCs in TableParams.
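
For illustration, a rough sketch of how that nesting might look (class, field, and method names below are my shorthand, not the actual patch):
{code:java}
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch only -- not the actual CASSANDRA-8844 code. It just
// illustrates hanging an immutable set of CDC-enabled datacenters off a
// table's params alongside the existing per-table options.
public final class TableParamsSketch
{
    // An empty set means CDC is disabled for this table.
    private final Set<String> cdcDatacenters;

    public TableParamsSketch(Set<String> cdcDatacenters)
    {
        this.cdcDatacenters = Collections.unmodifiableSet(new HashSet<>(cdcDatacenters));
    }

    public boolean cdcEnabledIn(String datacenter)
    {
        return cdcDatacenters.contains(datacenter);
    }
}
{code}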

When considering writing to a CL and a CDC log atomically, I run into the fact that it's [very hard to do|https://commons.apache.org/proper/commons-transaction/], if not impossible, on top of current filesystems.

If we keep a separate CDC log, we have the following design options, each with its own problems (both orderings are sketched in code after option 2):
h6. 1: Write to CL, then write to CDC log, throw UE if CDC Log write fails
# If CDC log write fails and we throw UE, we still have data in the CL that can be replayed on node restart with no CDC record, and we have CL/memtable mismatch.
## We could also replay CDC information on commit log replay, but we'd have to pull that data from 2 sources
## We duplicate a lot of work, even if it's just decorating CL serialization w/Role.

h6. 2: Write to CDC log, then write to CL; if the CL write fails, write a redaction record to the CDC log
# We want to write to the CL first. Adding another step before the CL write reduces the integrity and speed of getting data on-disk
# Still duplicating a lot of I/O
# Consumers have to deal with CDC record invalidation
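
To make the failure windows concrete, here is a minimal sketch of both orderings (all names are hypothetical, not actual Cassandra code). In each case the two appends cannot be made atomic, so a crash or I/O error between them leaves the two logs disagreeing:
{code:java}
import java.io.IOException;

// Hypothetical sketch of the two dual-log orderings discussed above.
final class DualLogSketch
{
    interface Log { void append(byte[] record) throws IOException; }

    // Option 1: CL first, then CDC. If the CDC append fails we throw UE,
    // but the mutation already sits in the CL and will be replayed on
    // restart with no matching CDC record.
    static void option1(Log commitLog, Log cdcLog, byte[] mutation) throws IOException
    {
        commitLog.append(mutation);
        cdcLog.append(mutation); // failure here == orphaned CL entry
    }

    // Option 2: CDC first, then CL. A CL failure forces a redaction record,
    // and consumers must now handle record invalidation.
    static void option2(Log commitLog, Log cdcLog, byte[] mutation) throws IOException
    {
        cdcLog.append(mutation);
        try
        {
            commitLog.append(mutation);
        }
        catch (IOException e)
        {
            cdcLog.append(redactionFor(mutation)); // and this append can fail too
            throw e;
        }
    }

    static byte[] redactionFor(byte[] mutation) { return mutation; } // placeholder
}
{code}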

----
If, however, we serialize Role information into the CL itself and have CDC clients parse CommitLogs, that opens up the following:
h6. 1: If on a table w/CDC, serialize Role into CL. Consumer reads CL for CDC information and marks the entire CL as complete when all records are processed
# A slow consumer could kill nodes by exhausting CL space.
# The node's strategy of flushing when the CL nears its limit would be invalidated: a slow consumer would keep used CL space pinned, and the node would loop on attempting to flush memtables to free CL space that flushing can no longer reclaim (see the sketch below).
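
A minimal sketch of why that flush loop never terminates when a consumer stalls (segment bookkeeping and names are hypothetical):
{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: if CL segments can only be recycled once a CDC
// consumer marks them processed, a stalled consumer starves the flush loop.
final class ClSpaceSketch
{
    static final class Segment
    {
        final long sizeBytes;
        volatile boolean flushed;       // all contained mutations flushed
        volatile boolean cdcProcessed;  // consumer finished with this segment
        Segment(long sizeBytes) { this.sizeBytes = sizeBytes; }
    }

    final Deque<Segment> segments = new ArrayDeque<>();
    final long spaceLimitBytes;

    ClSpaceSketch(long spaceLimitBytes) { this.spaceLimitBytes = spaceLimitBytes; }

    long usedBytes()
    {
        return segments.stream().mapToLong(s -> s.sizeBytes).sum();
    }

    // Invoked when the CL nears its size limit. Flushing marks segments
    // flushed, but a segment still cannot be freed until the CDC consumer
    // is done with it -- so with a stalled consumer, usedBytes() never
    // drops and the node spins here.
    void reclaimSpace()
    {
        while (usedBytes() > spaceLimitBytes)
        {
            flushOldestMemtables(); // sets flushed = true on covered segments
            segments.removeIf(s -> s.flushed && s.cdcProcessed);
        }
    }

    void flushOldestMemtables() { /* elided */ }
}
{code}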

And the final option that I *think* gives us the best of all worlds:
h6. 2: Add custom post-processing on memtable flush: if a table has CDC enabled, serialize Role into the CL on write. The consumer reads the CL for CDC info and marks an entire CL segment complete when its records are processed. On memtable flush, if a CL segment has not been processed by CDC, it is moved into a logical "CDC-overflow" pool with its own configurable size limit. Writes to CDC-enabled tables then either throw UE or drop the oldest CDC logs, depending on the .yaml setting.
# Can't think of any major problems with this. Yet.
(note: rather than "if not processed by CDC", it probably makes more sense to flag a CL segment as containing CDC records or not; during flush, if a segment is marked as containing CDC data, just leave the file for a CDC consumer to clean up and add its size to the CDC-overflow counter. A sketch of that hand-off follows.)
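
A minimal sketch of that flush-time hand-off (directory layout, config flag, and counter are assumptions, not a settled design):
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the proposed CDC-overflow hand-off at flush time.
final class CdcOverflowSketch
{
    private final Path overflowDir;            // consumer-visible pool; name assumed
    private final long overflowLimitBytes;     // configurable in the .yaml
    private final boolean failWritesWhenFull;  // yaml choice: throw UE vs. drop oldest
    private final AtomicLong overflowBytes = new AtomicLong();

    CdcOverflowSketch(Path overflowDir, long overflowLimitBytes, boolean failWritesWhenFull)
    {
        this.overflowDir = overflowDir;
        this.overflowLimitBytes = overflowLimitBytes;
        this.failWritesWhenFull = failWritesWhenFull;
    }

    // Called when a CL segment becomes recyclable after memtable flush.
    void onSegmentRecycle(Path segment, boolean containsCdcData) throws IOException
    {
        if (!containsCdcData)
        {
            Files.delete(segment); // normal recycle path; CDC never sees it
            return;
        }
        long size = Files.size(segment);
        // Leave the file for the CDC consumer instead of deleting it, and
        // account for it against the overflow limit.
        Files.move(segment, overflowDir.resolve(segment.getFileName()));
        overflowBytes.addAndGet(size);
    }

    // Checked on the write path for CDC-enabled tables.
    boolean acceptCdcWrite()
    {
        if (overflowBytes.get() < overflowLimitBytes)
            return true;
        if (failWritesWhenFull)
            return false; // caller surfaces this as an UnavailableException
        dropOldestOverflowSegment();
        return true;
    }

    private void dropOldestOverflowSegment() { /* elided; decrements overflowBytes */ }
}
{code}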

The final option gives us:
* ConsistencyLevel guarantee - successful writes to CL are guaranteed to have a corresponding logical CDC-entry, as it's the write itself
* performance profile - writes are slower only if CDC is enabled on the table, and then only by the time it takes to serialize the extra Role data
* the best "downtime" profile - a slow consumer impacts only writes to CDC-enabled tables, not the entire node
* low disk overhead - configurable space for the CDC-overflow pool, used only when consumers fall behind memtable flushing; we write only the Role delta rather than serializing an entire separate CL record

CASSANDRA-7075 would further mitigate the overhead of the extra serialization, but it should not be necessary for this approach to work.

Any thoughts on the last idea (or any of them really) are greatly appreciated.

> Change Data Capture (CDC)
> -------------------------
>
>                 Key: CASSANDRA-8844
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Tupshin Harper
>            Assignee: Joshua McKenzie
>            Priority: Critical
>             Fix For: 3.x
>
>
> "In databases, change data capture (CDC) is a set of software design patterns used to determine (and track) the data that has changed so that action can be taken using the changed data. Also, Change data capture (CDC) is an approach to data integration that is based on the identification, capture and delivery of the changes made to enterprise data sources."
> -Wikipedia
> As Cassandra is increasingly being used as the Source of Record (SoR) for mission-critical data in large enterprises, it is increasingly being called upon to act as the central hub of traffic and data flow to other systems. In order to address this general need, we (cc [~brianmhess]) propose implementing a simple data logging mechanism to enable per-table CDC patterns.
> h2. The goals:
> # Use CQL as the primary ingestion mechanism, in order to leverage its Consistency Level semantics, and in order to treat it as the single reliable/durable SoR for the data.
> # To provide a mechanism for implementing good, reliable (deliver-at-least-once, with possible mechanisms for deliver-exactly-once) continuous, semi-realtime feeds of mutations going into a Cassandra cluster.
> # To eliminate the development and operational burden on users of doing dual writes to other systems.
> # For users currently doing batch exports from a Cassandra system, to provide the opportunity to make those exports realtime with a minimum of coding.
> h2. The mechanism:
> We propose a durable logging mechanism that functions similar to a commitlog, with the following nuances:
> - Takes place on every node, not just the coordinator, so RF number of copies are logged.
> - Separate log per table.
> - Per-table configuration. Only tables that are specified as CDC_LOG would do any logging.
> - Per DC. We are trying to keep complexity to a minimum to make this an easy enhancement, but most likely use cases would prefer to implement CDC logging in only one (or a subset) of the DCs being replicated to.
> - In the critical path of ConsistencyLevel acknowledgment. Just as with the commitlog, failure to write to the CDC log should fail that node's write. If that means the requested consistency level was not met, then clients *should* experience UnavailableExceptions.
> - Be written in a Row-centric manner such that it is easy for consumers to reconstitute rows atomically.
> - Written in a simple format designed to be consumed *directly* by daemons written in non JVM languages
> h2. Nice-to-haves
> I strongly suspect that the following features will be asked for, but I also believe they can be deferred to a subsequent release while we gauge actual interest.
> - Multiple logs per table. This would make it easy to have multiple "subscribers" to a single table's changes. A workaround would be to create a forking daemon listener, but that's not a great answer.
> - Log filtering. Being able to apply filters, including UDF-based filters, would make Cassandra a much more versatile feeder into other systems and, again, reduce complexity that would otherwise need to be built into the daemons.
> h2. Format and Consumption
> - Cassandra would only write to the CDC log, and never delete from it. 
> - Cleaning up consumed logfiles would be the client daemon's responsibility.
> - Logfile size should probably be configurable.
> - Logfiles should be named with a predictable naming schema, making it trivial to process them in order.
> - Daemons should be able to checkpoint their work, and resume from where they left off. This means they would have to leave some file artifact in the CDC log's directory.
> - It should be possible to write a sophisticated daemon that could:
> -- Catch up, in written-order, even when it is multiple logfiles behind in processing
> -- Be able to continuously "tail" the most recent logfile and get low-latency(ms?) access to the data as it is written.
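> A minimal sketch of such a daemon's catch-up loop, assuming only the properties above (predictably named, ordered logfiles plus a checkpoint artifact in the CDC directory; all class and file names are hypothetical):
> {code:java}
> import java.io.IOException;
> import java.nio.charset.StandardCharsets;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.util.List;
> import java.util.stream.Collectors;
> import java.util.stream.Stream;
> 
> // Hypothetical consumer sketch, not part of any proposed Cassandra API.
> final class CdcTailer
> {
>     private final Path cdcDir;
>     private final Path checkpoint; // records the last fully-processed logfile
> 
>     CdcTailer(Path cdcDir)
>     {
>         this.cdcDir = cdcDir;
>         this.checkpoint = cdcDir.resolve("consumer.checkpoint");
>     }
> 
>     void run() throws IOException
>     {
>         String resumeFrom = Files.exists(checkpoint)
>                           ? new String(Files.readAllBytes(checkpoint), StandardCharsets.UTF_8).trim()
>                           : "";
>         List<Path> pending;
>         try (Stream<Path> files = Files.list(cdcDir))
>         {
>             // A lexicographic sort suffices because names are predictable and ordered.
>             pending = files.filter(p -> !p.equals(checkpoint))
>                            .sorted()
>                            .filter(p -> p.getFileName().toString().compareTo(resumeFrom) > 0)
>                            .collect(Collectors.toList());
>         }
>         for (Path log : pending)
>         {
>             process(log); // catch up in written order, even many files behind
>             Files.write(checkpoint, log.getFileName().toString().getBytes(StandardCharsets.UTF_8));
>             Files.delete(log); // cleanup is the daemon's responsibility
>         }
>         // A real daemon would now "tail" the newest file for ms-level latency.
>     }
> 
>     private void process(Path log) { /* parse rows, deliver downstream */ }
> }
> {code}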
> h2. Alternate approach
> In order to make consuming a change log easy and efficient to do with low latency, the following could supplement the approach outlined above:
> - Instead of writing to a logfile, Cassandra could, by default, expose a socket for a daemon to connect to, and from which it could pull each row.
> - Cassandra would have a limited buffer for storing rows, should the listener become backlogged, but it would immediately spill to disk in that case, never incurring large in-memory costs.
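> A minimal sketch of that bounded buffer with spill-to-disk (class names and spill format are assumptions, not a proposed implementation):
> {code:java}
> import java.io.IOException;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.StandardOpenOption;
> import java.util.concurrent.ArrayBlockingQueue;
> import java.util.concurrent.BlockingQueue;
> 
> // Hypothetical sketch: rows go to the connected listener while the buffer
> // has room; overflow is appended to a spill file rather than growing the heap.
> final class SpillingBuffer
> {
>     private final BlockingQueue<byte[]> buffer;
>     private final Path spillFile;
> 
>     SpillingBuffer(int capacity, Path spillFile)
>     {
>         this.buffer = new ArrayBlockingQueue<>(capacity);
>         this.spillFile = spillFile;
>     }
> 
>     void publish(byte[] row) throws IOException
>     {
>         // offer() is non-blocking, so a backlogged listener never stalls writes.
>         if (!buffer.offer(row))
>             Files.write(spillFile, row, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
>     }
> 
>     byte[] nextForListener() throws InterruptedException
>     {
>         return buffer.take(); // a catch-up pass would re-read spilled rows
>     }
> }
> {code}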
> h2. Additional consumption possibility
> With all of the above, still relevant:
> - Instead of (or in addition to) using the other logging mechanisms, use the CQL transport itself as a logger.
> - Extend the CQL protocol slightly so that rows of data can be returned to a listener that didn't explicitly make a query, but instead registered itself with Cassandra as a listener for a particular event type; in this case, the event type would be anything that would otherwise go to a CDC log.
> - If there is no listener for the event type associated with that log, or if that listener gets backlogged, the rows will again spill to the persistent storage.
> h2. Possible Syntax
> {code:sql}
> CREATE TABLE ... WITH CDC LOG
> {code}
> Pros: No syntax extensions.
> Cons: Doesn't make it easy to capture the various permutations of per-DC logging (I'm happy to be proven wrong). Also, the hypothetical multiple logs per table would break this.
> {code:sql}
> CREATE CDC_LOG mylog ON mytable WHERE MyUdf(mycol1, mycol2) = 5 with DCs={'dc1','dc3'}
> {code}
> Pros: Expressive and allows for easy DDL management of all aspects of CDC
> Cons: Syntax additions and added complexity, partly for features that might not be implemented.


