Posted to user@cassandra.apache.org by Andrew Prudhomme <as...@yelp.com> on 2018/01/31 01:10:39 UTC

CDC usability and future development

Hi all,

We are currently designing a system that allows our Cassandra clusters to
produce a stream of data updates. Naturally, we have been evaluating if CDC
can aid in this endeavor. We have found several challenges in using CDC for
this purpose.

CDC provides only the mutation as opposed to the full column value, which
tends to be of limited use for us. Applications might want to know the full
column value, without having to issue a read back. We also see value in
being able to publish the full column value both before and after the
update. This is especially true when deleting a column since this stream
may be joined with others, or consumers may require other fields to
properly process the delete.
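
To make the before/after requirement concrete, here is a minimal sketch of a consumer that keeps its own copy of each row and derives both images from a partial update. It is only an illustration under stated assumptions: `RowStore`, its `apply` method, and the dict-based row shape are hypothetical and are not part of Cassandra or its CDC feature. It also ignores write timestamps and conflict resolution, which a real consumer would need to handle.

```python
from typing import Dict, Optional, Tuple

Row = Dict[str, object]


class RowStore:
    """Hypothetical in-memory stand-in for a downstream copy of the table."""

    def __init__(self) -> None:
        self._rows: Dict[str, Row] = {}

    def apply(self, key: str, delta: Optional[Row]) -> Tuple[Optional[Row], Optional[Row]]:
        """Apply a partial update (or a delete when delta is None).

        Returns (before, after) so a publisher can emit both images
        without reading back from the source cluster.
        """
        stored = self._rows.get(key)
        before = dict(stored) if stored is not None else None
        if delta is None:
            # Delete: the "before" image still carries every known column.
            self._rows.pop(key, None)
            return before, None
        after = dict(stored or {})
        after.update(delta)  # only the mutated columns change
        self._rows[key] = after
        return before, dict(after)
```

With a store like this, a delete event can still be published together with the other fields of the deleted row, which is the case described above.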

Additionally, there is some difficulty with processing CDC itself such as:
- Updates not being immediately available (addressed by CASSANDRA-12148)
- Each node providing an independent stream of updates that must be
unified and deduplicated
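
As a rough illustration of the second point, the sketch below merges several per-node streams and drops the copies that arrive from other replicas. The `Change` record, `fingerprint`, and `unify` are hypothetical names introduced here, not Cassandra APIs. The sketch assumes each per-node stream is already ordered by write timestamp and that replicas of the same write carry identical payloads, neither of which is guaranteed in practice.

```python
import hashlib
import heapq
from typing import Iterable, Iterator, NamedTuple


class Change(NamedTuple):
    # Hypothetical record shape; real commit log entries are more complex.
    timestamp: int       # write timestamp in microseconds
    partition_key: str
    payload: bytes       # serialized mutation body


def fingerprint(change: Change) -> str:
    """Digest that is identical for the same write seen on different replicas."""
    h = hashlib.sha256()
    h.update(str(change.timestamp).encode())
    h.update(b"\x00")
    h.update(change.partition_key.encode())
    h.update(b"\x00")
    h.update(change.payload)
    return h.hexdigest()


def unify(streams: Iterable[Iterable[Change]]) -> Iterator[Change]:
    """Merge per-node streams by timestamp and skip replica duplicates.

    Assumption: every input stream is sorted by timestamp.
    The `seen` set grows without bound here; a real consumer would
    expire old fingerprints after some time window.
    """
    seen = set()
    for change in heapq.merge(*streams, key=lambda c: c.timestamp):
        fp = fingerprint(change)
        if fp in seen:
            continue
        seen.add(fp)
        yield change
```

With a replication factor of 3, each write can show up in up to three node streams, so a consumer of this kind would expect to discard roughly two of every three entries it reads.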

Our question is, what is the vision for CDC development? The current
implementation could work for some use cases, but is a ways from a general
streaming solution. I understand that the nature of Cassandra makes this
quite complicated, but are there any thoughts or desires on the future
direction of CDC?

Thanks

Re: CDC usability and future development

Posted by Jay Zhuang <zj...@uber.com>.

We did a POC that exposes the CDC feature through an interface (
https://github.com/ngcc/ngcc2017/blob/master/CassandraDataIngestion.pdf),
so the user doesn't have to read the commit log directly. We deployed the
change to a test cluster and are running more tests against production
traffic. We will send out the design proposal, the POC, and the test results
soon.

We have the same problem getting the full row value in our CDC downstream
pipeline. We used to do a read back; right now our CDC downstream stores all
the data (in Hive), so there is no need to read back. As for the Cassandra
CDC feature, I don't think it should provide the full row value, as it is
supposed to be Change Data Capture. But range deletes are still a problem,
as the deleted data cannot be read back. So we're proposing an option to
expand the range delete in CDC if the user really wants it.
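
To show what "expanding" a range delete means, here is a small sketch that turns one range tombstone into one delete event per affected row. It is not the proposed implementation: `expand_range_delete` is a hypothetical helper, clustering keys are simplified to integers, and the list of existing keys is assumed to be available and sorted wherever the expansion runs.

```python
import bisect
from typing import List, Tuple


def expand_range_delete(clustering_keys: List[int], start: int, end: int) -> List[Tuple[str, int]]:
    """Return one ('delete', key) event per existing row with start <= key <= end.

    Assumption: clustering_keys is sorted ascending and holds the keys
    that existed in the partition before the range delete was applied.
    """
    lo = bisect.bisect_left(clustering_keys, start)
    hi = bisect.bisect_right(clustering_keys, end)
    return [("delete", k) for k in clustering_keys[lo:hi]]
```

The expansion has to happen while the affected rows are still known, which is why it cannot be done by a read back after the delete has been applied.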


On Wed, Jan 31, 2018 at 7:32 AM Josh McKenzie <jm...@apache.org> wrote:

> CDC provides only the mutation as opposed to the full column value, which
>> tends to be of limited use for us. Applications might want to know the full
>> column value, without having to issue a read back. We also see value in
>> being able to publish the full column value both before and after the
>> update. This is especially true when deleting a column since this stream
>> may be joined with others, or consumers may require other fields to
>> properly process the delete.
>
>
> Philosophically, my first pass at the feature prioritized minimizing
> impact to node performance first and usability second, punting a lot of the
> de-duplication and RbW implications of having full column values, or
> materializing stuff off-heap for consumption from a user and flagging as
> persisted to disk etc, for future work on the feature. I don't personally
> have any time to devote to moving the feature forward now but as Jeff
> indicates, Jay and Simon are both active in the space and taking up the
> torch.
>
>
> On Tue, Jan 30, 2018 at 8:35 PM, Jeff Jirsa <jj...@gmail.com> wrote:
>
>> Here's a deck of some proposed additions, discussed at one of the NGCC
>> sessions last fall:
>>
>> https://github.com/ngcc/ngcc2017/blob/master/CassandraDataIngestion.pdf
>>
>>
>>
>> On Tue, Jan 30, 2018 at 5:10 PM, Andrew Prudhomme <as...@yelp.com> wrote:
>>
>> > Hi all,
>> >
>> > We are currently designing a system that allows our Cassandra clusters
>> to
>> > produce a stream of data updates. Naturally, we have been evaluating if
>> CDC
>> > can aid in this endeavor. We have found several challenges in using CDC
>> for
>> > this purpose.
>> >
>> > CDC provides only the mutation as opposed to the full column value,
>> which
>> > tends to be of limited use for us. Applications might want to know the
>> full
>> > column value, without having to issue a read back. We also see value in
>> > being able to publish the full column value both before and after the
>> > update. This is especially true when deleting a column since this stream
>> > may be joined with others, or consumers may require other fields to
>> > properly process the delete.
>> >
>> > Additionally, there is some difficulty with processing CDC itself such
>> as:
>> > - Updates not being immediately available (addressed by CASSANDRA-12148)
>> > - Each node providing an independent stream of updates that must be
>> > unified and deduplicated
>> >
>> > Our question is, what is the vision for CDC development? The current
>> > implementation could work for some use cases, but is a ways from a
>> general
>> > streaming solution. I understand that the nature of Cassandra makes this
>> > quite complicated, but are there any thoughts or desires on the future
>> > direction of CDC?
>> >
>> > Thanks
>> >
>> >
>>
>
>

Re: CDC usability and future development

Posted by Josh McKenzie <jm...@apache.org>.

>
> CDC provides only the mutation as opposed to the full column value, which
> tends to be of limited use for us. Applications might want to know the full
> column value, without having to issue a read back. We also see value in
> being able to publish the full column value both before and after the
> update. This is especially true when deleting a column since this stream
> may be joined with others, or consumers may require other fields to
> properly process the delete.


Philosophically, my first pass at the feature prioritized minimizing impact
to node performance first and usability second. That meant punting a lot of
the de-duplication and RbW (read-before-write) implications of having full
column values, as well as materializing data off-heap for user consumption
and flagging it as persisted to disk, to future work on the feature. I don't
personally have any time to devote to moving the feature forward now, but as
Jeff indicates, Jay and Simon are both active in the space and taking up the
torch.


On Tue, Jan 30, 2018 at 8:35 PM, Jeff Jirsa <jj...@gmail.com> wrote:

> Here's a deck of some proposed additions, discussed at one of the NGCC
> sessions last fall:
>
> https://github.com/ngcc/ngcc2017/blob/master/CassandraDataIngestion.pdf
>
>
>
> On Tue, Jan 30, 2018 at 5:10 PM, Andrew Prudhomme <as...@yelp.com> wrote:
>
> > Hi all,
> >
> > We are currently designing a system that allows our Cassandra clusters to
> > produce a stream of data updates. Naturally, we have been evaluating if
> CDC
> > can aid in this endeavor. We have found several challenges in using CDC
> for
> > this purpose.
> >
> > CDC provides only the mutation as opposed to the full column value, which
> > tends to be of limited use for us. Applications might want to know the
> full
> > column value, without having to issue a read back. We also see value in
> > being able to publish the full column value both before and after the
> > update. This is especially true when deleting a column since this stream
> > may be joined with others, or consumers may require other fields to
> > properly process the delete.
> >
> > Additionally, there is some difficulty with processing CDC itself such
> as:
> > - Updates not being immediately available (addressed by CASSANDRA-12148)
> > - Each node providing an independent stream of updates that must be
> > unified and deduplicated
> >
> > Our question is, what is the vision for CDC development? The current
> > implementation could work for some use cases, but is a ways from a
> general
> > streaming solution. I understand that the nature of Cassandra makes this
> > quite complicated, but are there any thoughts or desires on the future
> > direction of CDC?
> >
> > Thanks
> >
> >
>

Re: CDC usability and future development

Posted by Jeff Jirsa <jj...@gmail.com>.

Here's a deck of some proposed additions, discussed at one of the NGCC
sessions last fall:

https://github.com/ngcc/ngcc2017/blob/master/CassandraDataIngestion.pdf



On Tue, Jan 30, 2018 at 5:10 PM, Andrew Prudhomme <as...@yelp.com> wrote:

> Hi all,
>
> We are currently designing a system that allows our Cassandra clusters to
> produce a stream of data updates. Naturally, we have been evaluating if CDC
> can aid in this endeavor. We have found several challenges in using CDC for
> this purpose.
>
> CDC provides only the mutation as opposed to the full column value, which
> tends to be of limited use for us. Applications might want to know the full
> column value, without having to issue a read back. We also see value in
> being able to publish the full column value both before and after the
> update. This is especially true when deleting a column since this stream
> may be joined with others, or consumers may require other fields to
> properly process the delete.
>
> Additionally, there is some difficulty with processing CDC itself such as:
> - Updates not being immediately available (addressed by CASSANDRA-12148)
> - Each node providing an independent stream of updates that must be
> unified and deduplicated
>
> Our question is, what is the vision for CDC development? The current
> implementation could work for some use cases, but is a ways from a general
> streaming solution. I understand that the nature of Cassandra makes this
> quite complicated, but are there any thoughts or desires on the future
> direction of CDC?
>
> Thanks
>
>
