Posted to users@kafka.apache.org by Sunny Shah <su...@tinyowl.co.in> on 2016/02/06 11:37:37 UTC

Re: MongoDB Kafka Connect driver

Hello Everyone,

Thanks a lot for your valuable responses.

We will use an external database to store the key and Kafka offset. We
won't set any preference on which database to use; we will leave that
choice to the driver user by using a flexible data-access model like
Apache MetaModel.
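
A minimal sketch of what that could look like with MetaModel, assuming a
JDBC-backed store (the table and column names below are made up, and error
handling is omitted):

import java.sql.Connection;
import org.apache.metamodel.DataContextFactory;
import org.apache.metamodel.UpdateCallback;
import org.apache.metamodel.UpdateScript;
import org.apache.metamodel.UpdateableDataContext;
import org.apache.metamodel.schema.Table;

public class OffsetStore {
    private final UpdateableDataContext context;
    private final Table table;

    public OffsetStore(Connection connection) {
        // MetaModel abstracts the backing store; JDBC is just one option
        this.context = (UpdateableDataContext)
                DataContextFactory.createJdbcDataContext(connection);
        this.table = context.getDefaultSchema().getTableByName("connect_offsets");
    }

    // Record the Kafka offset of the last full value written for a document key.
    public void save(final String documentKey, final long kafkaOffset) {
        context.executeUpdate(new UpdateScript() {
            @Override
            public void run(UpdateCallback callback) {
                callback.insertInto(table)
                        .value("doc_key", documentKey)
                        .value("kafka_offset", kafkaOffset)
                        .execute();
            }
        });
    }
}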

@James, even for the MongoDB Connect driver, the data won't be in a
consistent state until time T2.

On Sat, Jan 30, 2016 at 3:43 AM, James Cheng <jc...@tivo.com> wrote:

> Not sure if this will help anything, but just throwing it out there.
>
> The Maxwell and mypipe projects both do CDC from MySQL and support
> bootstrapping. The way they do it is kind of "eventually consistent".
>
> 1) At time T1, record the coordinates of the end of the binlog as of T1.
> 2) At time T2, do a full dump of the database into Kafka.
> 3) Connect back to the binlog at the coordinates recorded in step #1, and
> emit all those records into Kafka.
>
> As Jay mentioned, MySQL supports full row images. At the start of step #3,
> the Kafka topic contains all rows as of time T2. It is possible that during
> step #3 you will emit rows that changed between T1 and T2. From the point
> of view of the consumer of the Kafka topic, they would see rows that went
> "back in time". However, as step #3 progresses and the consumer keeps
> reading, those rows would eventually converge to their final state.
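>
> In rough pseudo-Java, the sequence is something like this (the binlog
> client and dump helper are placeholders, not the actual Maxwell/mypipe
> APIs):
>
> // 1) remember where the binlog ends as of T1
> BinlogPosition startPosition = binlogClient.currentPosition();
>
> // 2) full dump: every row goes to Kafka, keyed by primary key
> for (Row row : dumpTable("my_table")) {
>     producer.send(new ProducerRecord<>("my_table", row.key(), row.asJson()));
> }
>
> // 3) replay the binlog from the T1 position; consumers may briefly see rows
> //    "go back in time", but they converge once the replay passes T2
> for (RowChange change : binlogClient.replayFrom(startPosition)) {
>     producer.send(new ProducerRecord<>("my_table", change.key(), change.rowImage()));
> }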
>
> Maxwell: https://github.com/zendesk/maxwell
> mypipe: https://github.com/mardambey/mypipe
>
> Does that idea help in any way? Btw, a reason it is done this way is that
> it is "difficult" to do #1 and #2 above in a coordinated way without
> locking the database or adding additional outside dependencies (LVM
> snapshots being a specific one).
>
> Btw, I glanced at some docs about the MongoDB oplog. It seems that each
> oplog entry contains
> 1) A way to identify the document that the change applies to.
> 2) A series of MongoDB commands (set, unset) to alter the document in #1
> so that it becomes the new document.
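>
> For reference, an update entry in the oplog looks roughly like this
> (illustrative values only):
>
> { "ts" : Timestamp(1454746657, 1), "op" : "u", "ns" : "mydb.users",
>   "o2" : { "_id" : ObjectId("...") },
>   "o"  : { "$set" : { "address.city" : "Mumbai" } } }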
>
> Thoughts:
> For #1, does it identify a particular "version" of a document? (I don't
> know much about MongoDB.) If so, you might be able to use it to determine
> whether the change should even be applied to the object.
> For #2, doesn't that mean you'll need to "understand" MongoDB's syntax and
> commands? Although maybe it is simply sets/unsets/deletes, in which case
> it's maybe pretty simple.
>
> -James
>
> > On Jan 29, 2016, at 9:39 AM, Jay Kreps <ja...@confluent.io> wrote:
> >
> > Also, most databases provide a "full logging" option that lets you
> > capture the whole row in the log (I know Oracle and MySQL have this), but
> > it sounds like Mongo doesn't yet. That would be the ideal solution.
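> >
> > (For MySQL that is the row-based binlog with full row images, roughly
> > this in my.cnf:)
> >
> > [mysqld]
> > log-bin          = mysql-bin
> > binlog_format    = ROW
> > binlog_row_image = FULL   # log the complete row image for every change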
> >
> > -Jay
> >
> > On Fri, Jan 29, 2016 at 9:38 AM, Jay Kreps <ja...@confluent.io> wrote:
> >
> >> Ah, agreed. This approach is actually quite common in change capture,
> >> though. For many use cases getting the final value is preferable to
> >> getting intermediates. The exception is usually if you want to do
> >> analytics on something like the number of changes.
> >>
> >> On Fri, Jan 29, 2016 at 9:35 AM, Ewen Cheslack-Postava <ewen@confluent.io>
> >> wrote:
> >>
> >>> Jay,
> >>>
> >>> You can query after the fact, but you're not necessarily going to get
> >>> the same value back. There could easily be dozens of changes to the
> >>> document in the oplog, so the delta you see may not even make sense
> >>> given the current state of the document. Even if you can apply the
> >>> delta, you'd still be seeing data that is newer than the update. You
> >>> can of course take this shortcut, but it won't give correct results.
> >>> And if the data has been deleted since then, you won't even be able to
> >>> write the full record... As far as I know, the way the op log is
> >>> exposed won't let you do something like pin a query to the state of the
> >>> DB at a specific point in the op log, and you may be reading from the
> >>> beginning of the op log, so I don't think there's a way to get correct
> >>> results by just querying the DB for the full documents.
> >>>
> >>> Strictly speaking you don't need to get all the data in memory, you
> >>> just need a record of the current set of values somewhere. This is what
> >>> I was describing following those two options -- if you do an initial
> >>> dump to Kafka, you could track only offsets in memory and read back
> >>> full values as needed to apply deltas, but this of course requires
> >>> random reads into your Kafka topic (but may perform fine in practice
> >>> depending on the workload).
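> >>>
> >>> Something along these lines (the topic name and applyDelta are
> >>> placeholders; a real connector would also write the merged value back
> >>> and update the stored offset):
> >>>
> >>> // in-memory index: document _id -> offset of its last full value in Kafka
> >>> Map<String, Long> lastFullValueOffset = new HashMap<>();
> >>>
> >>> String handleDelta(String id, String delta, KafkaConsumer<String, String> c) {
> >>>     TopicPartition tp = new TopicPartition("mongo.documents", 0);
> >>>     c.assign(Collections.singletonList(tp));
> >>>     c.seek(tp, lastFullValueOffset.get(id));  // the "random read" into the topic
> >>>     String previous = c.poll(1000).iterator().next().value();
> >>>     return applyDelta(previous, delta);       // placeholder merge logic
> >>> }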
> >>>
> >>> -Ewen
> >>>
> >>> On Fri, Jan 29, 2016 at 9:12 AM, Jay Kreps <ja...@confluent.io> wrote:
> >>>
> >>>> Hey Ewen, how come you need to get it all in memory for approach (1)?
> >>>> I guess the obvious thing to do would just be to query for the record
> >>>> after-image when you get the diff--e.g. just read a batch of changes
> >>>> and multi-get the final values. I don't know how bad the overhead of
> >>>> this would be...batching might reduce it a fair amount. The guarantees
> >>>> for this are slightly different than the pure oplog too (you get the
> >>>> current value, not necessarily every intermediate) but that should be
> >>>> okay for most uses.
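> >>>>
> >>>> With the Mongo Java driver, the batch lookup might look roughly like
> >>>> this (collection and field names are just examples):
> >>>>
> >>>> // collect the _ids touched by a batch of oplog deltas...
> >>>> List<ObjectId> ids = new ArrayList<>();
> >>>> for (Document delta : batch) {
> >>>>     ids.add(((Document) delta.get("o2")).getObjectId("_id"));
> >>>> }
> >>>> // ...then multi-get the current after-images in a single query
> >>>> FindIterable<Document> afterImages =
> >>>>     db.getCollection("users").find(Filters.in("_id", ids));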
> >>>>
> >>>> -Jay
> >>>>
> >>>> On Fri, Jan 29, 2016 at 8:54 AM, Ewen Cheslack-Postava <ewen@confluent.io>
> >>>> wrote:
> >>>>
> >>>>> Sunny,
> >>>>>
> >>>>> As I said on Twitter, I'm stoked to hear you're working on a Mongo
> >>>>> connector! It struck me as a pretty natural source to tackle since it
> >>>>> does such a nice job of cleanly exposing the op log.
> >>>>>
> >>>>> Regarding the problem of only getting deltas, unfortunately there is
> >>>>> not a trivial solution here -- if you want to generate the full
> >>>>> updated record, you're going to have to have a way to recover the
> >>>>> original document.
> >>>>>
> >>>>> In fact, I'm curious how you were thinking of even bootstrapping. Are
> >>>>> you going to do a full dump and then start reading the op log? Is
> >>>>> there a good way to do the dump and figure out the exact location in
> >>>>> the op log at which the query generating the dump was performed? I
> >>>>> know that internally Mongo effectively does these two steps, but I'm
> >>>>> not sure if the necessary info is exposed via normal queries.
> >>>>>
> >>>>> If you want to reconstitute the data, I can think of a couple of
> >>>>> options:
> >>>>>
> >>>>> 1. Try to reconstitute inline in the connector. This seems difficult
> >>>>> to make work in practice. At some point you basically have to query
> >>>>> for the entire data set to bring it into memory, and then the
> >>>>> connector is effectively just applying the deltas to its in-memory
> >>>>> copy and generating one output record containing the full document
> >>>>> each time it applies an update.
> >>>>> 2. Make the connector send just the updates and have a separate
> >>>>> stream processing job perform the reconstitution and send to another
> >>>>> topic. In this case, the first topic should not be compacted, but the
> >>>>> second one could be.
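> >>>>>
> >>>>> A bare-bones sketch of that second job (topic name and applyDelta are
> >>>>> placeholders; a real job would keep its state in a persistent store
> >>>>> rather than a map):
> >>>>>
> >>>>> Map<String, Document> current = new HashMap<>();
> >>>>> while (true) {
> >>>>>     // read raw deltas from the (non-compacted) first topic
> >>>>>     for (ConsumerRecord<String, Document> rec : consumer.poll(100)) {
> >>>>>         Document full = applyDelta(current.get(rec.key()), rec.value());
> >>>>>         current.put(rec.key(), full);
> >>>>>         // emit the full document to the compacted second topic
> >>>>>         producer.send(new ProducerRecord<>("mongo.documents", rec.key(), full));
> >>>>>     }
> >>>>> }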
> >>>>>
> >>>>> Unfortunately, without additional hooks into the database, there's
> >>>>> not much you can do besides this pretty heavyweight process. There
> >>>>> may be some tricks you can use to reduce the amount of memory used
> >>>>> during the process (e.g. keep a small cache of actual records and for
> >>>>> the rest only store Kafka offsets for the last full value, performing
> >>>>> a (possibly expensive) random read as necessary to get the full
> >>>>> document value back), but to get full correctness you will need to
> >>>>> perform this process.
> >>>>>
> >>>>> In terms of Kafka Connect supporting something like this, I'm not
> >>>>> sure how general it could be made, or whether you even want to
> >>>>> perform the process inline with the Kafka Connect job. If it's an
> >>>>> issue that repeatedly arises across a variety of systems, then we
> >>>>> should consider how to address it more generally.
> >>>>>
> >>>>> -Ewen
> >>>>>
> >>>>> On Tue, Jan 26, 2016 at 8:43 PM, Sunny Shah <su...@tinyowl.co.in>
> >>>>> wrote:
> >>>>>
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> We are trying to write a Kafka Connect connector for MongoDB. The
> >>>>>> issue is that MongoDB does not provide the entire changed document
> >>>>>> for update operations; it just provides the modified fields.
> >>>>>>
> >>>>>> If Kafka allowed custom log compaction, then it would be possible to
> >>>>>> eventually merge an entire document and subsequent updates to create
> >>>>>> an entire record again.
> >>>>>>
> >>>>>> As Ewen pointed out to me on Twitter, this is not possible, so what
> >>>>>> is the Kafka Connect way of solving this issue?
> >>>>>>
> >>>>>> @Ewen, thanks a lot for a really quick answer on Twitter.
> >>>>>>
> >>>>>> --
> >>>>>> Thanks and Regards,
> >>>>>> Sunny
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Thanks,
> >>>>> Ewen
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Thanks,
> >>> Ewen
> >>>
> >>
> >>
>



-- 
Thanks and Regards,
 Sunny
