Posted to dev@hudi.apache.org by Yixue Zhu <yx...@gmail.com> on 2020/05/14 20:19:04 UTC

preCombine API enhancement for Mongo Oplog integration

We are working on Mongo Oplog integration with Hudi, to stream Mongo
updates to Hudi tables.

There are 4 Mongo oplog operations we need to handle: CRUD (create,
read, update, delete).

Currently, Hudi handles create/read and delete well with the existing
preCombine API in the HoodieRecordPayload class, but not update. In
particular, an update operation contains a "patch" field, an extended
JSON document describing updates keyed by dot-separated field paths.
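For concreteness, here is a hypothetical sketch of such an update entry. The field names ("op", "o", "o2", "$set") follow Mongo's oplog conventions, but the record key and values are made up for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OplogUpdateExample {
    // Builds a toy oplog "update" entry. The dot-separated paths inside
    // "$set" are exactly the part that preCombine would need the Avro
    // schema to interpret and apply against the older record.
    public static Map<String, Object> sampleUpdate() {
        Map<String, Object> set = new LinkedHashMap<>();
        set.put("address.city", "Seattle");        // dot-separated field path
        set.put("profile.age", 42);

        Map<String, Object> patch = new LinkedHashMap<>();
        patch.put("$set", set);                    // the partial "patch" document

        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("op", "u");                      // "u" = update
        entry.put("o", patch);                     // the patch to apply
        entry.put("o2", Map.of("_id", "abc123")); // key of the record being updated
        return entry;
    }

    public static void main(String[] args) {
        System.out.println(sampleUpdate());
    }
}
```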

We need to pass the Avro schema to the preCombine API for this to work:

Even though the BaseAvroPayload constructor accepts a GenericRecord,
which carries an Avro schema reference, it materializes the
GenericRecord into bytes to support serialization/deserialization by
ExternalSpillableMap, so the schema is no longer available at
preCombine time.
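A minimal sketch of the proposed enhancement, assuming a new preCombine overload that also receives the record schema. To keep the sketch compilable without the Hudi/Avro jars, `Object schema` stands in for org.apache.avro.Schema, and all interface/class names below are illustrative, not Hudi's actual API:

```java
public class PreCombineSketch {

    interface SchemaAwarePayload<T> {
        // Existing-style preCombine: choose between two payloads with the same key.
        T preCombine(T oldValue);

        // Proposed overload: with the schema in hand, a Mongo-oplog payload could
        // deserialize its stored bytes, apply the "patch" of dot-separated field
        // paths to the older record, and return a merged payload, instead of
        // merely picking one of the two inputs.
        default T preCombine(T oldValue, Object schema) {
            return preCombine(oldValue); // default: fall back to the old behavior
        }
    }

    // Trivial payload keyed by a timestamp, using the old pick-latest semantics.
    static class TimestampPayload implements SchemaAwarePayload<TimestampPayload> {
        final long ts;
        TimestampPayload(long ts) { this.ts = ts; }
        public TimestampPayload preCombine(TimestampPayload old) {
            return this.ts >= old.ts ? this : old;
        }
    }

    public static void main(String[] args) {
        TimestampPayload a = new TimestampPayload(1);
        TimestampPayload b = new TimestampPayload(2);
        // Schema-aware call falls back to pick-latest for a legacy payload.
        System.out.println(b.preCombine(a, null) == b);
    }
}
```

A default method like this would let existing payload classes keep working unchanged while a Mongo-oplog payload overrides the two-argument form.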


Is there any concern or objection to this? In other words, have I overlooked something?

I have created https://issues.apache.org/jira/browse/HUDI-898 to track it.

Best,
Yixue

-- 
Best Regards,
yixue

Re: preCombine API enhancement for Mongo Oplog integration

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Yixue,

Thanks for starting this thread! I have actually been thinking about whether
we should just deprecate preCombine() and simply use combineAndGetUpdateValue()
there as well. But it boiled down to implementation efficiency: having the
entire payload during preCombine() helps us keep the actual data serialized
during shuffles (much more compact than shuffling Avro, in my experience).

I am fine with this addition overall. We can deprecate the existing
preCombine and remove it over the next few releases.

Let's wait for others to chime in as well

Thanks
Vinoth


On Thu, May 14, 2020 at 1:20 PM Yixue Zhu <yx...@gmail.com> wrote:

> We are working on Mongo Oplog integration with Hudi, to stream Mongo
> updates to Hudi tables.
>
> There are 4 Mongo OpLog operations we need to handle, CRUD (create,
> read, update, delete).
>
> Currently Hudi handle create/read, delete, but not update well with
> existing preCombine API in HoodieRecordPayload class. In particularly,
> Update operation contains "patch" field, which is extended Json
> describing update for dot separated field paths.
>
> We need to pass Avro schema to preCombine API for it to work:
>
> Even though BaseAvroPayload constructor accepts GenericRecord, which
> has Avro schema reference, but it materialize GenericRecord to bytes,
> to support serialization/deserialization by ExternalSpillableMap.
>
>
> Is there concern/objection to this? in other words, have I overlooked
> something?
>
> I have created https://issues.apache.org/jira/browse/HUDI-898 to track it.
>
> Best,
> Yixue
>
> --
> Best Regards,
> yixue
>