Posted to dev@hudi.apache.org by Syed Abdul Kather <in...@gmail.com> on 2020/01/13 08:39:05 UTC

Snapshot from cold storage and continue with latest data from binlog

Hi Team,

We have onboarded a few tables that have a really huge number of records
(~100M records). The plan is to enable the binlog for the database; that is
not an issue, as the stream can handle the load. But for loading the initial
snapshot, we have used Sqoop to import the whole table to S3.

What we need here:
Can we load the whole Sqooped dump into a Hudi table first, and then continue
with the stream (binlog data coming via Kafka)?
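
For reference, a minimal sketch of the intended bulk-load step with the Hudi
Spark datasource is below (assuming a spark-shell session; the S3 paths and
the field names id / updated_at / created_date are placeholders, not the real
schema):

// Bulk-load the Sqoop-exported Parquet dump into a Hudi copy-on-write table.
// Assumes a spark-shell / existing SparkSession named `spark`.
// Paths, record key, precombine and partition fields are placeholders.
import org.apache.spark.sql.SaveMode

val sqoopDump = spark.read.parquet("s3://bucket/sqoop-dump/my_table/")

sqoopDump.write
  .format("org.apache.hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "updated_at")
  .option("hoodie.datasource.write.partitionpath.field", "created_date")
  .mode(SaveMode.Overwrite)
  .save("s3://bucket/hudi/my_table")

Once the snapshot is loaded, the continuing Kafka stream would be applied to
the same table with upsert writes (or DeltaStreamer), keeping the same record
key and precombine field.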

            Thanks and Regards,
        S SYED ABDUL KATHER
         *Bigdata Lead@Tathastu.ai*
*           +91-7411011661*

Re: Snapshot from cold storage and continue with latest data from binlog

Posted by Shiyan Xu <xu...@gmail.com>.
Hi Syed, as Vinoth mentioned, the HoodieSnapshotCopier is meant for this
purpose.

You may also read more on RFC-9, which plans to introduce a
backward-compatible tool to replace HoodieSnapshotCopier:
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter
Unfortunately I'm not actively working on this. If you're interested, feel
free to pick it up. I'd be happy to help with that.


Re: Snapshot from cold storage and continue with latest data from binlog

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Syed,

Apologies for the delay. If you are using copy-on-write, you can look into
savepoints (although I realize it's only exposed at the RDD API level). We
do have a tool called HoodieSnapshotCopier in hoodie-utilities, to take
periodic copies/snapshots of a table for backup purposes, as of a given
commit. Raymond (if you are here) has an RFC to enhance that even further.
Running the copier periodically (please test it first, since it's not used
in OSS that much, IIUC), say every day, would achieve your goals, I believe.

https://github.com/apache/incubator-hudi/blob/c2c0f6b13d5b72b3098ed1b343b0a89679f854b3/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotCopier.java

Any issues in the tool should be simple to fix. The tool itself is a couple
hundred lines, that's all.
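
For illustration only, not the HoodieSnapshotCopier itself: the rough idea of
a snapshot copy can be sketched from a spark-shell by reading the latest view
of a copy-on-write table and writing it to a backup path. Paths below are
placeholders; the actual copier linked above also carries over the commit
metadata, IIUC, so the copy remains a usable Hudi table.

// Rough illustration of a snapshot copy, not the HoodieSnapshotCopier tool.
// Assumes a spark-shell session; paths are placeholders, and the partition
// glob depth (/*/*) depends on how the table is partitioned.
val latestSnapshot = spark.read
  .format("org.apache.hudi")
  .load("s3://bucket/hudi/my_table/*/*")

latestSnapshot.write
  .mode("overwrite")
  .parquet("s3://backup-bucket/my_table_snapshot/" + java.time.LocalDate.now())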

Thanks
Vinoth


Re: Snapshot from cold storage and continue with latest data from binlog

Posted by Syed Abdul Kather <in...@gmail.com>.
Yes. Also for restoring the data from cold storage.

Use case here:
We stream data using Debezium and push it to Kafka, with a retention of 7
days in Kafka. In case the destination Hudi table crashes, or we need to
repopulate it, we need a way to restore the data.

Thanks and Regards,
S SYED ABDUL KATHER
*Data platform Lead @ Tathastu.ai*

*+91 - 7411011661*



Re: Snapshot from cold storage and continue with latest data from binlog

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Syed,

If I follow correctly, are you asking how to do a bulk load first and then
use DeltaStreamer on top of that dataset to apply the binlogs from Kafka?
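
For reference, a very rough sketch of that second step as a plain Spark
upsert is below. DeltaStreamer with a Kafka source would do this end to end
and continuously; here the input path, record key, precombine and partition
fields are placeholders, and the binlog events are assumed to be already
flattened into the table schema.

// Rough sketch: apply parsed binlog records as upserts on the bootstrapped
// table. Assumes a spark-shell session; the input path and the record key,
// precombine and partition fields are placeholders and must match the
// bulk-loaded table.
import org.apache.spark.sql.SaveMode

val changes = spark.read.json("s3://bucket/binlog-batch/")

changes.write
  .format("org.apache.hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "updated_at")
  .option("hoodie.datasource.write.partitionpath.field", "created_date")
  .mode(SaveMode.Append)
  .save("s3://bucket/hudi/my_table")

The only difference from the initial bulk load is the write operation
(upsert instead of bulk_insert) and append mode.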

Thanks
Vinoth
