You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hudi.apache.org by Venki g <ve...@gmail.com> on 2020/01/20 23:50:30 UTC

HoodieDeltaStreamerException during upsert with DeltaStreamer.sync()

Hi,

I am using a spark job to upsert the incremental delta files from S3 into
Hudi storage using HoodieDeltaStreamer.sync() API , The incremental spark
job is failing with below exception

java.lang.RuntimeException:
org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to
find previous checkpoint. Please double check if this table was indeed
built via delta streamer
at com.emr.java.HiveDeltaStreamer.loadData(HiveDeltaStreamer.java:36)
at com.emr.java.HudiDataLoadJob.run(HudiDataLoadJob.java:28)
at com.emr.java.HiveDeltaStreamer.main(HiveDeltaStreamer.java:19)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
Caused by:
org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to
find previous checkpoint. Please double check if this table was indeed
built via delta streamer
at
org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:252)
at
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:214)
at
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:120)
at com.emr.java.HiveDeltaStreamer.loadData(HiveDeltaStreamer.java:30)
... 7 more

I found the recent commit file does not have
""deltastreamer.checkpoint.key" in the commit file. I checked the second
last commit file and it has this key.

Link to driver log(has delta streamer config passed and other info) -
https://pastebin.pl/view/raw/9606beb0

Link to most recent commit - https://pastebin.pl/view/raw/9606beb0

When this happened for the first time, I was able to rollback the latest
commit and loaded the data again and went past this exception. Since, this
exception has started occurring again, I would like to understand the issue
here and find the fix if any.

Would highly appreciate any help on this.

Thanks
Venkatesh

Re: HoodieDeltaStreamerException during upsert with DeltaStreamer.sync()

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Venkatesh,

It should keep writing an empty commit with checkpoints.. I think your
situation is different though.. You are seeing a commit wth no checkpoint
key.
From the code, we always set this.

Since I see com.emr.java on the stack trace, can any of the amazon folks
confirm any code changes on the EMR code that is getting invoked?

Thanks
VInoth

On Tue, Jan 21, 2020 at 12:57 PM Venki g <ve...@gmail.com> wrote:

> Hi Vinoth,
>
> Thanks for looking into this.
>
> The source delta file had 7877 records in it,
>
> Driver log showing the number of records - 20/01/17 03:51:04 INFO
> HoodieBloomIndex: TotalRecords 7877, TotalFiles 30, TotalAffectedPartitions
> 29, TotalComparisons 7877, SafeParallelism 1
>
> Looking at the commit history, looks like some changes were done on the
> partition(esp delete).
>
> 20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.clean.duration,
> value=93112
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.clean.numFilesDeleted, value=1
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.commitTime, value=1579233069000
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.duration, value=127138
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalBytesWritten, value=6908850
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalCompactedRecordsUpdated, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalCreateTime, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalFilesInsert, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalFilesUpdate, value=1
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalInsertRecordsWritten, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalLogFilesCompacted, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalLogFilesSize, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalPartitionsWritten, value=1
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalRecordsWritten, value=95705
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalScanTime, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalUpdateRecordsWritten, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalUpsertTime, value=7013
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.deltastreamer.duration, value=153732
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.deltastreamer.hiveSyncDuration, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.finalize.duration, value=437
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.finalize.numFilesFinalized, value=1
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.index.update.duration, value=0
>
> Assuming, it was an empty commit. Shouldn't it be still writing the
> checkpoint key read from the last commit to the empty commit file? Since
> checkpoint key is always needed on the recent commit file to avoid this
> exception(*HoodieDeltaStreamerException: Unable to find previous
> checkpoint. Please double check if this table was indeed built via delta
> streamer*).
>
> Can I workaround the problem by passing the most recent checkpoint key in
> the config while calling deltastreamer.sync()?
>
> Thanks
> Venkatesh
>
> On Mon, Jan 20, 2020 at 5:07 PM Vinoth Chandar <vi...@apache.org> wrote:
>
> > Hi Venki,
> >
> > Thanks for reporting this. The latest commit file seems to be empty? I am
> > wondering if this is happening because there was no new data to process
> and
> > the tool wrote an empty commit file..
> > Can you confirm if this seems to match the case?
> >
> > Thanks
> > Vinoth
> >
> >
> > On Mon, Jan 20, 2020 at 4:00 PM Venki g <ve...@gmail.com> wrote:
> >
> > > Correcting the link to commit file
> > >
> > > On Mon, Jan 20, 2020 at 3:50 PM Venki g <ve...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am using a spark job to upsert the incremental delta files from S3
> > into
> > > > Hudi storage using HoodieDeltaStreamer.sync() API , The incremental
> > spark
> > > > job is failing with below exception
> > > >
> > > > java.lang.RuntimeException:
> > > > org.apache.hudi.utilities.exception.HoodieDeltaStreamerException:
> > Unable
> > > to
> > > > find previous checkpoint. Please double check if this table was
> indeed
> > > > built via delta streamer
> > > > at com.emr.java.HiveDeltaStreamer.loadData(HiveDeltaStreamer.java:36)
> > > > at com.emr.java.HudiDataLoadJob.run(HudiDataLoadJob.java:28)
> > > > at com.emr.java.HiveDeltaStreamer.main(HiveDeltaStreamer.java:19)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > at
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > at
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > at java.lang.reflect.Method.invoke(Method.java:498)
> > > > at
> > > >
> > >
> >
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
> > > > Caused by:
> > > > org.apache.hudi.utilities.exception.HoodieDeltaStreamerException:
> > Unable
> > > to
> > > > find previous checkpoint. Please double check if this table was
> indeed
> > > > built via delta streamer
> > > > at
> > > >
> > >
> >
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:252)
> > > > at
> > > >
> > >
> >
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:214)
> > > > at
> > > >
> > >
> >
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:120)
> > > > at com.emr.java.HiveDeltaStreamer.loadData(HiveDeltaStreamer.java:30)
> > > > ... 7 more
> > > >
> > > > I found the recent commit file does not have
> > > > ""deltastreamer.checkpoint.key" in the commit file. I checked the
> > second
> > > > last commit file and it has this key.
> > > >
> > > > Link to driver log(has delta streamer config passed and other info) -
> > > > https://pastebin.pl/view/raw/9606beb0
> > > >
> > > > *Link to most recent commit - https://pastebin.pl/view/raw/defc32ae
> > > > <https://pastebin.pl/view/raw/defc32ae> *
> > > >
> > > > When this happened for the first time, I was able to rollback the
> > latest
> > > > commit and loaded the data again and went past this exception. Since,
> > > this
> > > > exception has started occurring again, I would like to understand the
> > > issue
> > > > here and find the fix if any.
> > > >
> > > > Would highly appreciate any help on this.
> > > >
> > > > Thanks
> > > > Venkatesh
> > > >
> > >
> >
>

Re: HoodieDeltaStreamerException during upsert with DeltaStreamer.sync()

Posted by Venki g <ve...@gmail.com>.
Hi Vinoth,

Thanks for looking into this.

The source delta file had 7877 records in it,

Driver log showing the number of records - 20/01/17 03:51:04 INFO
HoodieBloomIndex: TotalRecords 7877, TotalFiles 30, TotalAffectedPartitions
29, TotalComparisons 7877, SafeParallelism 1

Looking at the commit history, looks like some changes were done on the
partition(esp delete).

20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.clean.duration,
value=93112
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.clean.numFilesDeleted, value=1
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.commit.commitTime, value=1579233069000
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.commit.duration, value=127138
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.commit.totalBytesWritten, value=6908850
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.commit.totalCompactedRecordsUpdated, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.commit.totalCreateTime, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.commit.totalFilesInsert, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.commit.totalFilesUpdate, value=1
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.commit.totalInsertRecordsWritten, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.commit.totalLogFilesCompacted, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.commit.totalLogFilesSize, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.commit.totalPartitionsWritten, value=1
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.commit.totalRecordsWritten, value=95705
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.commit.totalScanTime, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.commit.totalUpdateRecordsWritten, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.commit.totalUpsertTime, value=7013
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.deltastreamer.duration, value=153732
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.deltastreamer.hiveSyncDuration, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.finalize.duration, value=437
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.finalize.numFilesFinalized, value=1
20/01/17 03:53:18 INFO metrics: type=GAUGE,
name=HZ_PARTIES.index.update.duration, value=0

Assuming, it was an empty commit. Shouldn't it be still writing the
checkpoint key read from the last commit to the empty commit file? Since
checkpoint key is always needed on the recent commit file to avoid this
exception(*HoodieDeltaStreamerException: Unable to find previous
checkpoint. Please double check if this table was indeed built via delta
streamer*).

Can I workaround the problem by passing the most recent checkpoint key in
the config while calling deltastreamer.sync()?

Thanks
Venkatesh

On Mon, Jan 20, 2020 at 5:07 PM Vinoth Chandar <vi...@apache.org> wrote:

> Hi Venki,
>
> Thanks for reporting this. The latest commit file seems to be empty? I am
> wondering if this is happening because there was no new data to process and
> the tool wrote an empty commit file..
> Can you confirm if this seems to match the case?
>
> Thanks
> Vinoth
>
>
> On Mon, Jan 20, 2020 at 4:00 PM Venki g <ve...@gmail.com> wrote:
>
> > Correcting the link to commit file
> >
> > On Mon, Jan 20, 2020 at 3:50 PM Venki g <ve...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I am using a spark job to upsert the incremental delta files from S3
> into
> > > Hudi storage using HoodieDeltaStreamer.sync() API , The incremental
> spark
> > > job is failing with below exception
> > >
> > > java.lang.RuntimeException:
> > > org.apache.hudi.utilities.exception.HoodieDeltaStreamerException:
> Unable
> > to
> > > find previous checkpoint. Please double check if this table was indeed
> > > built via delta streamer
> > > at com.emr.java.HiveDeltaStreamer.loadData(HiveDeltaStreamer.java:36)
> > > at com.emr.java.HudiDataLoadJob.run(HudiDataLoadJob.java:28)
> > > at com.emr.java.HiveDeltaStreamer.main(HiveDeltaStreamer.java:19)
> > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > at
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > at
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > at java.lang.reflect.Method.invoke(Method.java:498)
> > > at
> > >
> >
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
> > > Caused by:
> > > org.apache.hudi.utilities.exception.HoodieDeltaStreamerException:
> Unable
> > to
> > > find previous checkpoint. Please double check if this table was indeed
> > > built via delta streamer
> > > at
> > >
> >
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:252)
> > > at
> > >
> >
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:214)
> > > at
> > >
> >
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:120)
> > > at com.emr.java.HiveDeltaStreamer.loadData(HiveDeltaStreamer.java:30)
> > > ... 7 more
> > >
> > > I found the recent commit file does not have
> > > ""deltastreamer.checkpoint.key" in the commit file. I checked the
> second
> > > last commit file and it has this key.
> > >
> > > Link to driver log(has delta streamer config passed and other info) -
> > > https://pastebin.pl/view/raw/9606beb0
> > >
> > > *Link to most recent commit - https://pastebin.pl/view/raw/defc32ae
> > > <https://pastebin.pl/view/raw/defc32ae> *
> > >
> > > When this happened for the first time, I was able to rollback the
> latest
> > > commit and loaded the data again and went past this exception. Since,
> > this
> > > exception has started occurring again, I would like to understand the
> > issue
> > > here and find the fix if any.
> > >
> > > Would highly appreciate any help on this.
> > >
> > > Thanks
> > > Venkatesh
> > >
> >
>

Re: HoodieDeltaStreamerException during upsert with DeltaStreamer.sync()

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Venki,

Thanks for reporting this. The latest commit file seems to be empty? I am
wondering if this is happening because there was no new data to process and
the tool wrote an empty commit file..
Can you confirm if this seems to match the case?

Thanks
Vinoth


On Mon, Jan 20, 2020 at 4:00 PM Venki g <ve...@gmail.com> wrote:

> Correcting the link to commit file
>
> On Mon, Jan 20, 2020 at 3:50 PM Venki g <ve...@gmail.com> wrote:
>
> > Hi,
> >
> > I am using a spark job to upsert the incremental delta files from S3 into
> > Hudi storage using HoodieDeltaStreamer.sync() API , The incremental spark
> > job is failing with below exception
> >
> > java.lang.RuntimeException:
> > org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable
> to
> > find previous checkpoint. Please double check if this table was indeed
> > built via delta streamer
> > at com.emr.java.HiveDeltaStreamer.loadData(HiveDeltaStreamer.java:36)
> > at com.emr.java.HudiDataLoadJob.run(HudiDataLoadJob.java:28)
> > at com.emr.java.HiveDeltaStreamer.main(HiveDeltaStreamer.java:19)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:498)
> > at
> >
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
> > Caused by:
> > org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable
> to
> > find previous checkpoint. Please double check if this table was indeed
> > built via delta streamer
> > at
> >
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:252)
> > at
> >
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:214)
> > at
> >
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:120)
> > at com.emr.java.HiveDeltaStreamer.loadData(HiveDeltaStreamer.java:30)
> > ... 7 more
> >
> > I found the recent commit file does not have
> > ""deltastreamer.checkpoint.key" in the commit file. I checked the second
> > last commit file and it has this key.
> >
> > Link to driver log(has delta streamer config passed and other info) -
> > https://pastebin.pl/view/raw/9606beb0
> >
> > *Link to most recent commit - https://pastebin.pl/view/raw/defc32ae
> > <https://pastebin.pl/view/raw/defc32ae> *
> >
> > When this happened for the first time, I was able to rollback the latest
> > commit and loaded the data again and went past this exception. Since,
> this
> > exception has started occurring again, I would like to understand the
> issue
> > here and find the fix if any.
> >
> > Would highly appreciate any help on this.
> >
> > Thanks
> > Venkatesh
> >
>

Re: HoodieDeltaStreamerException during upsert with DeltaStreamer.sync()

Posted by Venki g <ve...@gmail.com>.
Correcting the link to commit file

On Mon, Jan 20, 2020 at 3:50 PM Venki g <ve...@gmail.com> wrote:

> Hi,
>
> I am using a spark job to upsert the incremental delta files from S3 into
> Hudi storage using HoodieDeltaStreamer.sync() API , The incremental spark
> job is failing with below exception
>
> java.lang.RuntimeException:
> org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to
> find previous checkpoint. Please double check if this table was indeed
> built via delta streamer
> at com.emr.java.HiveDeltaStreamer.loadData(HiveDeltaStreamer.java:36)
> at com.emr.java.HudiDataLoadJob.run(HudiDataLoadJob.java:28)
> at com.emr.java.HiveDeltaStreamer.main(HiveDeltaStreamer.java:19)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
> Caused by:
> org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to
> find previous checkpoint. Please double check if this table was indeed
> built via delta streamer
> at
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:252)
> at
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:214)
> at
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:120)
> at com.emr.java.HiveDeltaStreamer.loadData(HiveDeltaStreamer.java:30)
> ... 7 more
>
> I found the recent commit file does not have
> ""deltastreamer.checkpoint.key" in the commit file. I checked the second
> last commit file and it has this key.
>
> Link to driver log(has delta streamer config passed and other info) -
> https://pastebin.pl/view/raw/9606beb0
>
> *Link to most recent commit - https://pastebin.pl/view/raw/defc32ae
> <https://pastebin.pl/view/raw/defc32ae> *
>
> When this happened for the first time, I was able to rollback the latest
> commit and loaded the data again and went past this exception. Since, this
> exception has started occurring again, I would like to understand the issue
> here and find the fix if any.
>
> Would highly appreciate any help on this.
>
> Thanks
> Venkatesh
>