Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/07/10 11:06:37 UTC

[GitHub] [hudi] jcunhafonte opened a new issue #1813: ERROR HoodieDeltaStreamer: Got error running delta sync once.

jcunhafonte opened a new issue #1813:
URL: https://github.com/apache/hudi/issues/1813


   I'm executing the CDC example scenario (http://hudi.apache.org/blog/change-capture-using-aws/) on Amazon EMR (5.30.0) and running into an issue when I run the suggested command a second or subsequent time.
   
   Have DMS generate the raw .parquet files in S3.
   Use HoodieDeltaStreamer to process the raw .parquet files:
   
   ```
   spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  \
     --packages org.apache.spark:spark-avro_2.12:2.4.4 \
     --master yarn --deploy-mode client \
     hudi-utilities-bundle_2.12-0.5.2-incubating.jar \
     --table-type COPY_ON_WRITE \
     --source-ordering-field dms_timestamp \
     --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
     --target-base-path s3://my-test-bucket/hudi_orders --target-table hudi_orders \
     --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
     --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
     --props file:///usr/lib/hudi/dfs-source.properties \
     --hoodie-conf hoodie.datasource.write.recordkey.field=id,hoodie.datasource.write.partitionpath.field=id,hoodie.deltastreamer.source.dfs.root=s3://my-test-bucket/hudi_dms/orders
   ```
   When I run it for the first time it works perfectly; however, when I try to keep "refreshing" the data from a scheduled job, I get the following error:
   ```
   ERROR HoodieDeltaStreamer: Got error running delta sync once. Shutting down
   org.apache.hudi.exception.HoodieException: Please provide a valid schema provider class!
   	at org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)
   	at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)
   	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
   	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
   	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
   	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
   	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
   	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
   	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
   	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   ```
   
   Hudi version : 0.5.2 (incubating)
   Spark version : 2.4.4
   Hive version : 3.1.2
   Storage (HDFS/S3/GCS..) : S3
   Running on Docker? (yes/no) : No
   
   Thank you.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] tooptoop4 edited a comment on issue #1813: ERROR HoodieDeltaStreamer: Got error running delta sync once.

Posted by GitBox <gi...@apache.org>.

tooptoop4 edited a comment on issue #1813:
URL: https://github.com/apache/hudi/issues/1813#issuecomment-671512191


   @bhasudha how does checkpointing work here? I.e., after the DeltaStreamer job has been running for some time, I need to stop it, destroy the old EC2 instance, launch a new one, and restart the job. How does the DeltaStreamer job know to skip the raw change-capture parquet files that were already processed into the Hudi table and resume from a certain point?



[GitHub] [hudi] bvaradar commented on issue #1813: ERROR HoodieDeltaStreamer: Got error running delta sync once.

Posted by GitBox <gi...@apache.org>.

bvaradar commented on issue #1813:
URL: https://github.com/apache/hudi/issues/1813#issuecomment-671687932


   @tooptoop4 : The checkpoints are stored as part of the .commit files in the .hoodie folder and persist across cluster and application restarts.
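
As a rough illustration of where that checkpoint lives, the sketch below parses a trimmed, made-up commit file body and pulls out the DeltaStreamer checkpoint. It assumes the commit file is JSON with an `extraMetadata` map and that the checkpoint sits under the key `deltastreamer.checkpoint.key`; both should be checked against the Hudi version in use. The sample values and the helper name `read_checkpoint` are hypothetical.

```python
import json
from typing import Optional

# Key under which DeltaStreamer is assumed to record its checkpoint
# in commit metadata; verify against your Hudi version.
CHECKPOINT_KEY = "deltastreamer.checkpoint.key"


def read_checkpoint(commit_json: str) -> Optional[str]:
    """Return the checkpoint stored in a commit file's JSON, or None."""
    metadata = json.loads(commit_json)
    extra = metadata.get("extraMetadata") or {}
    return extra.get(CHECKPOINT_KEY)


# Hypothetical, heavily trimmed contents of a .hoodie/<instant>.commit file.
sample_commit = json.dumps({
    "partitionToWriteStats": {},
    "extraMetadata": {CHECKPOINT_KEY: "1594379197000"},
})

checkpoint = read_checkpoint(sample_commit)
print(checkpoint)
```

Because the value travels with the table's own metadata on S3 rather than with the cluster, a job started on a fresh EC2 instance would see the same checkpoint.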



[GitHub] [hudi] bhasudha commented on issue #1813: ERROR HoodieDeltaStreamer: Got error running delta sync once.

Posted by GitBox <gi...@apache.org>.

bhasudha commented on issue #1813:
URL: https://github.com/apache/hudi/issues/1813#issuecomment-657295902


   @jcunhafonte  Could you try running the DeltaStreamer in continuous mode rather than as a scheduled job? I think what's happening is that the schema provider is [initialized for the very first time](https://github.com/apache/hudi/blob/20ac7c3337a14dd777f6ebe21b13dab2786f2479/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L229) if it's null (that's why your first run works fine). After that, the initialized state is lost because each scheduled run of the command recreates the DeltaSync object (whereas in continuous mode the job keeps using the same DeltaSync object with its initialized schema provider).
   
   If you would rather use the scheduled job, then the tool expects you to provide the schema provider explicitly via config.
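
One way to read "explicitly via config" is to pass a schema provider class and a schema file on the command line. The sketch below only assembles the extra arguments; it assumes the `--schemaprovider-class` option, the `org.apache.hudi.utilities.schema.FilebasedSchemaProvider` class, and the `hoodie.deltastreamer.schemaprovider.source.schema.file` property behave as documented for DeltaStreamer. The helper name and the `.avsc` path are hypothetical, and nothing in this thread confirms that this combination works with the DMS transformer, so treat it as a starting point rather than a verified fix.

```python
from typing import List


def schema_provider_args(source_schema_file: str) -> List[str]:
    """Extra spark-submit arguments that name a schema provider explicitly."""
    return [
        "--schemaprovider-class",
        "org.apache.hudi.utilities.schema.FilebasedSchemaProvider",
        "--hoodie-conf",
        "hoodie.deltastreamer.schemaprovider.source.schema.file="
        + source_schema_file,
    ]


# Hypothetical location of an Avro schema (.avsc) describing the source records.
extra_args = schema_provider_args("s3://my-test-bucket/schemas/orders.avsc")
print(" ".join(extra_args))
```

These arguments would be appended to the existing spark-submit invocation after the bundle jar.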



[GitHub] [hudi] jcunhafonte commented on issue #1813: ERROR HoodieDeltaStreamer: Got error running delta sync once.

Posted by GitBox <gi...@apache.org>.

jcunhafonte commented on issue #1813:
URL: https://github.com/apache/hudi/issues/1813#issuecomment-657559542


   Thank you for your help @bhasudha.
   
   Running the command continuously works; however, what happens if my cluster fails? Do I need to clean up the Hudi target table and execute the job again? That would lose all the historical changes.
   
   In the article shared above (http://hudi.apache.org/blog/change-capture-using-aws/) the deltastreamer command is executed multiple times and the schema is not provided; however, the Hudi version there is 0.5.1. Was this a breaking change in 0.5.2?
   
   Thank you.



[GitHub] [hudi] bvaradar commented on issue #1813: ERROR HoodieDeltaStreamer: Got error running delta sync once.

Posted by GitBox <gi...@apache.org>.

bvaradar commented on issue #1813:
URL: https://github.com/apache/hudi/issues/1813#issuecomment-658237088


   @jcunhafonte :  This could happen when there are no more files to be ingested when running in non-continuous mode. I have opened a jira to get it fixed in 0.6.0 :  https://issues.apache.org/jira/browse/HUDI-1091. With no input data, automatic schema resolution won't be possible. In continuous mode, we do cache the previous schema registry instance to handle this case. Can you try that?
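
The difference between the two modes can be pictured with a toy model. The class below is not Hudi's `DeltaSync`; it is a hypothetical stand-in that infers a "schema" from incoming files and remembers it, which is enough to show why an empty round succeeds on a long-lived object but fails on a freshly created one.

```python
from typing import List, Optional


class ToyDeltaSync:
    """Toy stand-in for a sync loop; not Hudi's DeltaSync."""

    def __init__(self) -> None:
        self.schema: Optional[str] = None

    def sync_once(self, new_files: List[str]) -> str:
        if new_files:
            # Schema is inferred from the incoming batch and remembered.
            self.schema = f"schema-from:{new_files[0]}"
        if self.schema is None:
            raise RuntimeError("Please provide a valid schema provider class!")
        return self.schema


# Continuous mode: one long-lived object, so the second, empty round succeeds.
continuous = ToyDeltaSync()
continuous.sync_once(["part-0001.parquet"])
continuous_result = continuous.sync_once([])

# Scheduled job: each run builds a fresh object, so an empty round fails.
try:
    ToyDeltaSync().sync_once([])
    scheduled_result = "ok"
except RuntimeError as err:
    scheduled_result = f"error: {err}"

print(continuous_result)
print(scheduled_result)
```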



[GitHub] [hudi] tooptoop4 commented on issue #1813: ERROR HoodieDeltaStreamer: Got error running delta sync once.

Posted by GitBox <gi...@apache.org>.

tooptoop4 commented on issue #1813:
URL: https://github.com/apache/hudi/issues/1813#issuecomment-672595024


   @bschell which PR fixes it?



[GitHub] [hudi] bhasudha commented on issue #1813: ERROR HoodieDeltaStreamer: Got error running delta sync once.

Posted by GitBox <gi...@apache.org>.

bhasudha commented on issue #1813:
URL: https://github.com/apache/hudi/issues/1813#issuecomment-657990530


   @jcunhafonte  You wouldn't need to clean up the Hudi table, as the DeltaStreamer checkpoints the source offsets along with the Hudi metadata. So when the job is run again, it can pick up where it left off last time.
   
   Regarding your second question, about the schema provider in http://hudi.apache.org/blog/change-capture-using-aws/: there was no breaking change.  @bvaradar could you share more context on what happens when the schema provider is not present, and on the implications of running the DeltaStreamer in sync-once mode repeatedly (NOT in continuous mode) without providing a schema provider?
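
To make "pick up where it left off" concrete for a file-based source, here is a minimal sketch of checkpoint-driven file selection. It assumes the checkpoint is the largest last-modified timestamp already ingested and that only files modified after it are read on the next run; the actual selection logic in Hudi's DFS sources may differ in detail. The type `SourceFile`, the function `select_new_files`, and the file names are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class SourceFile(NamedTuple):
    path: str
    modified_ms: int  # last-modified time in epoch milliseconds


def select_new_files(
    files: Iterable[SourceFile], last_checkpoint: Optional[str]
) -> Tuple[List[str], Optional[str]]:
    """Pick files newer than the checkpoint and compute the next checkpoint."""
    cutoff = int(last_checkpoint) if last_checkpoint is not None else -1
    eligible = sorted(
        (f for f in files if f.modified_ms > cutoff),
        key=lambda f: f.modified_ms,
    )
    if not eligible:
        # Nothing new: keep the old checkpoint so the next run starts from it.
        return [], last_checkpoint
    return [f.path for f in eligible], str(eligible[-1].modified_ms)


# Hypothetical listing of the DMS output directory.
listing = [
    SourceFile("orders/LOAD00000001.parquet", 1000),
    SourceFile("orders/20200710-0001.parquet", 2000),
    SourceFile("orders/20200710-0002.parquet", 3000),
]

first_paths, first_checkpoint = select_new_files(listing, None)
next_paths, next_checkpoint = select_new_files(listing, "2000")
print(first_paths, first_checkpoint)
print(next_paths, next_checkpoint)
```

Note the empty case: when no file is newer than the checkpoint, the batch is empty, which is the situation in which the schema error reported in this issue appears.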



[GitHub] [hudi] bvaradar commented on issue #1813: ERROR HoodieDeltaStreamer: Got error running delta sync once.

Posted by GitBox <gi...@apache.org>.

bvaradar commented on issue #1813:
URL: https://github.com/apache/hudi/issues/1813#issuecomment-672433664


   @jcunhafonte : @bschell confirmed it works in master. Can you try using master or wait for 0.6? (The release should happen in a week's time.)



[GitHub] [hudi] jcunhafonte commented on issue #1813: ERROR HoodieDeltaStreamer: Got error running delta sync once.

Posted by GitBox <gi...@apache.org>.

jcunhafonte commented on issue #1813:
URL: https://github.com/apache/hudi/issues/1813#issuecomment-666724588


   Thank you again for the clarification @bhasudha.
   
   @bvaradar I have tested it with continuous mode and I can confirm it works fine with that option. Without continuous mode, it works as long as there are new changes in the directories it is reading. Thank you once again. Let me know if I can clarify anything else.



[GitHub] [hudi] tooptoop4 edited a comment on issue #1813: ERROR HoodieDeltaStreamer: Got error running delta sync once.

Posted by GitBox <gi...@apache.org>.

tooptoop4 edited a comment on issue #1813:
URL: https://github.com/apache/hudi/issues/1813#issuecomment-672595024


   @bschell @bvaradar  which PR fixes it?



[GitHub] [hudi] bvaradar closed issue #1813: ERROR HoodieDeltaStreamer: Got error running delta sync once.

Posted by GitBox <gi...@apache.org>.

bvaradar closed issue #1813:
URL: https://github.com/apache/hudi/issues/1813


   


