Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/02/11 23:45:21 UTC

[GitHub] [hudi] rohit-m-99 opened a new issue #4796: [SUPPORT] Cannot run HoodieDeltaStreamer the second time when using async clustering

rohit-m-99 opened a new issue #4796:
URL: https://github.com/apache/hudi/issues/4796


   **Describe the problem you faced**
   
   When running the following script, I am able to ingest sims and cluster. However, when I kill the Spark job and rerun it, I am met with an NPE.
   
   `#!/bin/bash
   spark-submit \
   --jars /opt/spark/jars/hadoop-aws.jar,/opt/spark/jars/aws-java-sdk.jar,/opt/spark/jars/spark-avro.jar \
   --master spark://spark-master:7077 \
   --total-executor-cores 10 \
   --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer opt/spark/jars/hudi-utilities-bundle.jar \
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
   --target-table per_tick_stats \
   --table-type COPY_ON_WRITE \
   --continuous \
   --source-ordering-field $3 \
   --target-base-path $2 \
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=$1 \
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
   --hoodie-conf hoodie.datasource.write.recordkey.field=$4 \
   --hoodie-conf hoodie.datasource.write.precombine.field=$3 \
   --hoodie-conf hoodie.clustering.async.enabled=true \
   --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=$5 \
   --hoodie-conf hoodie.datasource.write.partitionpath.field=''`
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Run HoodieDeltaStreamer with async clustering and no partition path
   2. Kill the job
   3. Rerun the job
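   
   The steps above can be sketched as a small helper. This is only an illustration: `run_kill_rerun` is a hypothetical name, the delay is arbitrary, and `kill` sends SIGTERM, which may differ from how the job was actually killed.
   
   ```shell
   #!/bin/bash
   # Hypothetical helper: start a command in the background, kill it after
   # a delay, then rerun the same command in the foreground.
   # Usage: run_kill_rerun <delay-seconds> <command> [args...]
   run_kill_rerun() {
     local delay="$1"
     shift
     "$@" &                              # first run, in the background
     local pid=$!
     sleep "$delay"                      # let it ingest for a while
     kill "$pid" 2>/dev/null             # step 2: kill the job (SIGTERM)
     wait "$pid" 2>/dev/null || true     # reap it, ignoring its exit status
     "$@"                                # step 3: rerun; its status is returned
   }
   ```
   
   For example, `run_kill_rerun 600 ./run_deltastreamer.sh <args>`, where `run_deltastreamer.sh` stands in for the script above. With `--continuous`, the rerun stays in the foreground until it exits or fails.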
   
   **Expected behavior**
   
   The rerun of the job should continue without an NPE.
   
   **Environment Description**
   
   * Hudi version : 0.10.1 (also saw this on 0.9.0)
   
   * Spark version : 3.0.3
   
   * Hadoop version : 3.2.0
   
   * Storage (HDFS/S3/GCS..) : AWS S3
   
   * Running on Docker? (yes/no) : No
   
   **Stacktrace**
   
   `22/02/11 23:26:04 ERROR HoodieDeltaStreamer: Shutting down delta-sync due to exception
   java.lang.NullPointerException
   	at org.apache.hudi.utilities.deltastreamer.DeltaSync.getClusteringInstantOpt(DeltaSync.java:864)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:648)
   	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   22/02/11 23:26:04 INFO HoodieDeltaStreamer: Delta Sync shutdown. Error ?true
   22/02/11 23:26:04 INFO HoodieDeltaStreamer: DeltaSync shutdown. Closing write client. Error?true
   22/02/11 23:26:04 ERROR HoodieAsyncService: Service shutdown with error
   java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException
   	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
   	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
   	at org.apache.hudi.async.HoodieAsyncService.waitForShutdown(HoodieAsyncService.java:89)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$1(HoodieDeltaStreamer.java:182)
   	at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:179)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:514)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
   	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
   	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
   	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
   	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
   	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
   	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
   	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   Caused by: org.apache.hudi.exception.HoodieException
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:666)
   	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: java.lang.NullPointerException
   	at org.apache.hudi.utilities.deltastreamer.DeltaSync.getClusteringInstantOpt(DeltaSync.java:864)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:648)
   	... 4 more`
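   
   The failing frame is `DeltaSync.getClusteringInstantOpt`, so before a rerun it may be worth checking whether the timeline still holds a pending clustering instant from the killed job. The sketch below assumes pending clustering shows up as `<instant>.replacecommit.requested` or `<instant>.replacecommit.inflight` files under the table's `.hoodie` directory. Insert-overwrite operations use the same `replacecommit` action, so a match is not necessarily clustering. The function name is hypothetical, and nothing in this report establishes that a pending instant causes the NPE.
   
   ```shell
   #!/bin/bash
   # Hypothetical helper: read timeline file names (or full paths) on stdin
   # and print "<instant-time> <state>" for each pending replacecommit.
   pending_replacecommits() {
     sed -n -E 's#^(.*/)?([0-9]+)\.replacecommit\.(requested|inflight)$#\2 \3#p'
   }
   ```
   
   One possible way to feed it, assuming the Hadoop CLI can reach the table path: `hadoop fs -ls "$2/.hoodie" | awk '{print $NF}' | pending_replacecommits`.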


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] rohit-m-99 commented on issue #4796: [SUPPORT] Cannot run HoodieDeltaStreamer the second time when using async clustering

Posted by GitBox <gi...@apache.org>.
rohit-m-99 commented on issue #4796:
URL: https://github.com/apache/hudi/issues/4796#issuecomment-1059605812


   We switched to inline clustering and stopped seeing the problem. Unfortunately, we were not able to get async clustering working.
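   
   A minimal sketch of what that switch could look like in the launch script from the issue. The final configuration was not posted, so the values here are assumptions; `hoodie.clustering.inline` and `hoodie.clustering.inline.max.commits` are Hudi write configs, and `CLUSTERING_ARGS` is a hypothetical variable name.
   
   ```shell
   #!/bin/bash
   # Assumed replacement for the async clustering flag: drop
   # `--hoodie-conf hoodie.clustering.async.enabled=true` and pass these instead.
   CLUSTERING_ARGS=(
     --hoodie-conf hoodie.clustering.inline=true
     --hoodie-conf hoodie.clustering.inline.max.commits=4   # illustrative value
     --hoodie-conf "hoodie.clustering.plan.strategy.sort.columns=$5"
   )
   # Then, in the spark-submit call: ... "${CLUSTERING_ARGS[@]}"
   ```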





[GitHub] [hudi] nsivabalan commented on issue #4796: [SUPPORT] Cannot run HoodieDeltaStreamer the second time when using async clustering

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4796:
URL: https://github.com/apache/hudi/issues/4796#issuecomment-1057513948


   @rohit-m-99 : Are you still looking for assistance here? Or, if you got it resolved, let us know.





[GitHub] [hudi] nsivabalan commented on issue #4796: [SUPPORT] Cannot run HoodieDeltaStreamer the second time when using async clustering

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4796:
URL: https://github.com/apache/hudi/issues/4796#issuecomment-1036988345


   I could not reproduce this locally with 0.10.1. My only suspicion is around schema providers. I am also using ParquetDFSSource, but I have provided a schema provider and could see async clustering running smoothly. I stopped and restarted the job a couple of times and all was good.





[GitHub] [hudi] nsivabalan edited a comment on issue #4796: [SUPPORT] Cannot run HoodieDeltaStreamer the second time when using async clustering

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #4796:
URL: https://github.com/apache/hudi/issues/4796#issuecomment-1036988345


   I could not reproduce this locally with 0.10.1. My only suspicion is around schema providers. I am also using ParquetDFSSource, but I have provided a schema provider and could see async clustering running smoothly. I stopped and restarted the job a couple of times and all was good.
   
   Two differences between your use case and my repro steps:
   
   - My key generator is simple.
   - I am setting schema provider configs.
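   
   For reference, schema provider settings for HoodieDeltaStreamer might look like the sketch below. The comment does not say which provider was used, so `FilebasedSchemaProvider` is an assumption, the `.avsc` paths are hypothetical placeholders, and `SCHEMA_ARGS` is a hypothetical variable name.
   
   ```shell
   #!/bin/bash
   # Assumed schema provider settings; replace the placeholder paths with
   # real Avro schema files for the source and target.
   SCHEMA_ARGS=(
     --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
     --hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=/path/to/source.avsc
     --hoodie-conf hoodie.deltastreamer.schemaprovider.target.schema.file=/path/to/target.avsc
   )
   # Then, in the spark-submit call: ... "${SCHEMA_ARGS[@]}"
   ```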





[GitHub] [hudi] nsivabalan commented on issue #4796: [SUPPORT] Cannot run HoodieDeltaStreamer the second time when using async clustering

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4796:
URL: https://github.com/apache/hudi/issues/4796#issuecomment-1037427403


   @codope @satishkotha : Can you folks spot anything?





[GitHub] [hudi] nsivabalan commented on issue #4796: [SUPPORT] Cannot run HoodieDeltaStreamer the second time when using async clustering

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4796:
URL: https://github.com/apache/hudi/issues/4796#issuecomment-1037426901


   I could not reproduce this. I also tried with ComplexKeyGenerator, an empty partition path, and no schema provider configs, yet still could not reproduce it. Sorry. We might need reproducible steps with a sample dataset, if feasible.
   
   
   ```
   /bin/spark-submit \
     --packages org.apache.spark:spark-avro_2.11:2.4.4,org \
     --driver-memory 8g \
     --executor-memory 8g \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
     path_to_/hudi-utilities-bundle_2.11-0.10.1.jar \
     --props /tmp/parquet-dfs-cluster.props \
     --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
     --source-ordering-field created_at \
     --table-type COPY_ON_WRITE \
     --target-base-path file:\/\/\/tmp/hudi-deltastreamer-gh1/ \
     --target-table gh_hudi_tbl31 \
     --op UPSERT \
     --hoodie-conf hoodie.clustering.async.enabled=true \
     --continuous \
     --source-limit 4000000 \
     --min-sync-interval-seconds 30
   ```
   
   Properties file contents:
   ```
   hoodie.datasource.write.recordkey.field=other,org.id
   hoodie.datasource.write.partitionpath.field=
   hoodie.datasource.write.precombine.field=created_at
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
   
   hoodie.metadata.enable=false
   hoodie.upsert.shuffle.parallelism=8
   hoodie.insert.shuffle.parallelism=8
   hoodie.delete.shuffle.parallelism=8
   hoodie.bulkinsert.shuffle.parallelism=8
   
   hoodie.deltastreamer.source.dfs.root=/dataset_path/
   
   hoodie.clustering.plan.strategy.sort.columns=created_at
   hoodie.clustering.plan.strategy.daybased.lookback.partitions=0
   hoodie.clustering.async.max.commits=2
   ```
   
   
   
   
   





[GitHub] [hudi] nsivabalan commented on issue #4796: [SUPPORT] Cannot run HoodieDeltaStreamer the second time when using async clustering

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4796:
URL: https://github.com/apache/hudi/issues/4796#issuecomment-1054643179


   @satishkotha : Can you please follow up here when you get a chance. 
   





[GitHub] [hudi] nsivabalan commented on issue #4796: [SUPPORT] Cannot run HoodieDeltaStreamer the second time when using async clustering

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4796:
URL: https://github.com/apache/hudi/issues/4796#issuecomment-1067586735


   Thanks! Please reach out to us if you need further assistance.





[GitHub] [hudi] nsivabalan closed issue #4796: [SUPPORT] Cannot run HoodieDeltaStreamer the second time when using async clustering

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #4796:
URL: https://github.com/apache/hudi/issues/4796


   

