Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/05/02 05:40:17 UTC

[GitHub] [hudi] BalaMahesh commented on issue #4230: [SUPPORT] org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file

BalaMahesh commented on issue #4230:
URL: https://github.com/apache/hudi/issues/4230#issuecomment-1114515094

   Hello @yihua ,
   
   We are observing the same behaviour in the scenario below as well: executor pods (running on k8s) are dying, and after the maximum retry attempts for the lost shuffle block, the driver kills all the executor pods and then gets stuck without terminating. Because of this, the spark-operator does not restart the Hudi application. Will this PR fix this scenario as well?
   
   Stack traces:
   
   22/04/29 19:09:36 INFO pool-32-thread-1 DAGScheduler: Job 907 failed: sum at DeltaSync.java:557, took 1588.119949 s
   22/04/29 19:09:36 ERROR pool-32-thread-1 HoodieDeltaStreamer: Shutting down delta-sync due to exception
   org.apache.spark.SparkException: Job aborted due to stage failure: ResultStage 1221 (sum at DeltaSync.java:557) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 189 	at org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$2(MapOutputTracker.scala:1013) 
   
   22/04/29 19:09:36 INFO pool-32-thread-1 DeltaSync: Shutting down embedded timeline server
   22/04/29 19:09:36 INFO pool-32-thread-1 EmbeddedTimelineService: Closing Timeline server
   22/04/29 19:09:36 INFO pool-32-thread-1 TimelineService: Closing Timeline Service
   22/04/29 19:09:36 INFO pool-32-thread-1 Javalin: Stopping Javalin ...
   22/04/29 19:09:36 INFO pool-32-thread-1 Javalin: Javalin has stopped
   22/04/29 19:09:36 INFO main SparkUI: Stopped Spark web UI at http://spark-db4e548074c3c6b1-driver-svc.spark.svc:4040
   22/04/29 19:09:36 INFO pool-32-thread-1 TimelineService: Closed Timeline Service
   22/04/29 19:09:36 INFO pool-32-thread-1 EmbeddedTimelineService: Closed Timeline server
   22/04/29 19:09:36 INFO main KubernetesClusterSchedulerBackend: Shutting down all executors
   22/04/29 19:09:36 INFO dispatcher-CoarseGrainedScheduler KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
   22/04/29 19:09:36 WARN main ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
   22/04/29 19:09:40 INFO dispatcher-event-loop-0 MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
   22/04/29 19:09:40 INFO main MemoryStore: MemoryStore cleared
   22/04/29 19:09:40 INFO main BlockManager: BlockManager stopped
   22/04/29 19:09:40 INFO main BlockManagerMaster: BlockManagerMaster stopped
   22/04/29 19:09:40 INFO dispatcher-event-loop-0 OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
   22/04/29 19:09:40 INFO main SparkContext: Successfully stopped SparkContext
   Exception in thread "main" org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: Job aborted due to stage failure: ResultStage 1221 (sum at DeltaSync.java:557) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 189
   	at org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$2(MapOutputTracker.scala:1013)
   	at org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$2$adapted(MapOutputTracker.scala:1009)
   	at scala.collection.Iterator.foreach(Iterator.scala:941)
   	at scala.collection.Iterator.foreach$(Iterator.scala:941)
   	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
   	at org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:1009)
   	at org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:821)
   	at org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:133)
   	at org.apache.spark.shuffle.ShuffleManager.getReader(ShuffleManager.scala:63)
   	at org.apache.spark.shuffle.ShuffleManager.getReader$(ShuffleManager.scala:57)
   	at org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:73)
   	at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   	at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
   	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1440)
   	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
   	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
   	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
   	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   	at org.apache.spark.scheduler.Task.run(Task.scala:131)
   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
   	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
   	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
   	at java.base/java.lang.Thread.run(Unknown Source)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$1(HoodieDeltaStreamer.java:184)
   	at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:179)
   	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:514)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
   	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
   	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
   	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
   	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
   	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
   	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
   	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
   	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
   	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   
   
   Even after all of the logs above, the Hudi application stays stuck and doesn't restart or process any data.
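   For what it's worth, the hang we see is consistent with the classic "non-daemon thread keeps the JVM alive" pattern: `SparkContext` stops cleanly, but some worker thread (like the `pool-32-thread-1` threads in the logs) is still alive, so the driver process, and therefore the k8s pod, never terminates, and the operator never restarts it. Below is a minimal, self-contained Java sketch (not Hudi code; all class and variable names are illustrative) showing that such a pool must be shut down, or `System.exit` called, before the JVM can exit:

   ```java
   import java.util.concurrent.CountDownLatch;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.TimeUnit;

   public class StuckDriverDemo {
       public static void main(String[] args) throws Exception {
           // Non-daemon worker pool, analogous to the pool threads in the logs.
           ExecutorService pool = Executors.newFixedThreadPool(1);
           CountDownLatch started = new CountDownLatch(1);
           pool.submit(() -> {
               started.countDown();
               try {
                   // Simulates a sync loop that never returns after the failure.
                   Thread.sleep(Long.MAX_VALUE);
               } catch (InterruptedException e) {
                   Thread.currentThread().interrupt(); // exit on interrupt
               }
           });
           started.await();
           // Without this shutdown, main() returns but the JVM (and the pod)
           // stays alive, so spark-operator never sees the pod terminate.
           pool.shutdownNow();
           boolean terminated = pool.awaitTermination(5, TimeUnit.SECONDS);
           System.out.println("terminated=" + terminated);
       }
   }
   ```

   In a real driver, a defensive `System.exit(1)` after a fatal delta-sync error (or an uncaught-exception handler that calls it) would also force the pod to terminate so the operator's restart policy can kick in.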

