Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/10/08 12:51:52 UTC

[GitHub] [hudi] KarthickAN opened a new issue #2154: [SUPPORT] Throwing org.apache.spark.shuffle.FetchFailedException consistently

KarthickAN opened a new issue #2154:
URL: https://github.com/apache/hudi/issues/2154


   Any insight into this issue? I keep getting this consistently and need help resolving it.
   
   
   **Stacktrace:**
   py4j.protocol.Py4JJavaError: An error occurred while calling o171.save.
   : org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 9 (flatMapToPair at HoodieBloomIndex.java:302) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: java.util.concurrent.TimeoutException: Timeout waiting for task.
   	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
   	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:485)
   	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64)
   	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
   	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
   	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
   	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
   	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
   	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
   	at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:102)
   	at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
   	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
   	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
   	at org.apache.spark.scheduler.Task.run(Task.scala:121)
   	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
   	at org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
   	at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:258)
   	at org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:70)
   	at org.apache.spark.network.crypto.AuthClientBootstrap.doSaslAuth(AuthClientBootstrap.java:115)
   	at org.apache.spark.network.crypto.AuthClientBootstrap.doBootstrap(AuthClientBootstrap.java:74)
   	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:257)
   	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
   	at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:114)
   	at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
   	at org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169)
   	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
   	... 1 more
   Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
   	at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
   	at org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
   	at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:254)
   	... 14 more
   	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
   	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
   	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
   	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
   	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
   	at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1493)
   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2107)
   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
   	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
   	at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1364)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
   	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
   	at org.apache.spark.rdd.RDD.take(RDD.scala:1337)
   	at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1472)
   	at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1472)
   	at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1472)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
   	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
   	at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1471)
   	at org.apache.spark.api.java.JavaRDDLike$class.isEmpty(JavaRDDLike.scala:544)
   	at org.apache.spark.api.java.AbstractJavaRDDLike.isEmpty(JavaRDDLike.scala:45)
   	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:164)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:125)
   	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
   	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
   	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
   	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
   	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
   	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
   	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
   	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
    	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:748)
   
   
   **Configuration**
   {
   "hoodie.table.Name": "event_processed_cow_jd",
   "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
   "hoodie.datasource.write.recordkey.field": "sourceid,sourceassetid,sourceeventid,value,timestamp",
   "hoodie.datasource.write.table.Type": "COPY_ON_WRITE",
   "hoodie.datasource.write.partitionpath.field": "date,sourceid",
   "hoodie.datasource.write.hive_style_partitioning": true,
   "hoodie.datasource.write.table.Name": "event_processed_cow_jd",
   "hoodie.datasource.write.operation": "insert",
   "hoodie.parquet.compression.codec": "snappy",
   "hoodie.parquet.compression.ratio": "6",
   "hoodie.parquet.small.file.limit": "104857600",
   "hoodie.parquet.max.file.size": "134217728",
   "hoodie.parquet.block.size": "134217728",
   "hoodie.copyonwrite.insert.split.size": "4880640",
   "hoodie.copyonwrite.record.size.estimate": "165",
   "hoodie.cleaner.commits.retained": 1,
   "hoodie.combine.before.insert": true,
   "hoodie.datasource.write.precombine.field": "timestamp",
   "hoodie.insert.shuffle.parallelism": 10,
   "hoodie.datasource.write.insert.drop.duplicates": true
   }
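   
   For reference, a minimal sketch of how these options are applied in the write. The DataFrame `df`, the S3 target path, and the standard lowercase option keys shown below are placeholders for illustration, not the exact Glue job code:
   
   ```python
   # Minimal sketch only: df, the bucket/path, and the option subset are placeholders.
   hudi_options = {
       "hoodie.table.name": "event_processed_cow_jd",
       "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.write.recordkey.field": "sourceid,sourceassetid,sourceeventid,value,timestamp",
       "hoodie.datasource.write.partitionpath.field": "date,sourceid",
       "hoodie.datasource.write.precombine.field": "timestamp",
       "hoodie.datasource.write.operation": "insert",
       # ...plus the remaining options from the list above...
   }
   
   # df is the incoming batch DataFrame for this write.
   (df.write
      .format("hudi")
      .options(**hudi_options)
      .mode("append")
      .save("s3://<bucket>/event_processed_cow_jd/"))
   ```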
   
   **Environment Description**
   
   Hudi version : 0.6.0
   
   Spark version : 2.4.3
   
   Hadoop version : 2.8.5-amzn-1
   
   Storage (HDFS/S3/GCS..) : S3
   
   Running on Docker? (yes/no) : No. Running on AWS Glue





[GitHub] [hudi] vinothchandar commented on issue #2154: [SUPPORT] Throwing org.apache.spark.shuffle.FetchFailedException consistently

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #2154:
URL: https://github.com/apache/hudi/issues/2154#issuecomment-705696840


   This is typically just a Spark shuffle failure. Sometimes the remote executor OOMs (or dies for some other reason) and you see the symptom on another executor. You can check the Spark UI for excessive GC on the executors, or for executors that have died.
   
   Typically, throwing more memory at the job gets it to pass.
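   
   For example, something along these lines when building the session; the values are only placeholders to tune for your cluster, and on Glue the practical equivalent is usually a larger worker type or more workers:
   
   ```python
   # Illustrative only: more executor memory plus a bit more shuffle-fetch tolerance.
   from pyspark.sql import SparkSession
   
   spark = (SparkSession.builder
            .appName("hudi-write")
            .config("spark.executor.memory", "8g")           # more heap per executor
            .config("spark.executor.memoryOverhead", "2g")   # off-heap headroom for shuffle/Netty buffers
            .config("spark.network.timeout", "600s")         # tolerate slow shuffle fetches
            .config("spark.shuffle.io.maxRetries", "10")     # retry transient fetch failures
            .config("spark.shuffle.io.retryWait", "30s")
            .getOrCreate())
   ```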





[GitHub] [hudi] KarthickAN commented on issue #2154: [SUPPORT] Throwing org.apache.spark.shuffle.FetchFailedException consistently

Posted by GitBox <gi...@apache.org>.
KarthickAN commented on issue #2154:
URL: https://github.com/apache/hudi/issues/2154#issuecomment-706861733


   @vinothchandar Thanks, Vinoth. Yes, with the increased memory it ran without any issues.





[GitHub] [hudi] bvaradar closed issue #2154: [SUPPORT] Throwing org.apache.spark.shuffle.FetchFailedException consistently

Posted by GitBox <gi...@apache.org>.
bvaradar closed issue #2154:
URL: https://github.com/apache/hudi/issues/2154


   

