Posted to user@spark.apache.org by Rohit Verma <ro...@rokittech.com> on 2017/03/09 09:41:09 UTC

Spark failing while persisting sorted columns.

Hi all,

Please help me with the scenario below.

I am running the query below on a large dataset (rowCount = 100,000,000):

// other instances of this job are being submitted to Spark concurrently from a multithreaded app.

final Dataset<Row> df = spark.read().parquet(tablePath);
// df storage in HDFS is 5.64 GB across 45 blocks.
df.select(col).na().drop()
  .dropDuplicates(col)
  .coalesce(20)
  .sort(df.col(col))
  .coalesce(1)
  .write()
  .mode(SaveMode.Ignore)
  .csv(path);
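
As an aside on this pipeline (an assumption about the physical plan, not something confirmed in this thread): coalesce() does not shuffle, so the trailing coalesce(1) can be folded back into the sort stage and force the entire 100M-row range-partition-and-sort onto a single task, while the coalesce(20) similarly caps the upstream parallelism at 20. A minimal sketch of an alternative, assuming a single output file is not strictly required (sort() range-partitions, so its part files are globally ordered across files and can be merged afterwards with hadoop fs -getmerge):

final Dataset<Row> df = spark.read().parquet(tablePath);
final Dataset<Row> deduped = df.select(col).na().drop().dropDuplicates(col);
// sort() range-partitions, so part-00000 <= part-00001 <= ... across files;
// writing without coalesce(1) keeps the sort spread over many tasks.
deduped.sort(deduped.col(col))
       .write()
       .mode(SaveMode.Ignore)
       .csv(path);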

I am getting the exception below:

Task failed while writing rows
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 2991


Here are the Spark env details:


  *   Cores in use: 20 Total, 0 Used
  *   Memory in use: 72.2 GB Total, 0.0 B Used

And the process configuration is:

"spark.cores.max", “20"
"spark.executor.memory", “3400MB"
“spark.kryoserializer.buffer.max”,”1000MB”
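
For reference, a minimal sketch of how these settings might be applied when the session is built (the app name is hypothetical; the values mirror the ones above):

import org.apache.spark.sql.SparkSession;

final SparkSession spark = SparkSession.builder()
    .appName("dedup-sort-export")                        // hypothetical name
    .config("spark.cores.max", "20")
    .config("spark.executor.memory", "3400MB")
    .config("spark.kryoserializer.buffer.max", "1000MB")
    .getOrCreate();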

Any leads would be highly appreciated.

Regards
Rohit Verma



Re: Spark failing while persisting sorted columns.

Posted by Yong Zhang <ja...@hotmail.com>.
My guess is that your executor already crashed, possibly due to OOM. You should check the executor log; it may give you more information.


Yong
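
A minimal sketch of one way to follow that advice, assuming the job can be resubmitted: have each executor JVM write a heap dump when it hits an OutOfMemoryError, so the executor stderr (under the worker's work/<app-id>/<executor-id>/ directory on a standalone cluster) and the .hprof file show what filled the heap. The dump path below is an assumed example.

import org.apache.spark.sql.SparkSession;

// Sketch: make executor OOMs visible. /tmp/executor-oom.hprof is an
// assumed path; pick a disk with enough free space on each worker.
final SparkSession spark = SparkSession.builder()
    .config("spark.executor.memory", "3400MB")
    .config("spark.executor.extraJavaOptions",
            "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/executor-oom.hprof")
    .getOrCreate();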

