You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by jamborta <ja...@gmail.com> on 2016/09/29 19:26:37 UTC

writing to s3 failing to move parquet files from temporary folder

Hi,

I have an 8 hour job (spark 2.0.0) that writes the results out to parquet
using the standard approach:

processed_images_df.write.format("parquet").save(s3_output_path)

It executes 10000 tasks and writes the results to a _temporary folder, and
in the last step (after all the tasks completed) it copies the parquet files
from the _temporary folder, but after copying about 2-3000 files it fails
with the following (first I thought this was a temporary s3 failure, but I
rerun 3 times and getting the same error):

org.apache.spark.SparkException: Job aborted.
	at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(I
nsertIntoHadoopFsRelationCommand.scala:149)
	at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIn
toHadoopFsRelationCommand.scala:115)
	at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIn
toHadoopFsRelationCommand.scala:115)
	at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
	at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelatio
nCommand.scala:115)
	at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
	at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
	at
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
	at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
	at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
	at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
	at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
	at
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
	at
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
	at
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:211)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.NoHttpResponseException:
s3-bucket.s3.amazonaws.com:443 failed to respon
d
	at
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
	at
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
	at
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
	at
org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:
283)
	at
org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:259)
	at
org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:2
32)
	at
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
	at
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
	at
org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:686)
	at
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:488)
	at
org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:884)
	at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
	at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
	at
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:326)
	at
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:277)
	at
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestPut(RestStorageService.java:1143)
	at
org.jets3t.service.impl.rest.httpclient.RestStorageService.copyObjectImpl(RestStorageService.java:2117)
	at org.jets3t.service.StorageService.copyObject(StorageService.java:898)
	at org.jets3t.service.StorageService.copyObject(StorageService.java:943)
	at
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.copy(Jets3tNativeFileSystemStore.java:320)
	at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source)
	at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
	at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
	at org.apache.hadoop.fs.s3native.$Proxy20.copy(Unknown Source)
	at
org.apache.hadoop.fs.s3native.NativeS3FileSystem.rename(NativeS3FileSystem.java:645)
	at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:345)
	at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:362)
	at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
	at
org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
	at
org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222)
	at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
	... 29 more



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/writing-to-s3-failing-to-move-parquet-files-from-temporary-folder-tp27817.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org