Posted to common-issues@hadoop.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2020/04/17 14:29:00 UTC

[jira] [Commented] (HADOOP-16994) hadoop output to ftp gives rename error on FileOutputCommitter.mergePaths

    [ https://issues.apache.org/jira/browse/HADOOP-16994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085810#comment-17085810 ] 

Steve Loughran commented on HADOOP-16994:
-----------------------------------------

The file output committer uses renames to move task attempt output into a completed task attempt dir, then, when the job is committed, into the final destination. We need to do this for resilience: all filesystems for which it works must support atomic directory renames.
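
As illustration, here is a minimal sketch of that rename-based commit sequence (the v1 algorithm) against the Hadoop FileSystem API. The class name and paths are hypothetical, and the real FileOutputCommitter.mergePaths also handles merging into existing directories and task recovery:

```
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Simplified sketch of the FileOutputCommitter v1 commit flow.
// Hypothetical paths; the real committer also merges into existing
// directories (mergePaths) and supports recovery of committed tasks.
public class CommitFlowSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path dest = new Path("hdfs:///output");
    Path taskAttempt =
        new Path(dest, "_temporary/0/_temporary/attempt_0000_m_000000_0");
    Path committedTask = new Path(dest, "_temporary/0/task_0000_m_000000");
    FileSystem fs = dest.getFileSystem(conf);

    // Task commit: one atomic rename promotes the whole attempt dir.
    // If the attempt failed, nothing visible has changed.
    if (!fs.rename(taskAttempt, committedTask)) {
      throw new IOException("task commit failed");
    }

    // Job commit: files under the committed task dir are renamed into
    // the final destination (roughly what mergePaths does). This is
    // the step that fails on FTP.
    for (FileStatus st : fs.listStatus(committedTask)) {
      if (!fs.rename(st.getPath(), new Path(dest, st.getPath().getName()))) {
        throw new IOException("job commit failed for " + st.getPath());
      }
    }
  }
}
```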

HDFS, abfs: requirements are met.
FTP doesn't support cross-directory renames: it fails.
S3A doesn't have the atomicity and its rename is O(data): we have a special committer for it which uses multipart upload.
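
You can see the FTP limitation directly from the FileSystem API; a hypothetical probe (made-up host and paths) hits the same check that appears in the stack trace below:

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical probe: a cross-directory rename through FTPFileSystem
// fails with "only same directory renames are supported", exactly
// the error in the stack trace below.
public class FtpRenameProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path src = new Path("ftp://user:pass@host/staging/part-00000.parquet");
    Path dst = new Path("ftp://user:pass@host/final/part-00000.parquet");
    FileSystem fs = src.getFileSystem(conf);
    fs.rename(src, dst); // throws java.io.IOException on FTPFileSystem
  }
}
```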

Afraid FTP is not going to work. Either you implement your own committer using whatever limited operations FTP offers (hard to impossible), or use a different FS (NFS? some shared local mountpoint?).
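
If the data must end up on FTP, a sketch of that second option (hypothetical paths throughout): let the committer run against a filesystem with working renames, such as a shared mount, then copy the finished output to the FTP server as a separate step:

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Sketch: commit the job to a shared local/NFS mount where rename
// works, then push the result to FTP with a plain copy. No renames
// happen on the FTP side, so its restriction never applies.
public class CopyOutputToFtp {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path staged = new Path("file:///mnt/shared/sensor_values/1585353600000");
    Path ftpDest = new Path("ftp://user:pass@host/sensor_values/1585353600000");

    FileSystem srcFs = staged.getFileSystem(conf);
    FileSystem dstFs = ftpDest.getFileSystem(conf);

    FileUtil.copy(srcFs, staged, dstFs, ftpDest,
        false /* deleteSource */, true /* overwrite */, conf);
  }
}
```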

Closing as wontfix.

> hadoop output to ftp gives rename error on FileOutputCommitter.mergePaths
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-16994
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16994
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>            Reporter: Talha Azaz
>            Priority: Major
>
> I'm using Spark in Kubernetes cluster mode, trying to read data from a DB and write it in Parquet format to an FTP server. I'm using the Hadoop FTP filesystem for writing. When a task completes, it tries to rename /sensor_values/1585353600000/_temporary/0/_temporary/attempt_20200414075519_0000_m_000021_21/part-00021-d7cef14e-151b-4c3b-a8d8-4e9ab33e80f9-c000.snappy.parquet
> to 
> /sensor_values/1585353600000/part-00021-d7cef14e-151b-4c3b-a8d8-4e9ab33e80f9-c000.snappy.parquet
> But it fails with the following error:
> ```
> Lost task 21.0 in stage 0.0 (TID 21, 10.233.90.137, executor 3): org.apache.spark.SparkException: Task failed while writing rows.
>  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
>  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>  at org.apache.spark.scheduler.Task.run(Task.scala:123)
>  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Cannot rename source: ftp://user:pass@host/sensor_values/1585353600000/_temporary/0/_temporary/attempt_20200414075519_0000_m_000021_21/part-00021-d7cef14e-151b-4c3b-a8d8-4e9ab33e80f9-c000.snappy.parquet to ftp://user:pass@host/sensor_values/1585353600000/part-00021-d7cef14e-151b-4c3b-a8d8-4e9ab33e80f9-c000.snappy.parquet -only same directory renames are supported
>  at org.apache.hadoop.fs.ftp.FTPFileSystem.rename(FTPFileSystem.java:674)
>  at org.apache.hadoop.fs.ftp.FTPFileSystem.rename(FTPFileSystem.java:613)
>  at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:472)
>  at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:486)
>  at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:597)
>  at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:560)
>  at org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50)
>  at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:77)
>  at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitTask(HadoopMapReduceCommitProtocol.scala:225)
>  at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:78)
>  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
>  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
>  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
>  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
>  ... 10 more
> ```
> I have done the same thing on the Azure filesystem using the same Spark and Hadoop implementation.
> Is there any configuration in Hadoop or Spark that needs to be changed, or is this just not supported by the Hadoop FTP filesystem?
> Thanks a lot!!


