Posted to issues@spark.apache.org by "Liang-Chi Hsieh (JIRA)" <ji...@apache.org> on 2017/07/13 05:29:00 UTC
[jira] [Comment Edited] (SPARK-20703) Add an operator for writing data out

    [ https://issues.apache.org/jira/browse/SPARK-20703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085017#comment-16085017 ] 

Liang-Chi Hsieh edited comment on SPARK-20703 at 7/13/17 5:28 AM:
------------------------------------------------------------------

Thanks [~stevel@apache.org] for voicing this.

Regarding the latency on the object store, I am not sure which Committer implementation you use for the object store (S3?). I take the following S3 Committer as an example:

https://github.com/rdblue/s3committer/blob/44d41d475488edf60f5dbbe2224c0ef9227e55dc/src/main/java/com/netflix/bdp/s3/S3MultipartOutputCommitter.java#L213

The temp file for a task in Spark's FileCommitProtocol would use the work path of the S3 Committer as the staging path, which is a path on the local FS.

So I assume the latency would only appear if the staging path is not based on a local path?

For the FNFE issue, the call to getFileSize currently happens after the current OutputWriter is closed. I agree that we should catch all IOEs in getFileSize, so a patch is welcome. I am also curious about the cases where the file is not there: do we have another chance to materialize the file later? Otherwise, should the later commit of the task fail if the temp file is not there?
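
As a rough illustration of what catching all IOEs in getFileSize could look like, here is a minimal sketch. It is not Spark's code: the class name SafeFileSize, the OptionalLong return type, and the use of java.nio.file instead of Hadoop's FileSystem API are all assumptions made to keep the example self-contained.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalLong;

// Hypothetical helper, not Spark code. Spark would go through Hadoop's
// FileSystem API; java.nio.file is used here only to stay self-contained.
final class SafeFileSize {
    private SafeFileSize() {}

    // Returns the file size in bytes, or an empty OptionalLong when the
    // size cannot be read (for example because the file is not there).
    static OptionalLong getFileSize(Path path) {
        try {
            return OptionalLong.of(Files.size(path));
        } catch (IOException e) {
            // java.nio reports a missing file as NoSuchFileException,
            // which is a subclass of IOException, so it is caught here
            // together with every other I/O failure.
            return OptionalLong.empty();
        }
    }
}
```

With this shape, a missing temp file only means that no size is reported for it. Whether the task commit should still fail later in that case is the open question above and is not decided by this sketch.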

> Add an operator for writing data out
> ------------------------------------
>
>                 Key: SPARK-20703
>                 URL: https://issues.apache.org/jira/browse/SPARK-20703
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Reynold Xin
>            Assignee: Liang-Chi Hsieh
>             Fix For: 2.3.0
>
>
> We should add an operator for writing data out. Right now in the explain plan / UI there is no way to tell whether a query is writing data out, and also there is no way to associate metrics with data writes. It'd be tremendously valuable to do this for adding metrics and for visibility.


