Posted to issues@spark.apache.org by "Liang-Chi Hsieh (JIRA)" <ji...@apache.org> on 2017/05/15 03:49:04 UTC

[jira] [Comment Edited] (SPARK-20703) Add an operator for writing data out

    [ https://issues.apache.org/jira/browse/SPARK-20703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009966#comment-16009966 ] 

Liang-Chi Hsieh edited comment on SPARK-20703 at 5/15/17 3:48 AM:
------------------------------------------------------------------

I've tried out an approach locally.

Currently I wrap the query being written out in a new operator and set only the outputNumRows metric as the data is pulled out to be written. I am wondering what other metrics this new operator is supposed to have. Since the writing logic is encapsulated in different implementations, it seems to me we can't easily set other metrics.
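
To make the wrapping idea concrete, here is a toy model in plain Python, not Spark code. The names CountingWriteOperator and write_rows are hypothetical; only the outputNumRows metric name comes from the comment. The wrapper counts rows as the writer pulls them and cannot see what the writer does with them afterwards, which is roughly why other metrics are hard to set from this position.

```python
from typing import Dict, Iterable, Iterator, List


class CountingWriteOperator:
    """Hypothetical wrapper: counts rows as a writer pulls them from the child."""

    def __init__(self, child: Iterable[dict]) -> None:
        self.child = child
        self.metrics: Dict[str, int] = {"outputNumRows": 0}

    def execute(self) -> Iterator[dict]:
        # The metric is updated lazily, only when the writer consumes a row.
        for row in self.child:
            self.metrics["outputNumRows"] += 1
            yield row


def write_rows(rows: Iterable[dict], sink: List[dict]) -> None:
    """Stand-in for a writer implementation; its internals are opaque to the operator."""
    for row in rows:
        sink.append(row)


query = [{"id": i} for i in range(5)]
op = CountingWriteOperator(query)
sink: List[dict] = []
write_rows(op.execute(), sink)
print(op.metrics)
```

Anything the writer knows internally (bytes written, files created, and so on) would have to be reported back by the writer itself; the wrapper only observes the rows flowing through it.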

We have several RunnableCommand classes for writing data out.

For file-based relations, FileFormatWriter is used to write data out. We pass it a QueryExecution for the query to be written out, so we can track the write action on the executed plan of this QueryExecution.

For datasource relations, the logic of writing data out is delegated to the datasource implementations. We just pass a DataFrame to the write API. The above approach of tracking the write action will fail for datasource implementations that create a new DataFrame. JdbcRelationProvider is one example: it may create a new DataFrame by repartitioning the original one. In this case, we can't track the write action, because the plan that actually gets executed is no longer the one we wrapped.
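
The difference between the two cases can be sketched with another toy model in plain Python. Plan, Frame, file_style_write and jdbc_style_write are all hypothetical stand-ins, not Spark classes or APIs; the only thing taken from the comment is that a datasource may repartition the DataFrame it is given before writing.

```python
from typing import List, Optional


class Plan:
    """Hypothetical stand-in for an executed plan."""

    def __init__(self, name: str, child: Optional["Plan"] = None) -> None:
        self.name = name
        self.child = child


class Frame:
    """Hypothetical stand-in for a DataFrame: a thin handle on a plan."""

    def __init__(self, plan: Plan) -> None:
        self.plan = plan

    def repartition(self, n: int) -> "Frame":
        # Deriving a new frame builds a new plan object on top of the old one.
        return Frame(Plan(f"Repartition({n})", self.plan))


executed: List[Plan] = []  # plans that actually ran, in order


def file_style_write(frame: Frame) -> None:
    # Runs exactly the plan it was handed.
    executed.append(frame.plan)


def jdbc_style_write(frame: Frame) -> None:
    # Repartitions first, so a different plan is the one that runs.
    executed.append(frame.repartition(4).plan)


def was_tracked(tracked: Plan) -> bool:
    # The tracker only recognises the exact plan object it was given.
    return executed[-1] is tracked


df = Frame(Plan("Project"))
file_style_write(df)
tracked_file = was_tracked(df.plan)
jdbc_style_write(df)
tracked_jdbc = was_tracked(df.plan)
print(tracked_file, tracked_jdbc)
```

In this model the first write is recognised and the second is not, even though the original plan is still reachable as the child of the new one; the tracker simply never sees the plan that ran.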

[~rxin] Do you have any suggestions?





> Add an operator for writing data out
> ------------------------------------
>
>                 Key: SPARK-20703
>                 URL: https://issues.apache.org/jira/browse/SPARK-20703
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Reynold Xin
>
> We should add an operator for writing data out. Right now in the explain plan / UI there is no way to tell whether a query is writing data out, and also there is no way to associate metrics with data writes. It'd be tremendously valuable to do this for adding metrics and for visibility.


