Posted to issues@spark.apache.org by "Joseph Batchik (JIRA)" <ji...@apache.org> on 2015/07/17 07:59:04 UTC

[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames

    [ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630820#comment-14630820 ] 

Joseph Batchik commented on SPARK-8007:
---------------------------------------

[~rxin] Reynold, I have started adding virtual columns to DataFrames and SQL queries for SPARK-8003 and SPARK-8007. My initial code is here: https://github.com/JDrit/spark/commit/e34d3a7eabbc9c41c2dd85b128b2bb5713039e40.

The one issue I ran into, though, was that the catalyst package cannot access org.apache.spark.sql.execution.expressions, where SparkPartitionID resides. For prototyping purposes I copied SparkPartitionID into the catalyst package, but I am wondering what the best way to deal with that dependency would be.

Can you let me know what you think of my changes and what else needs to be done?

> Support resolving virtual columns in DataFrames
> -----------------------------------------------
>
>                 Key: SPARK-8007
>                 URL: https://issues.apache.org/jira/browse/SPARK-8007
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>
> Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to the SparkPartitionID expression.
> A cool use case is to understand physical data skew:
> {code}
> df.groupBy("SPARK__PARTITION__ID").count()
> {code}


