Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2015/04/26 10:05:38 UTC

[jira] [Commented] (SPARK-6292) Add RDD methods to DataFrame to preserve schema

    [ https://issues.apache.org/jira/browse/SPARK-6292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512944#comment-14512944 ] 

Reynold Xin commented on SPARK-6292:
------------------------------------

We should probably create tickets explicitly for randomSplit and sampleByKey + sampleByKeyExact.



> Add RDD methods to DataFrame to preserve schema
> -----------------------------------------------
>
>                 Key: SPARK-6292
>                 URL: https://issues.apache.org/jira/browse/SPARK-6292
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>
> Users can call RDD methods on DataFrames, but the result loses the schema, which then has to be reapplied. For RDD methods which preserve the schema (such as randomSplit), DataFrame should provide versions of those methods which automatically preserve the schema.
> Here are a few I'd prioritize (for my use cases!):
> * randomSplit
> * sampleByKey + sampleByKeyExact
> ** Q: Should "key" be a single column, or should we support using a set of columns as a key?
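
To make the request above concrete, here is a minimal sketch in plain Python of what "schema-preserving" versions of randomSplit and sampleByKey could look like. It is a toy model only, not Spark's API: the Table class and the random_split and sample_by_key functions are hypothetical names invented for illustration. The sampling is a simple per-row Bernoulli draw (closer in spirit to sampleByKey than to sampleByKeyExact), and the key is taken as a list of columns, which is one possible answer to the open question about multi-column keys.

```python
import random


class Table:
    """Toy stand-in for a DataFrame: column names plus rows (hypothetical, not Spark)."""

    def __init__(self, schema, rows):
        self.schema = tuple(schema)
        self.rows = list(rows)


def random_split(table, weights, seed):
    """Split rows at random according to weights; every part keeps the schema."""
    total = float(sum(weights))
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point drift in the last boundary

    rng = random.Random(seed)
    buckets = [[] for _ in weights]
    for row in table.rows:
        x = rng.random()  # uniform in [0, 1)
        for i, bound in enumerate(bounds):
            if x < bound:
                buckets[i].append(row)
                break
    # The schema is carried over instead of being reapplied by the caller.
    return [Table(table.schema, bucket) for bucket in buckets]


def sample_by_key(table, key_columns, fractions, seed):
    """Keep each row with the probability assigned to its key.

    The key is the tuple of values in key_columns, so one or several columns
    can act as the key. Keys missing from fractions are treated as 0.0, which
    is an assumption of this sketch.
    """
    indices = [table.schema.index(name) for name in key_columns]
    rng = random.Random(seed)
    kept = []
    for row in table.rows:
        key = tuple(row[i] for i in indices)
        if rng.random() < fractions.get(key, 0.0):
            kept.append(row)
    return Table(table.schema, kept)


if __name__ == "__main__":
    data = Table(["id", "label"], [(i, i % 2) for i in range(100)])
    train, test = random_split(data, [0.7, 0.3], seed=42)
    print(train.schema, len(train.rows), len(test.rows))
    sample = sample_by_key(data, ["label"], {(0,): 0.5, (1,): 0.1}, seed=7)
    print(sample.schema, len(sample.rows))
```

Using a tuple as the key means a single-column key is just the one-element case, so the same function signature covers both options raised in the question.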


