
Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2015/02/10 08:48:28 UTC

[GitHub] spark pull request: [SPARK-5704] [SQL] [PySpark] createDataFrame f...

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/4498

    [SPARK-5704] [SQL] [PySpark] createDataFrame from RDD with columns

    Deprecate inferSchema() and applySchema() in favor of createDataFrame(), which takes an optional `schema` to create a DataFrame from an RDD. The `schema` can be a StructType or a list of column names.
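
As a rough illustration of the dispatch described above (not the PySpark implementation), the following pure-Python sketch shows how an optional `schema` argument could be resolved when it is absent, a list of column names, or a full schema object. `FakeStructType`, `plan_schema`, and the positional `_1`, `_2`, ... default names are hypothetical stand-ins chosen for this example.

```python
class FakeStructType:
    """Hypothetical stand-in for a full schema object (not pyspark's StructType)."""

    def __init__(self, names):
        self.names = list(names)


def plan_schema(schema, num_fields):
    """Return (column_names, needs_type_inference) for the given `schema`."""
    if isinstance(schema, FakeStructType):
        # A full schema was supplied, so nothing has to be inferred.
        return schema.names, False
    if schema is None:
        # No names given: fall back to positional names (assumed convention).
        return ["_%d" % (i + 1) for i in range(num_fields)], True
    if isinstance(schema, (list, tuple)):
        if len(schema) != num_fields:
            raise ValueError(
                "expected %d column names, got %d" % (num_fields, len(schema)))
        # Only names were given: column types still have to be inferred.
        return list(schema), True
    raise TypeError("schema must be a FakeStructType, a list of names, or None")
```

In this sketch a list of names and a missing schema both leave the column types to be inferred from the data; only a full schema object skips inference.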

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark create

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4498.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4498
    
----
commit 9526e97ee375739771a6bf51014cc7dd8b920aaf
Author: Davies Liu <da...@databricks.com>
Date:   2015-02-10T07:41:57Z

    createDataFrame from RDD with columns

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5704] [SQL] [PySpark] createDataFrame f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4498#issuecomment-73786529
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27229/
    Test FAILed.



[GitHub] spark pull request: [SPARK-5704] [SQL] [PySpark] createDataFrame f...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4498#discussion_r24466017
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -264,8 +262,81 @@ class SQLContext(@transient val sparkContext: SparkContext)
       }
     
       @DeveloperApi
    -  def applySchema(rowRDD: JavaRDD[Row], schema: StructType): DataFrame = {
    -    applySchema(rowRDD.rdd, schema);
    +  def createDataFrame(rowRDD: JavaRDD[Row], schema: StructType): DataFrame = {
    +    createDataFrame(rowRDD.rdd, schema)
    +  }
    +
    +  /**
    +   * Creates a [[DataFrame]] from an [[RDD]] containing [[Row]]s by applying
    +   * a seq of names of columns to this RDD, the data type for each column will
    +   * be inferred by the first row.
    +   *
    +   * It does not support nested StructType, use createDataFrame(rdd, schema) instead.
    +   *
    +   * For example:
    +   *
    +   * {{{
    +   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +   *
    +   *  val people = sc.textFile("examples/src/main/resources/people.txt").map(
    +   *      _.split(",")).map(p => Row(p(0), p(1).trim.toInt))
    +   *  val dataFrame = sqlContext.createDataFrame(people, Seq("name", "age"))
    +   *  dataFrame.printSchema
    +   *  // root
    +   *  // |-- name: string (nullable = false)
    +   *  // |-- age: integer (nullable = true)
    +   * }}}
    +   *
    +   * @param rowRDD an RDD of Row
    +   * @param columns names for each column
    +   * @return DataFrame
    +   */
    +  def createDataFrame(rowRDD: RDD[Row], columns: Seq[String]): DataFrame = {
    +    def inferType: PartialFunction[Any, DataType] = ScalaReflection.typeOfObject orElse {
    +      case map: Map[_, _] =>
    --- End diff --
    
    Is the `Map` here a `scala.collection.Map` or `Predef.Map` (`scala.collection.immutable.Map`)? (We should use `scala.collection.Map` here.)



[GitHub] spark pull request: [WIP] [SPARK-5704] [SQL] [PySpark] createDataF...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4498#discussion_r24419929
  
    --- Diff: python/pyspark/sql/context.py ---
    @@ -186,7 +189,7 @@ def inferSchema(self, rdd, samplingRatio=None):
                         warnings.warn("Some of types cannot be determined by the "
                                       "first 100 rows, please try again with sampling")
             else:
    -            if samplingRatio > 0.99:
    +            if samplingRatio < 0.99:
    --- End diff --
    
    Oh, it seems the code in master is wrong.
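
To make the intent of the corrected condition concrete, here is a small pure-Python sketch (not the PySpark code). It assumes, as the diff suggests, that a ratio of 0.99 or above is treated as "use every row" and only a smaller ratio triggers sampling. `rows_for_inference`, its `first_n` and `seed` parameters, and the use of `random.Random` are hypothetical, and the no-ratio branch is simplified to "look at the first rows only".

```python
import random


def rows_for_inference(rows, sampling_ratio=None, first_n=100, seed=0):
    """Pick the rows whose types would be inspected for schema inference."""
    if sampling_ratio is None:
        # No ratio given: look only at the first few rows.
        return rows[:first_n]
    if sampling_ratio < 0.99:
        # Ratio meaningfully below 1: keep each row with that probability.
        rng = random.Random(seed)
        return [r for r in rows if rng.random() < sampling_ratio]
    # Ratio is effectively 1: sampling would add nothing, so use every row.
    return list(rows)
```

With the comparison written as `> 0.99`, the sketch would sample only when the ratio is close to 1 and scan everything otherwise, which is the opposite of what a sampling ratio is for.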



[GitHub] spark pull request: [SPARK-5704] [SQL] [PySpark] createDataFrame f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4498#issuecomment-73786518
  
      [Test build #27229 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27229/consoleFull) for   PR 4498 at commit [`d1bd8f2`](https://github.com/apache/spark/commit/d1bd8f2ff36a40ca4d46a31098ebe160bae56034).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [WIP] [SPARK-5704] [SQL] [PySpark] createDataF...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4498#issuecomment-73657904
  
      [Test build #27188 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27188/consoleFull) for   PR 4498 at commit [`9526e97`](https://github.com/apache/spark/commit/9526e97ee375739771a6bf51014cc7dd8b920aaf).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-5704] [SQL] [PySpark] createDataFrame f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4498#issuecomment-73828122
  
      [Test build #27255 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27255/consoleFull) for   PR 4498 at commit [`08469c1`](https://github.com/apache/spark/commit/08469c1ae33e5fab54748098c03ce096a50e2404).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-5704] [SQL] [PySpark] createDataFrame f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4498#issuecomment-73802840
  
      [Test build #596 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/596/consoleFull) for   PR 4498 at commit [`d1bd8f2`](https://github.com/apache/spark/commit/d1bd8f2ff36a40ca4d46a31098ebe160bae56034).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-5704] [SQL] [PySpark] createDataFrame f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4498#issuecomment-73815361
  
      [Test build #596 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/596/consoleFull) for   PR 4498 at commit [`d1bd8f2`](https://github.com/apache/spark/commit/d1bd8f2ff36a40ca4d46a31098ebe160bae56034).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [WIP] [SPARK-5704] [SQL] [PySpark] createDataF...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4498#discussion_r24396211
  
    --- Diff: python/pyspark/sql/context.py ---
    @@ -186,7 +189,7 @@ def inferSchema(self, rdd, samplingRatio=None):
                         warnings.warn("Some of types cannot be determined by the "
                                       "first 100 rows, please try again with sampling")
             else:
    -            if samplingRatio > 0.99:
    +            if samplingRatio < 0.99:
    --- End diff --
    
    isn't this change wrong? sampling should be off if ratio > 0.99?



[GitHub] spark pull request: [WIP] [SPARK-5704] [SQL] [PySpark] createDataF...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4498#discussion_r24435747
  
    --- Diff: python/pyspark/sql/context.py ---
    @@ -287,6 +293,62 @@ def applySchema(self, rdd, schema):
             df = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
             return DataFrame(df, self)
     
    +    def createDataFrame(self, rdd, schema=None, samplingRatio=None):
    --- End diff --
    
    @rxin Is there a better name for this? `createDataFrame` is still too long (longer than `applySchema`).



[GitHub] spark pull request: [SPARK-5704] [SQL] [PySpark] createDataFrame f...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4498#issuecomment-73810578
  
    LGTM. @yhuai please take a look at the type inference stuff.



[GitHub] spark pull request: [WIP] [SPARK-5704] [SQL] [PySpark] createDataF...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/4498#issuecomment-73657967
  
    @marmbrus @yhuai Does this work for you?  cc @rxin



[GitHub] spark pull request: [SPARK-5704] [SQL] [PySpark] createDataFrame f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4498#issuecomment-73777202
  
      [Test build #27229 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27229/consoleFull) for   PR 4498 at commit [`d1bd8f2`](https://github.com/apache/spark/commit/d1bd8f2ff36a40ca4d46a31098ebe160bae56034).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-5704] [SQL] [PySpark] createDataFrame f...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/4498#issuecomment-73821969
  
    @pwendell I think this PR is ready to go, whether or not we wait for Jenkins. (The last commit just removes an API and the test for it.)



[GitHub] spark pull request: [SPARK-5704] [SQL] [PySpark] createDataFrame f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4498#issuecomment-73816113
  
      [Test build #27239 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27239/consoleFull) for   PR 4498 at commit [`c80a7a9`](https://github.com/apache/spark/commit/c80a7a91427a61e3e4a53cdb73eb8ede86eb5ab8).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [WIP] [SPARK-5704] [SQL] [PySpark] createDataF...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4498#discussion_r24396268
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -264,8 +262,80 @@ class SQLContext(@transient val sparkContext: SparkContext)
       }
     
       @DeveloperApi
    -  def applySchema(rowRDD: JavaRDD[Row], schema: StructType): DataFrame = {
    -    applySchema(rowRDD.rdd, schema);
    +  def createDataFrame(rowRDD: JavaRDD[Row], schema: StructType): DataFrame = {
    +    createDataFrame(rowRDD.rdd, schema)
    +  }
    +
    +  /**
    +   * Creates a [[DataFrame]] from an [[RDD]] containing [[Row]]s by applying
    +   * a seq of names of columns to this RDD, the data type for each column will
    +   * be inferred by the first row.
    +   *
    +   * @param rowRDD an RDD of Row
    +   * @param columns names for each column
    +   * @return DataFrame
    +   */
    +  def createDataFrame(rowRDD: RDD[Row], columns: Seq[String]): DataFrame = {
    --- End diff --
    
    we might want to use the same logic defined in jsonrdd?



[GitHub] spark pull request: [WIP] [SPARK-5704] [SQL] [PySpark] createDataF...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4498#issuecomment-73668328
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27188/
    Test FAILed.



[GitHub] spark pull request: [SPARK-5704] [SQL] [PySpark] createDataFrame f...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/4498#issuecomment-73802908
  
    @rxin this should be ready, please take another look.



[GitHub] spark pull request: [SPARK-5704] [SQL] [PySpark] createDataFrame f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4498#issuecomment-73816124
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27239/
    Test PASSed.



[GitHub] spark pull request: [WIP] [SPARK-5704] [SQL] [PySpark] createDataF...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4498#discussion_r24429007
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -264,8 +262,80 @@ class SQLContext(@transient val sparkContext: SparkContext)
       }
     
       @DeveloperApi
    -  def applySchema(rowRDD: JavaRDD[Row], schema: StructType): DataFrame = {
    -    applySchema(rowRDD.rdd, schema);
    +  def createDataFrame(rowRDD: JavaRDD[Row], schema: StructType): DataFrame = {
    +    createDataFrame(rowRDD.rdd, schema)
    +  }
    +
    +  /**
    +   * Creates a [[DataFrame]] from an [[RDD]] containing [[Row]]s by applying
    +   * a seq of names of columns to this RDD, the data type for each column will
    +   * be inferred by the first row.
    +   *
    +   * @param rowRDD an RDD of Row
    +   * @param columns names for each column
    +   * @return DataFrame
    +   */
    +  def createDataFrame(rowRDD: RDD[Row], columns: Seq[String]): DataFrame = {
    --- End diff --
    
    Seems our comment needs to say that it cannot support inner structs.



[GitHub] spark pull request: [WIP] [SPARK-5704] [SQL] [PySpark] createDataF...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4498#discussion_r24434802
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -264,8 +262,80 @@ class SQLContext(@transient val sparkContext: SparkContext)
       }
     
       @DeveloperApi
    -  def applySchema(rowRDD: JavaRDD[Row], schema: StructType): DataFrame = {
    -    applySchema(rowRDD.rdd, schema);
    +  def createDataFrame(rowRDD: JavaRDD[Row], schema: StructType): DataFrame = {
    +    createDataFrame(rowRDD.rdd, schema)
    +  }
    +
    +  /**
    +   * Creates a [[DataFrame]] from an [[RDD]] containing [[Row]]s by applying
    +   * a seq of names of columns to this RDD, the data type for each column will
    +   * be inferred by the first row.
    +   *
    +   * @param rowRDD an RDD of Row
    +   * @param columns names for each column
    +   * @return DataFrame
    +   */
    +  def createDataFrame(rowRDD: RDD[Row], columns: Seq[String]): DataFrame = {
    --- End diff --
    
    @yhuai `ScalaReflection.typeOfObject` would help a lot, but the other logic in JsonRDD does not fit well here.
    
    I think this API will be useful when a user has an RDD of Tuple/Seq, which can easily be converted into an RDD of Row, but doesn't want to figure out the SQL types manually.
    
    I haven't figured out the right scope of this API yet. For example, it cannot support nested StructType; should it support MapType and ArrayType? Maybe the basic version is useful enough, and the advanced version (using StructType) will cover the other, rarer cases.
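
The "basic version" discussed here, naming the columns and inferring each type from the first row while rejecting nested values, can be sketched in pure Python as follows. This is only an illustration under assumptions: `infer_flat_schema`, the `_SIMPLE_TYPES` table, and the type-name strings are hypothetical and are not part of Spark.

```python
# Coarse, illustrative mapping from Python types to SQL-like type names.
_SIMPLE_TYPES = {bool: "boolean", int: "long", float: "double", str: "string"}


def infer_flat_schema(rows, columns):
    """Infer (name, type_name, nullable) triples from the first row only."""
    if not rows:
        raise ValueError("cannot infer a schema from an empty dataset")
    first = rows[0]
    if len(first) != len(columns):
        raise ValueError("row has %d fields but %d column names were given"
                         % (len(first), len(columns)))
    fields = []
    for name, value in zip(columns, first):
        if isinstance(value, (tuple, list, dict)):
            # Nested values are out of scope here; a full schema is needed.
            raise TypeError(
                "column %r holds a nested value; pass a full schema" % name)
        if type(value) not in _SIMPLE_TYPES:
            # Covers None and any other value the table does not know about.
            raise TypeError("cannot infer a type for column %r" % name)
        fields.append((name, _SIMPLE_TYPES[type(value)], True))
    return fields
```

Because only the first row is inspected, a `None` in that row leaves the column type undetermined; that is one of the cases where an explicit schema is the safer choice.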



[GitHub] spark pull request: [SPARK-5704] [SQL] [PySpark] createDataFrame f...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4498



[GitHub] spark pull request: [WIP] [SPARK-5704] [SQL] [PySpark] createDataF...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4498#discussion_r24429599
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -264,8 +262,80 @@ class SQLContext(@transient val sparkContext: SparkContext)
       }
     
       @DeveloperApi
    -  def applySchema(rowRDD: JavaRDD[Row], schema: StructType): DataFrame = {
    -    applySchema(rowRDD.rdd, schema);
    +  def createDataFrame(rowRDD: JavaRDD[Row], schema: StructType): DataFrame = {
    +    createDataFrame(rowRDD.rdd, schema)
    +  }
    +
    +  /**
    +   * Creates a [[DataFrame]] from an [[RDD]] containing [[Row]]s by applying
    +   * a seq of names of columns to this RDD, the data type for each column will
    +   * be inferred by the first row.
    +   *
    +   * @param rowRDD an RDD of Row
    +   * @param columns names for each column
    +   * @return DataFrame
    +   */
    +  def createDataFrame(rowRDD: RDD[Row], columns: Seq[String]): DataFrame = {
    --- End diff --
    
    If nested structures are not supported, you can use `typeOfObject` in `ScalaReflection` to simplify the logic here. If we want to support nested structures, it would be better to generalize the logic used by JSON and use that here.



[GitHub] spark pull request: [SPARK-5704] [SQL] [PySpark] createDataFrame f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4498#issuecomment-73803379
  
      [Test build #27239 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27239/consoleFull) for   PR 4498 at commit [`c80a7a9`](https://github.com/apache/spark/commit/c80a7a91427a61e3e4a53cdb73eb8ede86eb5ab8).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-5704] [SQL] [PySpark] createDataFrame f...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/4498#issuecomment-73821777
  
    After talking to @yhuai offline, he suggested that we hold off on the Scala API for createDataFrame(rdd, columns), since it's not so useful right now. We can revisit it later.



[GitHub] spark pull request: [WIP] [SPARK-5704] [SQL] [PySpark] createDataF...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4498#discussion_r24429187
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -264,8 +262,80 @@ class SQLContext(@transient val sparkContext: SparkContext)
       }
     
       @DeveloperApi
    -  def applySchema(rowRDD: JavaRDD[Row], schema: StructType): DataFrame = {
    -    applySchema(rowRDD.rdd, schema);
    +  def createDataFrame(rowRDD: JavaRDD[Row], schema: StructType): DataFrame = {
    +    createDataFrame(rowRDD.rdd, schema)
    +  }
    +
    +  /**
    +   * Creates a [[DataFrame]] from an [[RDD]] containing [[Row]]s by applying
    +   * a seq of names of columns to this RDD, the data type for each column will
    +   * be inferred by the first row.
    +   *
    +   * @param rowRDD an RDD of Row
    +   * @param columns names for each column
    +   * @return DataFrame
    +   */
    +  def createDataFrame(rowRDD: RDD[Row], columns: Seq[String]): DataFrame = {
    --- End diff --
    
    Can you give some use cases for this interface? I feel if we ask users to create Row objects, they should know the types of fields in Rows and create the StructType representing the schema.
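
    To make the trade-off raised in this comment concrete, here is a minimal plain-Python sketch. It is not Spark code and does not reflect how Spark infers types; `Field` and `infer_fields` are hypothetical names introduced only for illustration. It shows why inferring column types from the first row is fragile compared with a schema the caller states explicitly.

```python
from collections import namedtuple

# Hypothetical stand-in for a schema field; this is NOT a Spark class.
Field = namedtuple("Field", ["name", "type_name"])


def infer_fields(first_row, columns):
    """Pair each column name with the Python type name of the first row's value."""
    if len(first_row) != len(columns):
        raise ValueError(
            "expected %d values, got %d" % (len(columns), len(first_row))
        )
    return [
        Field(name, type(value).__name__)
        for name, value in zip(columns, first_row)
    ]


rows = [(1, None), (2, "b")]

# Inference from the first row: a None there yields no usable type for "label".
inferred = infer_fields(rows[0], ["id", "label"])

# Explicit schema: the caller states the types, so the first row does not matter.
explicit = [Field("id", "int"), Field("label", "str")]

print(inferred)
print(explicit)
```

    In this sketch the inferred type of `label` depends entirely on which row happens to come first, whereas the explicit list of fields is stable regardless of the data.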



[GitHub] spark pull request: [SPARK-5704] [SQL] [PySpark] createDataFrame f...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4498#discussion_r24447035
  
    --- Diff: python/pyspark/sql/context.py ---
    @@ -287,6 +293,62 @@ def applySchema(self, rdd, schema):
             df = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
             return DataFrame(df, self)
     
    +    def createDataFrame(self, rdd, schema=None, samplingRatio=None):
    --- End diff --
    
    Talked offline, we did figure out a better name than it.



[GitHub] spark pull request: [WIP] [SPARK-5704] [SQL] [PySpark] createDataF...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4498#issuecomment-73668321
  
      [Test build #27188 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27188/consoleFull) for PR 4498 at commit [`9526e97`](https://github.com/apache/spark/commit/9526e97ee375739771a6bf51014cc7dd8b920aaf).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-5704] [SQL] [PySpark] createDataFrame f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4498#issuecomment-73828125
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27255/



[GitHub] spark pull request: [SPARK-5704] [SQL] [PySpark] createDataFrame f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4498#issuecomment-73821807
  
      [Test build #27255 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27255/consoleFull) for PR 4498 at commit [`08469c1`](https://github.com/apache/spark/commit/08469c1ae33e5fab54748098c03ce096a50e2404).
     * This patch merges cleanly.

