Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2014/10/08 21:34:57 UTC

[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/2716

    [SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling

    This patch tries to infer the schema for an RDD whose first row contains empty values (None, [], {}). It inspects the first 100 rows and merges their types into a schema. If the schema still contains NullType, it shows a warning telling the user to retry with sampling.
    
    If a sampling ratio is given, it infers the schema from all the rows in the sample.
    
    Also adds samplingRatio to jsonFile() and jsonRDD().
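
The merging step described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in the patch: the names `NULL`, `infer_type`, `merge_types`, and `infer_schema` are hypothetical, rows are plain dicts, and type names are strings rather than Spark SQL types.

```python
import warnings

NULL = "null"  # hypothetical placeholder type for None, [] and {}


def infer_type(value):
    """Return a type name for one value; empty values give NULL."""
    if value is None or value == [] or value == {}:
        return NULL
    return type(value).__name__


def merge_types(a, b):
    """Merge two type names; NULL yields to any concrete type."""
    if a == NULL:
        return b
    if b == NULL or a == b:
        return a
    raise TypeError("cannot merge types %s and %s" % (a, b))


def infer_schema(rows, limit=100):
    """Infer {field: type name} from at most `limit` leading rows."""
    schema = {}
    for row in rows[:limit]:
        for name, value in row.items():
            schema[name] = merge_types(schema.get(name, NULL), infer_type(value))
    # Warn when a field never received a concrete type.
    unresolved = sorted(n for n, t in schema.items() if t == NULL)
    if unresolved:
        warnings.warn("fields %s have no concrete type; try sampling" % unresolved)
    return schema


# The first row is empty everywhere; the second row supplies the types.
rows = [{"name": None, "tags": []}, {"name": "a", "tags": ["x"]}]
schema = infer_schema(rows)
```

In this sketch a field that is empty in every inspected row stays `NULL` and triggers the warning, which mirrors the "try with sampling" hint in the description.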

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark infer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2716.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2716
    
----
commit 3603e00852f94568523bc641c756cde881616017
Author: Davies Liu <da...@gmail.com>
Date:   2014-10-08T19:29:00Z

    take more rows to infer schema, or infer the schema by sampling the RDD

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-60554724
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22273/
    Test PASSed.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-60550377
  
    @marmbrus fixed, thanks!



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-61435441
  
      [Test build #22792 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22792/consoleFull) for   PR 2716 at commit [`e678f6d`](https://github.com/apache/spark/commit/e678f6d856a31231c9bff4381a19bdf3cd6166e2).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by nchammas <gi...@git.apache.org>.

Github user nchammas commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58561205
  
    > Perhaps we can extract the schema merging code from JSON and use it here as well after inferring the per row schema in python.
    
    That sounds good to me! Schema inference is a powerful feature, and it would be good to see it leverage this type of schema merging logic wherever possible.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58559895
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/304/consoleFull) for   PR 2716 at commit [`29e94d5`](https://github.com/apache/spark/commit/29e94d5764d6b9d1877fd16a9041f6b0ad61b347).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-60475531
  
      [Test build #433 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/433/consoleFull) for   PR 2716 at commit [`9767b27`](https://github.com/apache/spark/commit/9767b27e89acb18697d32d38a609074ab98b59ef).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58611800
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21572/consoleFull) for   PR 2716 at commit [`e48d7fb`](https://github.com/apache/spark/commit/e48d7fb0800946a50922caae0062805d0fd4c371).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-60550499
  
      [Test build #22273 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22273/consoleFull) for   PR 2716 at commit [`567dc60`](https://github.com/apache/spark/commit/567dc60d7ce2c43ec7c1e24e47dc515ab5056ac0).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-61438114
  
      [Test build #508 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/508/consoleFull) for   PR 2716 at commit [`34b5c63`](https://github.com/apache/spark/commit/34b5c63323b0d3928a907558d8ba5f310560534e).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds the following public classes _(experimental)_:
      * `class NullType(PrimitiveType):`




[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2716#issuecomment-58577324

It would be cool to reuse the schema merging code from JsonRDD, but I have some concerns:

1. The schema merging logic is designed specifically for JSON; for example, it converts all conflicting types into StringType. This works for JSON, but it is not the best solution for a Python RDD.

For example, if the user has two rows:
```
{a: 2}
{a: 'abc'}
```
JsonRDD will infer the type of `a` as StringType, and the user will then get a runtime exception when trying to access it as IntType.

SchemaRDD expects every value in a column to have the same type, so I'd like to raise the exception during inference, not at runtime. It's better to let the user clean up the inconsistent data types before inferring the schema.

2. It cannot handle MapType (this can be added).

3. Most of the code in JsonRDD infers types and converts objects into the inferred types; both can only be done in Python. The merging part is not that large, and we may also want different (stricter) behavior in Python.

If the user really wants the behavior that JsonRDD provides, it's easy to call
```
sqlContext.jsonRDD(rdd.map(json.dumps))
```

Do these seem reasonable?
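
The two conflict policies contrasted in point 1 can be compared side by side in a small plain-Python sketch. It is not taken from JsonRDD or from this patch: `merge_strict`, `merge_lenient`, and `column_type` are hypothetical helpers, and Python type names stand in for Spark SQL types.

```python
def merge_strict(a, b):
    """Raise as soon as two rows disagree on a column's type."""
    if a != b:
        raise TypeError("conflicting types: %s and %s" % (a, b))
    return a


def merge_lenient(a, b):
    """JSON-style policy: fall back to a string type on conflict."""
    return a if a == b else "str"


def column_type(rows, name, merge):
    """Fold `merge` over the Python type names seen in column `name`."""
    result = type(rows[0][name]).__name__
    for row in rows[1:]:
        result = merge(result, type(row[name]).__name__)
    return result


rows = [{"a": 2}, {"a": "abc"}]

# The lenient policy silently widens the column to a string type.
lenient = column_type(rows, "a", merge_lenient)

# The strict policy reports the problem while the schema is being inferred.
try:
    column_type(rows, "a", merge_strict)
    strict_error = None
except TypeError as exc:
    strict_error = str(exc)
```

With the strict policy the mismatch surfaces at inference time, which is the behavior argued for above; the lenient policy defers the surprise to whoever later reads the column as a number.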


[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-61194470
  
      [Test build #495 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/495/consoleFull) for   PR 2716 at commit [`34b5c63`](https://github.com/apache/spark/commit/34b5c63323b0d3928a907558d8ba5f310560534e).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58615990
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21572/consoleFull) for   PR 2716 at commit [`e48d7fb`](https://github.com/apache/spark/commit/e48d7fb0800946a50922caae0062805d0fd4c371).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class NullType(PrimitiveType):`




[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58556413
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21542/
    Test FAILed.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58415442
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21481/consoleFull) for   PR 2716 at commit [`f93fd84`](https://github.com/apache/spark/commit/f93fd84ce4ce7fd69ba8e12ff5343cd46116f78d).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58424351
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21480/consoleFull) for   PR 2716 at commit [`3603e00`](https://github.com/apache/spark/commit/3603e00852f94568523bc641c756cde881616017).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class NullType(DataType):`




[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-61435364
  
    @marmbrus waiting for jenkins



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-61439288
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22792/
    Test PASSed.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-61440930
  
    @marmbrus It's ready to go



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2716



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58414816
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21480/consoleFull) for   PR 2716 at commit [`3603e00`](https://github.com/apache/spark/commit/3603e00852f94568523bc641c756cde881616017).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58566913
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/304/consoleFull) for   PR 2716 at commit [`29e94d5`](https://github.com/apache/spark/commit/29e94d5764d6b9d1877fd16a9041f6b0ad61b347).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class NullType(PrimitiveType):`




[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-61033023
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22496/
    Test FAILed.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58615814
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/344/consoleFull) for   PR 2716 at commit [`29e94d5`](https://github.com/apache/spark/commit/29e94d5764d6b9d1877fd16a9041f6b0ad61b347).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class NullType(PrimitiveType):`




[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-61314256
  
    Ping dashboard



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58424360
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21480/
    Test PASSed.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-60554714
  
      [Test build #22273 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22273/consoleFull) for   PR 2716 at commit [`567dc60`](https://github.com/apache/spark/commit/567dc60d7ce2c43ec7c1e24e47dc515ab5056ac0).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class NullType(PrimitiveType):`




[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2716#discussion_r19384056
  
    --- Diff: python/pyspark/sql.py ---
    @@ -995,19 +1038,22 @@ def registerFunction(self, name, f, returnType=StringType()):
                                           self._sc._javaAccumulator,
                                           returnType.json())
     
    -    def inferSchema(self, rdd):
    +    def inferSchema(self, rdd, samplingRatio=None):
             """Infer and apply a schema to an RDD of L{Row}.
     
    -        We peek at the first row of the RDD to determine the fields' names
    -        and types. Nested collections are supported, which include array,
    -        dict, list, Row, tuple, namedtuple, or object.
    +        If `samplingRatio` is presented, it infer schema by all of the sampled
    +        dataset.
     
    -        All the rows in `rdd` should have the same type with the first one,
    -        or it will cause runtime exceptions.
    +        Otherwise, it peeks first few rows of the RDD to determine the fields'
    +        names and types. Nested collections are supported, which include array,
    +        dict, list, Row, tuple, namedtuple, or object.
     
             Each row could be L{pyspark.sql.Row} object or namedtuple or objects,
             using dict is deprecated.
     
    +        If some of rows has different types with inferred types, it may cause
    +        runtime exceptions.
    --- End diff --
    
    When `samplingRatio` is specified, the schema is inferred by looking at the types of each row in the sampled dataset.  Otherwise, the first 100 rows of the RDD are inspected. Nested collections are supported, which can include array, dict, list, Row, tuple, namedtuple, or object.
    
    Each row could be L{pyspark.sql.Row} object or namedtuple or objects.  Using top level dicts is deprecated, as this datatype is used to represent Maps.
     
    If a single column has multiple distinct inferred types, it may cause runtime exceptions.
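
The row-selection rule in the suggested docstring (a sample when `samplingRatio` is given, otherwise the first 100 rows) can be sketched without Spark. This is a plain-Python stand-in under stated assumptions: `rows_for_inference`, `sampling_ratio`, `seed`, and `first_n` are hypothetical names, and a list plays the role of the RDD.

```python
import random
from itertools import islice


def rows_for_inference(rows, sampling_ratio=None, seed=0, first_n=100):
    """Pick the rows whose types will be inspected."""
    if sampling_ratio is None:
        # No ratio given: look only at the leading rows.
        return list(islice(rows, first_n))
    if not 0.0 < sampling_ratio <= 1.0:
        raise ValueError("sampling_ratio must be in (0, 1]")
    # Ratio given: keep each row independently with that probability.
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < sampling_ratio]


data = [{"id": i} for i in range(1000)]
head = rows_for_inference(data)                         # first 100 rows
sampled = rows_for_inference(data, sampling_ratio=0.1)  # rows from anywhere
```

The sample can reach rows far beyond the first 100, which is why sampling helps when the leading rows are all empty for some field.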




[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by nchammas <gi...@git.apache.org>.

Github user nchammas commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58579969
  
    > Most of code in JsonRDD are inferring types and converting objects into inferred types, these two can only be done in Python
    
    Ah yeah, I'm assuming that automatically converting objects to the inferred type is a good thing. When a user asks for the schema to be inferred, automatic type conversion should be expected behavior.
    
    I guess my perspective is that inferring schema is most useful when the schema is not consistent or known in advance. It should not just be a convenient way to get the schema for a dataset that has a consistent schema, though it can serve that purpose too.
    
    > If user really want the behavior that JsonRDD provided, it's easy to call `sqlContext.jsonRDD(rdd.map(json.dumps))`
    > Do these seem reasonable?
    
    Definitely. That's precisely the workaround I suggested in [SPARK-2870](https://issues.apache.org/jira/browse/SPARK-2870), though it has the obvious JSON ser/de performance cost and seems "hacky".
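
For reference, the serialization half of that workaround can be sketched in plain Python. The records below are made up for illustration and a list stands in for the RDD; in PySpark the resulting strings would be what `rdd.map(json.dumps)` hands to `sqlContext.jsonRDD`.

```python
import json

# Made-up records with inconsistent fields, standing in for an RDD of dicts.
records = [
    {"id": 1, "name": "a"},
    {"id": 2, "tags": ["x", "y"]},
]

# Equivalent of rdd.map(json.dumps): one JSON string per record.
json_lines = [json.dumps(r) for r in records]

# Each string has to be parsed again on the other side, which is the
# extra serialize/deserialize round trip mentioned above.
round_tripped = [json.loads(s) for s in json_lines]
```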



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58582313
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/310/consoleFull) for   PR 2716 at commit [`29e94d5`](https://github.com/apache/spark/commit/29e94d5764d6b9d1877fd16a9041f6b0ad61b347).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class NullType(PrimitiveType):`




[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-60473680
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22197/
    Test FAILed.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-60473797
  
      [Test build #433 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/433/consoleFull) for   PR 2716 at commit [`9767b27`](https://github.com/apache/spark/commit/9767b27e89acb18697d32d38a609074ab98b59ef).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58574723
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/310/consoleFull) for   PR 2716 at commit [`29e94d5`](https://github.com/apache/spark/commit/29e94d5764d6b9d1877fd16a9041f6b0ad61b347).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-61439280
  
      [Test build #22792 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22792/consoleFull) for   PR 2716 at commit [`e678f6d`](https://github.com/apache/spark/commit/e678f6d856a31231c9bff4381a19bdf3cd6166e2).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class NullType(PrimitiveType):`
      * `        //       in some cases, such as when a class is enclosed in an object (in which case`
      * `abstract class UserDefinedType[UserType] extends DataType with Serializable `
      * `public abstract class UserDefinedType<UserType> extends DataType implements Serializable `




[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58557272
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21543/
    Test FAILed.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-60472563
  
      [Test build #22197 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22197/consoleFull) for   PR 2716 at commit [`9767b27`](https://github.com/apache/spark/commit/9767b27e89acb18697d32d38a609074ab98b59ef).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58592600
  
    Oh, I see. If inferring types and converting objects need to be done in Python, it seems it will be hard to reuse code in `JsonRDD`.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-61435474
  
      [Test build #508 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/508/consoleFull) for   PR 2716 at commit [`34b5c63`](https://github.com/apache/spark/commit/34b5c63323b0d3928a907558d8ba5f310560534e).
     * This patch **does not merge cleanly**.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-60473788
  
    failed: 
    ```
    [info] - sorting without aggregation, with spill *** FAILED ***
    [info]   java.io.FileNotFoundException: /tmp/spark-local-20141024230838-6b0e/07/temp_shuffle_79289879-f38b-46f0-9f49-99c962fca570 (No such file or directory)
    [info]   at java.io.FileOutputStream.open(Native Method)
    [info]   at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
    [info]   at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
    [info]   at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
    [info]   at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:300)
    [info]   at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:251)
    [info]   at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:83)
    [info]   at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:77)
    [info]   at org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:83)
    [info]   at org.apache.spark.util.collection.ExternalSorter.maybeSpillCollection(ExternalSorter.scala:238)
    ```



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58449107
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21489/
    Test PASSed.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-60491420
  
      [Test build #451 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/451/consoleFull) for   PR 2716 at commit [`9767b27`](https://github.com/apache/spark/commit/9767b27e89acb18697d32d38a609074ab98b59ef).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-61033022
  
      [Test build #22496 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22496/consoleFull) for   PR 2716 at commit [`34b5c63`](https://github.com/apache/spark/commit/34b5c63323b0d3928a907558d8ba5f310560534e).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class AddWebUIFilter(filterName:String, filterParams: Map[String, String], proxyBase: String)`
      * `  case class RequestExecutors(requestedTotal: Int) extends CoarseGrainedClusterMessage`
      * `  case class KillExecutors(executorIds: Seq[String]) extends CoarseGrainedClusterMessage`
      * `class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val actorSystem: ActorSystem)`
      * `class NullType(PrimitiveType):`




[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-60495230
  
      [Test build #451 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/451/consoleFull) for   PR 2716 at commit [`9767b27`](https://github.com/apache/spark/commit/9767b27e89acb18697d32d38a609074ab98b59ef).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class NullType(PrimitiveType):`




[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-61550257
  
    Thanks!  Merged to master.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58424997
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21481/consoleFull) for   PR 2716 at commit [`f93fd84`](https://github.com/apache/spark/commit/f93fd84ce4ce7fd69ba8e12ff5343cd46116f78d).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class NullType(DataType):`




[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58449106
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21489/consoleFull) for   PR 2716 at commit [`540d1d5`](https://github.com/apache/spark/commit/540d1d5ecfc1a3678e453bcfacd77773bbeac4ce).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class NullType(DataType):`




[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58560466
  
    @davies thanks for working on this!  I'm always a little embarrassed when I have to explain how the current code works :)
    
    @yhuai, can you take a look at this?  Perhaps we can extract the schema merging code from JSON and use it here as well after inferring the per row schema in python.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-61429974
  
    Sorry for the delay here.  If you can merge I'll try to squeeze this into 1.2.  Thanks!



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-60473675
  
      [Test build #22197 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22197/consoleFull) for   PR 2716 at commit [`9767b27`](https://github.com/apache/spark/commit/9767b27e89acb18697d32d38a609074ab98b59ef).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-61203599
  
      [Test build #495 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/495/consoleFull) for   PR 2716 at commit [`34b5c63`](https://github.com/apache/spark/commit/34b5c63323b0d3928a907558d8ba5f310560534e).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-60534136
  
    Minor comment on documentation wording.  Otherwise this LGTM!



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58444525
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21489/consoleFull) for   PR 2716 at commit [`540d1d5`](https://github.com/apache/spark/commit/540d1d5ecfc1a3678e453bcfacd77773bbeac4ce).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-61026632
  
      [Test build #22496 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22496/consoleFull) for   PR 2716 at commit [`34b5c63`](https://github.com/apache/spark/commit/34b5c63323b0d3928a907558d8ba5f310560534e).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58440322
  
    @nchammas This PR only fixes the problem of having empty values in the first few rows; it cannot handle different types for one field (as json() does).
    
    Maybe we could support optional fields.
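
To make the limitation concrete, here is a small plain-Python sketch (not code from this PR) that scans rows and reports fields whose non-empty values have more than one Python type. `find_conflicting_fields` is a hypothetical helper name, and the rows are made up for illustration.

```python
from collections import defaultdict


def find_conflicting_fields(rows):
    """Return {field: sorted type names} for fields seen with more than one type."""
    seen = defaultdict(set)
    for row in rows:
        for name, value in row.items():
            if value is not None:  # None is treated as compatible with any type
                seen[name].add(type(value).__name__)
    return {name: sorted(types) for name, types in seen.items() if len(types) > 1}


# "a" changes type between rows; "b" is None in one row and absent in another.
rows = [{"a": 1, "b": "x"}, {"a": "1", "b": None}, {"a": 2.0}]
conflicts = find_conflicting_fields(rows)
```

In this sketch a field that is missing or None in some rows, like `b`, is not reported as a conflict, which is the kind of case optional fields would cover.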



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58615995
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21572/
    Test PASSed.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58611722
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/344/consoleFull) for   PR 2716 at commit [`29e94d5`](https://github.com/apache/spark/commit/29e94d5764d6b9d1877fd16a9041f6b0ad61b347).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58425002
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21481/
    Test PASSed.



[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

Posted by nchammas <gi...@git.apache.org>.

Github user nchammas commented on the pull request:

    https://github.com/apache/spark/pull/2716#issuecomment-58437772
  
    @davies I believe this PR also relates to the features discussed in [SPARK-2870](https://issues.apache.org/jira/browse/SPARK-2870). Since you are already doing schema inference over multiple rows with an optional sampling ratio in this PR, how far are we from being able to do full schema inference on RDDs of `dict`s?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org