Posted to reviews@spark.apache.org by yhuai <gi...@git.apache.org> on 2015/02/27 06:52:25 UTC

[GitHub] spark pull request: [SPARK-6052][SQL]In JSON schema inference, we ...

GitHub user yhuai opened a pull request:

    https://github.com/apache/spark/pull/4806

    [SPARK-6052][SQL]In JSON schema inference, we should always set containsNull of an ArrayType to true

    Always set `containsNull = true` when inferring the schema of JSON datasets. If we set `containsNull` based on the records we scanned, we may miss arrays with null values when sampling. Also, because future data can contain arrays with null values, always setting `containsNull = true` is the more robust choice when converting JSON data to Parquet.
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-6052
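
The sampling concern in the description can be illustrated with a small, self-contained sketch. This is plain Python rather than Spark code; `infer_contains_null` and the sample records are hypothetical names invented for illustration and do not correspond to anything in the patch.

```python
def infer_contains_null(sampled_records, field):
    """Toy model of data-driven inference: report whether any sampled
    array under `field` holds a null element. Not Spark's implementation."""
    return any(
        elem is None
        for rec in sampled_records
        for elem in (rec.get(field) or [])
    )

# Hypothetical dataset: only the last record has a null inside its array.
records = [
    {"tags": ["a", "b"]},
    {"tags": ["c"]},
    {"tags": ["d", None]},
]

# Pretend the sampler happened to pick only the first two records.
sample = records[:2]

inferred_from_sample = infer_contains_null(sample, "tags")   # False
inferred_from_all = infer_contains_null(records, "tags")     # True

# The approach proposed in this PR sidesteps the mismatch by not looking
# at the data at all for this flag.
always_contains_null = True
```

With the sampled subset, the inferred flag claims the arrays never contain nulls even though the full dataset does, which is the mismatch the description warns about.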

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark jsonArrayContainsNull

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4806.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4806
    
----
commit 05eab9d06b1f7f2311660aded13fac38a5b86ad4
Author: Yin Huai <yh...@databricks.com>
Date:   2015-02-27T05:47:31Z

    Change containsNull to true.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6052][SQL]In JSON schema inference, we ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4806#issuecomment-76343045
  
      [Test build #28050 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28050/consoleFull) for   PR 4806 at commit [`05eab9d`](https://github.com/apache/spark/commit/05eab9d06b1f7f2311660aded13fac38a5b86ad4).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-6052][SQL]In JSON schema inference, we ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/4806#issuecomment-76375347
  
    @yhuai Is this PR also dealing with one of the problems reported in #4729? It looks like it is. If so, can we solve the problem in the original PR, or could you comment on it there, instead of opening a new PR for each sub-problem, such as #4782 and this one? That makes the original PR hard to manage and modify. Thanks!



[GitHub] spark pull request: [SPARK-6052][SQL]In JSON schema inference, we ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4806#issuecomment-76349422
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28050/



[GitHub] spark pull request: [SPARK-6052][SQL]In JSON schema inference, we ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4806



[GitHub] spark pull request: [SPARK-6052][SQL]In JSON schema inference, we ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/4806#issuecomment-76730006
  
    I think your suggestion to remove nullability completely could be considered if it turns out to be of no use at all.



[GitHub] spark pull request: [SPARK-6052][SQL]In JSON schema inference, we ...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/4806#issuecomment-76731541
  
    Yeah, we introduced it for potential optimizations, but it seems to be causing more trouble. We decided to ignore nullability in the Parquet and JSON data sources because this seems to make more sense for most scenarios, especially when dealing with "dirty" datasets.
    
    However, completely ignoring nullability in Spark SQL also means that we lose part of the schema information, which affects data sources like Avro, ProtocolBuffer, and Thrift. Not quite sure whether this is a good idea for now...



[GitHub] spark pull request: [SPARK-6052][SQL]In JSON schema inference, we ...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/4806#issuecomment-76729418
  
    Merging to master and branch-1.3, thanks!



[GitHub] spark pull request: [SPARK-6052][SQL]In JSON schema inference, we ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/4806#issuecomment-76413623
  
    Besides, I think it is odd to set `containsNull` manually during JSON schema inference. Sampling should not be the issue, because one could equally argue that we may miss arrays with different column types.
    
    So the main point is still the problem of inserting JSON data into a Parquet data source table. In #4729 I just copy the schema of the JSON data, modify its `containsNull`, and use the copy for the insertion, without actually modifying the schema of the JSON data.
    
    Both solutions pass the unit test. @liancheng @yhuai, you can decide which one is more appropriate.
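
The copy-and-modify idea described in this comment can be sketched roughly as follows. This is not the code from #4729: it is a plain-Python illustration, the helper name `with_nullable_arrays` is hypothetical, and the nested-dict schema layout is an assumption loosely modeled on a JSON-style schema description.

```python
def with_nullable_arrays(schema):
    """Return a copy of `schema` in which every array type has
    containsNull set to True; the input schema is left untouched."""
    if not isinstance(schema, dict):
        return schema  # primitive type name such as "string"
    result = dict(schema)  # shallow copy; nested parts are rebuilt below
    if schema.get("type") == "array":
        result["containsNull"] = True
        result["elementType"] = with_nullable_arrays(schema["elementType"])
    elif schema.get("type") == "struct":
        result["fields"] = [
            {**field, "type": with_nullable_arrays(field["type"])}
            for field in schema["fields"]
        ]
    return result

# Hypothetical schema inferred from JSON data (assumed layout).
json_schema = {
    "type": "struct",
    "fields": [
        {
            "name": "tags",
            "type": {"type": "array", "elementType": "string", "containsNull": False},
            "nullable": True,
        }
    ],
}

# Relaxed copy to use for the insertion; json_schema itself is unchanged.
insert_schema = with_nullable_arrays(json_schema)
```

The design difference from this PR is where the relaxation happens: here only the schema used for the insertion is relaxed, whereas the PR changes what JSON schema inference reports in the first place.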



[GitHub] spark pull request: [SPARK-6052][SQL]In JSON schema inference, we ...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/4806#issuecomment-76729389
  
    @viirya Making complex types in a JSON relation always nullable could be more robust, and makes more sense for the most common use cases. We don't want to get a schema with the wrong nullability when we happen to have sampled only records without nulls. So this problem should be fixed anyway, regardless of SPARK-5950.



[GitHub] spark pull request: [SPARK-6052][SQL]In JSON schema inference, we ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4806#issuecomment-76349418
  
      [Test build #28050 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28050/consoleFull) for   PR 4806 at commit [`05eab9d`](https://github.com/apache/spark/commit/05eab9d06b1f7f2311660aded13fac38a5b86ad4).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.

