Posted to reviews@spark.apache.org by chenghao-intel <gi...@git.apache.org> on 2015/02/10 03:22:32 UTC

[GitHub] spark pull request: [SPARK-5706] [SQL] Add json schema inferring m...

GitHub user chenghao-intel opened a pull request:

    https://github.com/apache/spark/pull/4492

    [SPARK-5706] [SQL] Add json schema inferring method

    We need to provide a JSON schema inference API so that users do not have to provide the JSON schema manually.
    
    ```
    val completeJsonRecord = """{"struct":{"field1": true, "field2": 92233720368547758070},
               "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]},
               "arrayOfString":["str1", "str2"],
               "arrayOfInteger":[1, 2147483647, -2147483648],
               "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}],
               "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]],
               "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]]
             }"""
    
    val schema = sqlContext.inferJsonSchema(completeJsonRecord)
    val jsonDF = sqlContext.jsonFile("/user/myjson", schema)
    ```
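
As a rough illustration of what inferring a schema from a single JSON record involves, the sketch below works on an already-parsed record (nested `Map`, `Seq`, and primitive values) rather than on a JSON string. The `SimpleType` hierarchy and the `infer` and `merge` functions are hypothetical stand-ins, not Spark's `StructType` and not the `inferJsonSchema` method proposed in this pull request; the widening rules are assumptions chosen for brevity.

```scala
// Hypothetical, simplified model of schema inference; none of this is Spark's API.
object InferSketch {
  sealed trait SimpleType
  case object NullT extends SimpleType
  case object BooleanT extends SimpleType
  case object LongT extends SimpleType
  case object DoubleT extends SimpleType
  case object StringT extends SimpleType
  final case class ArrayT(element: SimpleType) extends SimpleType
  final case class StructT(fields: List[(String, SimpleType)]) extends SimpleType

  // Widen two types to one that can hold both. NullT yields to the other side.
  def merge(a: SimpleType, b: SimpleType): SimpleType = (a, b) match {
    case (x, y) if x == y => x
    case (NullT, x) => x
    case (x, NullT) => x
    case (LongT, DoubleT) | (DoubleT, LongT) => DoubleT
    case (ArrayT(x), ArrayT(y)) => ArrayT(merge(x, y))
    case (StructT(xs), StructT(ys)) =>
      val xm = xs.toMap
      val ym = ys.toMap
      val names = (xs.map(_._1) ++ ys.map(_._1)).distinct.sorted
      StructT(names.map(n => n -> merge(xm.getOrElse(n, NullT), ym.getOrElse(n, NullT))))
    case _ => StringT // assumed fallback for incompatible types
  }

  // Infer a type from an already-parsed JSON value.
  def infer(value: Any): SimpleType = value match {
    case null => NullT
    case _: Boolean => BooleanT
    case _: Int | _: Long => LongT
    case _: Double => DoubleT
    case _: String => StringT
    case xs: Seq[_] =>
      // An array's element type is the merge of all of its elements' types.
      ArrayT(xs.map(infer).foldLeft(NullT: SimpleType)(merge))
    case m: Map[_, _] =>
      // Fields are sorted by name so the result does not depend on map order.
      StructT(m.toList.map { case (k, v) => k.toString -> infer(v) }.sortBy(_._1))
    case _ => StringT
  }
}
```

In this model, a field whose only observed value is `null` stays `NullT` after inference, which is the case the review comments in this thread discuss.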

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chenghao-intel/spark json_inferschema

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4492.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4492
    
----
commit 0768b04c8f549f678a8654305c0e9db9059d7041
Author: Cheng Hao <ha...@intel.com>
Date:   2015-02-10T01:44:54Z

    Add json schema inferring method

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5706] [SQL] Add json schema inferring A...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4492#discussion_r24471716
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala ---
    @@ -127,7 +135,7 @@ private[sql] object JsonRDD extends Logging {
           StructType((topLevelFields ++ structFields).sortBy(_.name))
         }
     
    -    makeStruct(resolved.keySet.toSeq, Nil)
    +    nullTypeToStringType(makeStruct(resolved.keySet.toSeq, Nil))
    --- End diff --
    
    I think we should not apply `nullTypeToStringType` here. Otherwise, we will not know whether there is any `NullType` in the inferred schema. Keeping the `NullType` is important when we want to union schemas in the future. For example, the type of a field can be `NullType` in one dataset and `LongType` in another. If we eagerly convert `NullType` to `StringType`, then after the union we will have `StringType` instead of `LongType`.
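
The concern in the comment above can be shown with a minimal sketch. It uses a flat, hypothetical model of a schema (a `Map` from field name to type); `mergeSchemas` and `nullToString` are illustrative stand-ins for schema merging and for `nullTypeToStringType`, not the code in `JsonRDD.scala`, and the rule that conflicting types fall back to a string type is an assumption.

```scala
// Hypothetical model: shows why the null-to-string conversion should come last.
object NullTypeSketch {
  sealed trait T
  case object NullT extends T
  case object LongT extends T
  case object StringT extends T

  type Schema = Map[String, T]

  // Merge two field types: NullT yields to the other side, conflicts become StringT.
  def mergeType(a: T, b: T): T = (a, b) match {
    case (x, y) if x == y => x
    case (NullT, x) => x
    case (x, NullT) => x
    case _ => StringT
  }

  // Merge two schemas field by field; a missing field is treated as NullT.
  def mergeSchemas(a: Schema, b: Schema): Schema =
    (a.keySet ++ b.keySet).map { k =>
      k -> mergeType(a.getOrElse(k, NullT), b.getOrElse(k, NullT))
    }.toMap

  // Stand-in for nullTypeToStringType: replace every remaining NullT with StringT.
  def nullToString(s: Schema): Schema =
    s.map { case (k, t) => k -> (if (t == NullT) StringT else t) }
}
```

With `ds1 = Map("f" -> NullT)` and `ds2 = Map("f" -> LongT)`, merging first and converting afterwards, `nullToString(mergeSchemas(ds1, ds2))`, keeps `f` as `LongT`. Converting eagerly, `mergeSchemas(nullToString(ds1), ds2)`, turns `f` into `StringT`, because the information that the field was only ever null has already been lost.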




[GitHub] spark pull request: [SPARK-5706] [SQL] Add json schema inferring A...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4492#discussion_r24473025
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala ---
    @@ -127,7 +135,7 @@ private[sql] object JsonRDD extends Logging {
           StructType((topLevelFields ++ structFields).sortBy(_.name))
         }
     
    -    makeStruct(resolved.keySet.toSeq, Nil)
    +    nullTypeToStringType(makeStruct(resolved.keySet.toSeq, Nil))
    --- End diff --
    
    Oh, sorry, I was not clear. I meant merging two `StructType`s in the future.




[GitHub] spark pull request: [SPARK-5706] [SQL] Add json schema inferring A...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4492#discussion_r24472213
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala ---
    @@ -127,7 +135,7 @@ private[sql] object JsonRDD extends Logging {
           StructType((topLevelFields ++ structFields).sortBy(_.name))
         }
     
    -    makeStruct(resolved.keySet.toSeq, Nil)
    +    nullTypeToStringType(makeStruct(resolved.keySet.toSeq, Nil))
    --- End diff --
    
    Yeah, I understand that, but this is in the `createSchema` method, which assumes that the `union` operations have already been done before it is called, doesn't it?
    See https://github.com/chenghao-intel/spark/blob/json_inferschema/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala#L53
    https://github.com/chenghao-intel/spark/blob/json_inferschema/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala#L61
    
    Sorry, I am not sure whether you are talking about merging two `StructType`s in the future. If so, I will update the code.




[GitHub] spark pull request: [SPARK-5706] [SQL] Add json schema inferring A...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4492#issuecomment-73640145
  
      [Test build #27160 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27160/consoleFull) for PR 4492 at commit [`0768b04`](https://github.com/apache/spark/commit/0768b04c8f549f678a8654305c0e9db9059d7041).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5706] [SQL] Add json schema inferring A...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4492#discussion_r24472531
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -380,6 +380,11 @@ class SQLContext(@transient val sparkContext: SparkContext)
         jsonRDD(json.rdd, schema)
       }
     
    +  @Experimental
    +  def inferJsonSchema(json: String): StructType = {
    --- End diff --
    
    Instead of introducing a new interface, it seems better to extract general-purpose utility functions (https://issues.apache.org/jira/browse/SPARK-5260).




[GitHub] spark pull request: [SPARK-5706] [SQL] Add json schema inferring A...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/4492#issuecomment-73838797
  
    Thank you, @cjnolet, for letting me know. I will review the code when it's ready.




[GitHub] spark pull request: [SPARK-5706] [SQL] Add json schema inferring A...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4492#issuecomment-73633274
  
      [Test build #27160 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27160/consoleFull) for PR 4492 at commit [`0768b04`](https://github.com/apache/spark/commit/0768b04c8f549f678a8654305c0e9db9059d7041).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5706] [SQL] Add json schema inferring A...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel closed the pull request at:

    https://github.com/apache/spark/pull/4492




[GitHub] spark pull request: [SPARK-5706] [SQL] Add json schema inferring A...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on the pull request:

    https://github.com/apache/spark/pull/4492#issuecomment-73824018
  
    @rxin @marmbrus any comments on this?




[GitHub] spark pull request: [SPARK-5706] [SQL] Add json schema inferring A...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4492#issuecomment-73640153
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27160/




[GitHub] spark pull request: [SPARK-5706] [SQL] Add json schema inferring A...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4492#discussion_r24473491
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -380,6 +380,11 @@ class SQLContext(@transient val sparkContext: SparkContext)
         jsonRDD(json.rdd, schema)
       }
     
    +  @Experimental
    +  def inferJsonSchema(json: String): StructType = {
    --- End diff --
    
    Thanks for pointing this out! I agree that we should provide a general-purpose utility. Let me know if I can help, or should I close this PR?




[GitHub] spark pull request: [SPARK-5706] [SQL] Add json schema inferring A...

Posted by cjnolet <gi...@git.apache.org>.
Github user cjnolet commented on the pull request:

    https://github.com/apache/spark/pull/4492#issuecomment-73838447
  
    I'm actively working on a PR for SPARK-5260. I have moved a few of the utility functions into an object called JsonSchema. I'll post it soon.




[GitHub] spark pull request: [SPARK-5706] [SQL] Add json schema inferring A...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4492#discussion_r24473563
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -380,6 +380,11 @@ class SQLContext(@transient val sparkContext: SparkContext)
         jsonRDD(json.rdd, schema)
       }
     
    +  @Experimental
    +  def inferJsonSchema(json: String): StructType = {
    --- End diff --
    
    How about we close it for now? Also, feel free to leave a comment on that JIRA. Thank you.

