You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by tanwanirahul <gi...@git.apache.org> on 2016/02/13 19:59:44 UTC

[GitHub] spark pull request: [SPARK-13309][SQL] Fix type inference issue wi...

GitHub user tanwanirahul opened a pull request:

    https://github.com/apache/spark/pull/11194

    [SPARK-13309][SQL] Fix type inference issue with CSV data

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tanwanirahul/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11194.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11194
    
----
commit 5edaa2ac11c4f7ad4d3a6358bccc7eae0e817660
Author: Rahul Tanwani <ra...@rahuls-macbook-pro.local>
Date:   2016-02-13T18:56:16Z

    [SPARK-13309][SQL] Fix type inference issue with CSV data

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13309][SQL] Fix type inference issue wi...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/11194#issuecomment-190072430
  
    Sorry for the delay. I'm merging this in master. Thanks!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13309][SQL] Fix type inference issue wi...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/11194


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13309][SQL] Fix type inference issue wi...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/11194#issuecomment-183846886
  
    cc @HyukjinKwon  want to review this one?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13309][SQL] Fix type inference issue wi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/11194#issuecomment-183883409
  
    Overall, it looks good to me. This was already merged in https://github.com/databricks/spark-csv/pull/261 and the logic looks identical.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13309][SQL] Fix type inference issue wi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11194#issuecomment-183726195
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13309][SQL] Fix type inference issue wi...

Posted by tanwanirahul <gi...@git.apache.org>.
Github user tanwanirahul commented on the pull request:

    https://github.com/apache/spark/pull/11194#issuecomment-188818395
  
    Could we please merge this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13309][SQL] Fix type inference issue wi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/11194#issuecomment-183901833
  
    Actually, I have had a thought that we might have to make a class such as `TestCSVData` for dataset for testing (similarly with [TestJsonData](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/TestJsonData.scala) for JSON datasource) or a class like `CSVTest` (similarly with [OrcTest](https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcTest.scala) fpr ORC datasource) rather than adding test CSV files for everytime.
    
    I think this might better be done in another PR. If you agree on this, I will create an issue and PR for this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13309][SQL] Fix type inference issue wi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11194#discussion_r52842098
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---
    @@ -48,7 +47,11 @@ private[csv] object CSVInferSchema {
           tokenRdd.aggregate(startType)(inferRowType(nullValue), mergeRowTypes)
     
         val structFields = header.zip(rootTypes).map { case (thisHeader, rootType) =>
    -      StructField(thisHeader, rootType, nullable = true)
    +      val dType = rootType match {
    +        case z: NullType => StringType
    --- End diff --
    
    Nit: `case z: =>..` might better be `case _: =>...`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13309][SQL] Fix type inference issue wi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11194#discussion_r52841904
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -387,4 +387,16 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
           verifyCars(carsCopy, withHeader = true)
         }
       }
    +
    +  test("Schema inference correctly identifies the datatype when data is sparse.") {
    +    val df = sqlContext.read
    +      .format("csv")
    +      .option("header", "true")
    +      .option("inferSchema", "true")
    +      .load(testFile(simpleSparseFile))
    +
    +    assert(
    +      df.schema.fields.map{field => field.dataType}.deep ==
    --- End diff --
    
    A little nit (maybe only for me) but how about using `()` instead of `{}`? I see the `{}` has been used with some indentations and new lines in Spark codes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13309][SQL] Fix type inference issue wi...

Posted by tanwanirahul <gi...@git.apache.org>.
Github user tanwanirahul commented on the pull request:

    https://github.com/apache/spark/pull/11194#issuecomment-185603039
  
    @HyukjinKwon @rxin Is this waiting on me? Just want to confirm it I am expected to add anything more. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13309][SQL] Fix type inference issue wi...

Posted by tanwanirahul <gi...@git.apache.org>.
Github user tanwanirahul commented on the pull request:

    https://github.com/apache/spark/pull/11194#issuecomment-190074779
  
    @rxin @HyukjinKwon thank you.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13309][SQL] Fix type inference issue wi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/11194#issuecomment-183880899
  
    @rxin Sure.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13309][SQL] Fix type inference issue wi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11194#issuecomment-183863028
  
    **[Test build #2542 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2542/consoleFull)** for PR 11194 at commit [`5edaa2a`](https://github.com/apache/spark/commit/5edaa2ac11c4f7ad4d3a6358bccc7eae0e817660).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13309][SQL] Fix type inference issue wi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11194#discussion_r52842226
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala ---
    @@ -68,4 +68,10 @@ class InferSchemaSuite extends SparkFunSuite {
         assert(CSVInferSchema.inferField(DoubleType, "\\N", "\\N") == DoubleType)
         assert(CSVInferSchema.inferField(TimestampType, "\\N", "\\N") == TimestampType)
       }
    +
    +  test("Merging Nulltypes should yeild Nulltype.") {
    +    assert(
    +      CSVInferSchema.mergeRowTypes(Array(NullType),
    --- End diff --
    
    Maybe a `val` for this `CSVInferSchema.mergeRowTypes(Array(NullType),Array(NullType))`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13309][SQL] Fix type inference issue wi...

Posted by tanwanirahul <gi...@git.apache.org>.
Github user tanwanirahul commented on the pull request:

    https://github.com/apache/spark/pull/11194#issuecomment-183905612
  
    Yes, could be done. I personally prefer to keep code and data separate. This way:
    
    - Changing the data does not require us to compile the code.
    - Reading through the source code is friendly if it does not contain data in between.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13309][SQL] Fix type inference issue wi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/11194#issuecomment-183902081
  
    test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13309][SQL] Fix type inference issue wi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11194#issuecomment-183846974
  
    **[Test build #2542 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2542/consoleFull)** for PR 11194 at commit [`5edaa2a`](https://github.com/apache/spark/commit/5edaa2ac11c4f7ad4d3a6358bccc7eae0e817660).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13309][SQL] Fix type inference issue wi...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11194#discussion_r52841837
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---
    @@ -66,11 +69,7 @@ private[csv] object CSVInferSchema {
     
       def mergeRowTypes(first: Array[DataType], second: Array[DataType]): Array[DataType] = {
         first.zipAll(second, NullType, NullType).map { case ((a, b)) =>
    --- End diff --
    
    (This is not the part of diff but it might be great to change `((a, b))` to `(a, b)`)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org