Posted to reviews@spark.apache.org by sureshthalamati <gi...@git.apache.org> on 2016/05/13 19:34:08 UTC

[GitHub] spark pull request: [SPARK-15315][SQL] Adding error check to the C...

GitHub user sureshthalamati opened a pull request:

    https://github.com/apache/spark/pull/13105

    [SPARK-15315][SQL] Adding error check to the CSV datasource writer for unsupported complex data types.

    ## What changes were proposed in this pull request?
    
    Adds error handling to the CSV writer for unsupported complex data types. Currently, garbage is written to the output CSV files if the DataFrame schema contains complex data types.
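    
    For illustration, a minimal sketch of the kind of write this check guards against (hypothetical path and column names; the error message is not quoted from the patch):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._
    
    // "tags" is an ArrayType column; CSV has no representation for it.
    val df = Seq((1, Seq("a", "b"))).toDF("id", "tags")
    
    // Before this change, garbage was silently written to the output files.
    // With the check, the write fails fast with an exception naming the
    // unsupported data type.
    df.write.csv("/tmp/csv-complex-out")
    ```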
    
    ## How was this patch tested?
    
    Added new unit test case.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sureshthalamati/spark csv_complex_types_SPARK-15315

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13105.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13105
    
----
commit 15f6086aca62da54ec59ea91bc0c1de8a99c346d
Author: sureshthalamati <su...@gmail.com>
Date:   2016-05-13T07:06:13Z

    Adding error check to the CSV datasource writer for unsupported complex types.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15315][SQL] Adding error check to the C...

Posted by sureshthalamati <gi...@git.apache.org>.
Github user sureshthalamati commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13105#discussion_r63447396
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/DefaultSource.scala ---
    @@ -172,4 +173,13 @@ class DefaultSource extends FileFormat with DataSourceRegister {
             .mapPartitions(_.map(pair => new String(pair._2.getBytes, 0, pair._2.getLength, charset)))
         }
       }
    +
    +  private def verifySchema(schema: StructType): Unit = {
    +    schema.foreach(field => field.dataType match {
    --- End diff --
    
    Thank you @HyukjinKwon for reviewing the PR. Addressed your comment. 



[GitHub] spark pull request: [SPARK-15315][SQL] Adding error check to the C...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/13105#issuecomment-219588474
  
    Just for a note, the JSON data source does this via [`JacksonGenerator.apply()`](https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala#L36-L86), which is equivalent to [`CSVRelation.rowToString()`](https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala#L159-L165). However, JSON seems to do the check there because it needs to traverse the schema as a tree to verify whether each type is supported, so the check happens while writing each row.
    
    For CSV, there is no need to traverse a tree; only the root-level types have to be inspected, because CSV does not support nested types. So the schema can be checked up front, and faster.
    
    Therefore, it seems reasonable to me.
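    
    A rough sketch of that contrast, with hypothetical helper names (illustration only, not the PR's code):
    
    ```scala
    import org.apache.spark.sql.types._
    
    // JSON-style: nested types are legal, so the whole type tree has to be walked.
    def checkTypeTree(dt: DataType): Unit = dt match {
      case st: StructType     => st.foreach(f => checkTypeTree(f.dataType))
      case ArrayType(et, _)   => checkTypeTree(et)
      case MapType(kt, vt, _) => checkTypeTree(kt); checkTypeTree(vt)
      case _                  => () // leaf type: verify it is writable
    }
    
    // CSV-style: nesting is never supported, so one pass over the root fields is enough.
    def checkRootTypes(schema: StructType): Unit = {
      schema.foreach { field =>
        field.dataType match {
          case _: ArrayType | _: MapType | _: StructType =>
            sys.error(s"CSV does not support ${field.dataType.simpleString}")
          case _ => // scalar root types are fine
        }
      }
    }
    ```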



[GitHub] spark pull request: [SPARK-15315][SQL] Adding error check to the C...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13105



[GitHub] spark pull request: [SPARK-15315][SQL] Adding error check to the C...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13105#discussion_r63274194
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/DefaultSource.scala ---
    @@ -172,4 +173,13 @@ class DefaultSource extends FileFormat with DataSourceRegister {
             .mapPartitions(_.map(pair => new String(pair._2.getBytes, 0, pair._2.getLength, charset)))
         }
       }
    +
    +  private def verifySchema(schema: StructType): Unit = {
    +    schema.foreach(field => field.dataType match {
    --- End diff --
    
    (Maybe start the multi-line closure with `{`, i.e. `foreach { field =>`.)




[GitHub] spark pull request: [SPARK-15315][SQL] Adding error check to the C...

Posted by sureshthalamati <gi...@git.apache.org>.
Github user sureshthalamati commented on the pull request:

    https://github.com/apache/spark/pull/13105#issuecomment-219630738
  
    CC @rxin 



[GitHub] spark pull request: [SPARK-15315][SQL] Adding error check to the C...

Posted by sureshthalamati <gi...@git.apache.org>.
Github user sureshthalamati commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13105#discussion_r63448886
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/DefaultSource.scala ---
    @@ -172,4 +173,15 @@ class DefaultSource extends FileFormat with DataSourceRegister {
             .mapPartitions(_.map(pair => new String(pair._2.getBytes, 0, pair._2.getLength, charset)))
         }
       }
    +
    +  private def verifySchema(schema: StructType): Unit = {
    +    schema.foreach {
    +      field => field.dataType match {
    --- End diff --
    
    Thank you for the feedback. Fixed it.



[GitHub] spark pull request: [SPARK-15315][SQL] Adding error check to the C...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/13105#issuecomment-220275858
  
    cc @cloud-fan / @liancheng 



[GitHub] spark pull request: [SPARK-15315][SQL] Adding error check to the C...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13105#issuecomment-220799325
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59078/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15315][SQL] Adding error check to the C...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13105#issuecomment-220799323
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-15315][SQL] Adding error check to the C...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/13105#issuecomment-220795541
  
    ok to test



[GitHub] spark pull request: [SPARK-15315][SQL] Adding error check to the C...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/13105#issuecomment-221133236
  
    thanks, merging to master and 2.0!



[GitHub] spark pull request: [SPARK-15315][SQL] Adding error check to the C...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13105#issuecomment-220799279
  
    **[Test build #59078 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59078/consoleFull)** for PR 13105 at commit [`87c6f27`](https://github.com/apache/spark/commit/87c6f27e8755c6f72e4821cf5cd1b77baf74ed4b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-15315][SQL] Adding error check to the C...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13105#discussion_r63450323
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/DefaultSource.scala ---
    @@ -172,4 +173,15 @@ class DefaultSource extends FileFormat with DataSourceRegister {
             .mapPartitions(_.map(pair => new String(pair._2.getBytes, 0, pair._2.getLength, charset)))
         }
       }
    +
    +  private def verifySchema(schema: StructType): Unit = {
    +    schema.foreach { field =>
    +      field.dataType match {
    +        case _: ArrayType | _: MapType | _: StructType =>
    +          throw new AnalysisException(
    --- End diff --
    
    Just for a note, I am less sure about `AnalysisException`. It seems JSON uses `sys.error()`, while the CSV data source uses `UnsupportedOperationException`, `RuntimeException`, etc. Maybe we need to make the exceptions consistent later.
    
    The documentation for `AnalysisException` says it is
    
    >Thrown when a query fails to analyze, usually because the query itself is invalid.
    
    So, I think it would be nicer to be consistent with CSV's existing behaviour or JSON's. FYI, CSV throws `RuntimeException` when it fails to read due to type errors.



[GitHub] spark pull request: [SPARK-15315][SQL] Adding error check to the C...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13105#issuecomment-219139730
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: [SPARK-15315][SQL] Adding error check to the C...

Posted by sureshthalamati <gi...@git.apache.org>.
Github user sureshthalamati commented on the pull request:

    https://github.com/apache/spark/pull/13105#issuecomment-221347639
  
    Thank you for merging the PR, Wenchen.



[GitHub] spark pull request: [SPARK-15315][SQL] Adding error check to the C...

Posted by sureshthalamati <gi...@git.apache.org>.
Github user sureshthalamati commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13105#discussion_r63470142
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/DefaultSource.scala ---
    @@ -172,4 +173,15 @@ class DefaultSource extends FileFormat with DataSourceRegister {
             .mapPartitions(_.map(pair => new String(pair._2.getBytes, 0, pair._2.getLength, charset)))
         }
       }
    +
    +  private def verifySchema(schema: StructType): Unit = {
    +    schema.foreach { field =>
    +      field.dataType match {
    +        case _: ArrayType | _: MapType | _: StructType =>
    +          throw new AnalysisException(
    --- End diff --
    
    Thank you for taking the time to review the PR. Modified the exception to `UnsupportedOperationException`; it seems to make more sense for this case.
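    
    For reference, a sketch of the check's shape after that change (the message text here is illustrative, not quoted from the merged commit):
    
    ```scala
    // Inside the CSV DefaultSource; assumes org.apache.spark.sql.types._ is imported.
    private def verifySchema(schema: StructType): Unit = {
      schema.foreach { field =>
        field.dataType match {
          case _: ArrayType | _: MapType | _: StructType =>
            throw new UnsupportedOperationException(
              s"CSV data source does not support ${field.dataType.simpleString} data type.")
          case _ => // scalar types are supported
        }
      }
    }
    ```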



[GitHub] spark pull request: [SPARK-15315][SQL] Adding error check to the C...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13105#issuecomment-220795739
  
    **[Test build #59078 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59078/consoleFull)** for PR 13105 at commit [`87c6f27`](https://github.com/apache/spark/commit/87c6f27e8755c6f72e4821cf5cd1b77baf74ed4b).



[GitHub] spark pull request: [SPARK-15315][SQL] Adding error check to the C...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13105#discussion_r63447722
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/DefaultSource.scala ---
    @@ -172,4 +173,15 @@ class DefaultSource extends FileFormat with DataSourceRegister {
             .mapPartitions(_.map(pair => new String(pair._2.getBytes, 0, pair._2.getLength, charset)))
         }
       }
    +
    +  private def verifySchema(schema: StructType): Unit = {
    +    schema.foreach {
    +      field => field.dataType match {
    --- End diff --
    
    Oh, I should have said this more clearly. Here is a good reference: https://github.com/databricks/scala-style-guide#pattern-matching.
    
    I think
    
    ```scala
    schema.foreach { field =>
      field.dataType match { 
    ...
    ```
    would be nicer.

