Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2018/02/24 07:40:47 UTC

[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/20666

    [SPARK-23448][SQL] Clarify JSON and CSV parser behavior in document

    ## What changes were proposed in this pull request?
    
    Clarify JSON and CSV reader behavior in the documentation.
    
    JSON doesn't support partial results for corrupted records.
    CSV only supports partial results for records with more or fewer tokens than the schema.
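
The JSON behavior being documented can be illustrated with a minimal pure-Python sketch. This is not Spark's implementation; the helper name and the corrupt-column default are hypothetical, chosen only to mirror the documented PERMISSIVE semantics:

```python
import json

def parse_json_permissive(line, schema_fields, corrupt_col="_corrupt_record"):
    """Hypothetical sketch of the documented PERMISSIVE semantics for JSON:
    a corrupted record yields null for every schema field (no partial
    result), with the raw text kept in the corrupt-record column."""
    try:
        obj = json.loads(line)
    except ValueError:
        obj = None
    if not isinstance(obj, dict):
        # Corrupted record: no partial result, all schema fields null.
        row = {f: None for f in schema_fields}
        row[corrupt_col] = line
        return row
    row = {f: obj.get(f) for f in schema_fields}
    row[corrupt_col] = None
    return row
```

For example, parsing the truncated record `{"a": 1, "b":` sets both `a` and `b` to null even though `a` was readable, which is what "no partial results" means for JSON.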
    
    ## How was this patch tested?
    
    Passes existing tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-23448-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20666.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20666
    
----
commit 4ad330b1def558e17dfb693d428e1bd69248e5a3
Author: Liang-Chi Hsieh <vi...@...>
Date:   2018-02-24T07:15:11Z

    Clarify JSON and CSV parser behavior.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20666#discussion_r170816958
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -393,13 +395,16 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
             :param mode: allows a mode for dealing with corrupt records during parsing. If None is
                          set, it uses the default value, ``PERMISSIVE``.
     
    -                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
    -                  record, and puts the malformed string into a field configured by \
    -                  ``columnNameOfCorruptRecord``. To keep corrupt records, an user can set \
    -                  a string type field named ``columnNameOfCorruptRecord`` in an \
    -                  user-defined schema. If a schema does not have the field, it drops corrupt \
    -                  records during parsing. When a length of parsed CSV tokens is shorter than \
    -                  an expected length of a schema, it sets `null` for extra fields.
    +                * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
    +                  into a field configured by ``columnNameOfCorruptRecord``, and sets other \
    +                  fields to ``null``. To keep corrupt records, an user can set a string type \
    --- End diff --
    
    ah, I think we need to explain that: for CSV, a record with fewer/more tokens is not a malformed record.
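
A minimal pure-Python sketch of that distinction (hypothetical helper, not Spark's code): a record with fewer tokens than the schema has its missing trailing fields set to null, and a record with extra tokens has the extras dropped; neither case is treated as malformed.

```python
def align_csv_tokens(tokens, schema_fields):
    """Hypothetical sketch: pad short records with null; extras are
    dropped automatically because zip() stops at the shorter sequence."""
    padded = tokens + [None] * (len(schema_fields) - len(tokens))
    return dict(zip(schema_fields, padded))
```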


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87642/
    Test FAILed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87699/
    Test FAILed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87711 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87711/testReport)** for PR 20666 at commit [`654a59b`](https://github.com/apache/spark/commit/654a59bc23da932cff371cd2c01c359b1b597228).


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1106/
    Test PASSed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Thanks @HyukjinKwon @cloud-fan!


---



[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20666#discussion_r170847481
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -393,13 +394,16 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
             :param mode: allows a mode for dealing with corrupt records during parsing. If None is
                          set, it uses the default value, ``PERMISSIVE``.
     
    -                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
    -                  record, and puts the malformed string into a field configured by \
    -                  ``columnNameOfCorruptRecord``. To keep corrupt records, an user can set \
    -                  a string type field named ``columnNameOfCorruptRecord`` in an \
    -                  user-defined schema. If a schema does not have the field, it drops corrupt \
    -                  records during parsing. When a length of parsed CSV tokens is shorter than \
    -                  an expected length of a schema, it sets `null` for extra fields.
    +                * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
    +                  into a field configured by ``columnNameOfCorruptRecord``, and sets other \
    +                  fields to ``null``. To keep corrupt records, an user can set a string type \
    +                  field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    +                  schema does not have the field, it drops corrupt records during parsing. \
    +                  A record with less/more tokens than schema is not a corrupted record. \
    +                  It supports partial result for such records. When it meets a record having \
    --- End diff --
    
    `It supports partial result for such records.` this doesn't look very useful; I think the following sentences already explain this case well.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1077/
    Test PASSed.


---



[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20666#discussion_r170510529
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -209,13 +209,15 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
             :param mode: allows a mode for dealing with corrupt records during parsing. If None is
                          set, it uses the default value, ``PERMISSIVE``.
     
    -                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
    -                 record, and puts the malformed string into a field configured by \
    -                 ``columnNameOfCorruptRecord``. To keep corrupt records, an user can set \
    -                 a string type field named ``columnNameOfCorruptRecord`` in an user-defined \
    -                 schema. If a schema does not have the field, it drops corrupt records during \
    -                 parsing. When inferring a schema, it implicitly adds a \
    -                 ``columnNameOfCorruptRecord`` field in an output schema.
    +                * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
    +                  into a field configured by ``columnNameOfCorruptRecord``, and sets other \
    +                  fields to ``null``. To keep corrupt records, an user can set a string type \
    +                  field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    +                  schema does not have the field, it drops corrupt records during parsing. \
    +                  When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
    +                  field in an output schema. It does not support partial results. Even just one \
    --- End diff --
    
    I think we can drop the last sentence. The doc is already pretty clear in saying `and sets other fields to null`.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87663/
    Test FAILed.


---



[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20666#discussion_r170425099
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -393,13 +395,16 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
             :param mode: allows a mode for dealing with corrupt records during parsing. If None is
                          set, it uses the default value, ``PERMISSIVE``.
     
    -                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
    -                  record, and puts the malformed string into a field configured by \
    -                  ``columnNameOfCorruptRecord``. To keep corrupt records, an user can set \
    -                  a string type field named ``columnNameOfCorruptRecord`` in an \
    -                  user-defined schema. If a schema does not have the field, it drops corrupt \
    -                  records during parsing. When a length of parsed CSV tokens is shorter than \
    -                  an expected length of a schema, it sets `null` for extra fields.
    +                * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
    +                  into a field configured by ``columnNameOfCorruptRecord``, and sets other \
    +                  fields to ``null``. To keep corrupt records, an user can set a string type \
    +                  field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    +                  schema does not have the field, it drops corrupt records during parsing. \
    +                  It supports partial result for the records just with less or more tokens \
    +                  than the schema. When it meets a malformed record whose parsed tokens is \
    --- End diff --
    
    How about ` a malformed record whose parsed tokens is` -> ` a malformed record having the length of parsed tokens shorter than the length of a schema`?


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1026/
    Test PASSed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87642 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87642/testReport)** for PR 20666 at commit [`4ad330b`](https://github.com/apache/spark/commit/4ad330b1def558e17dfb693d428e1bd69248e5a3).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87642 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87642/testReport)** for PR 20666 at commit [`4ad330b`](https://github.com/apache/spark/commit/4ad330b1def558e17dfb693d428e1bd69248e5a3).


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    retest this please


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87689/
    Test FAILed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    That's not related to this change. The issue itself seems to be a behaviour change between 1.6 and 2.x in treating an empty string as null (or not) for double and float columns, which is rather a corner case but does look like an issue. Let me try to fix it while I'm here.
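
The corner case can be sketched as follows. The function name and exact semantics are assumptions for illustration, not Spark's code; the point is that an empty token for a float/double column can either become null (the 1.6-style reading) or raise an error (the 2.x-style reading):

```python
def to_float_or_none(token):
    """Assumed 1.6-style semantics: an empty string maps to null rather
    than raising, while genuinely malformed input still errors."""
    if token == "":
        return None
    return float(token)  # raises ValueError for non-numeric input
```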


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87644/testReport)** for PR 20666 at commit [`4400cf2`](https://github.com/apache/spark/commit/4400cf2eb4d3b1b37c9e299e91db6e4a032e0c3a).


---



[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20666


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87663 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87663/testReport)** for PR 20666 at commit [`1d03d3b`](https://github.com/apache/spark/commit/1d03d3b248821a05dfd2751eeb0c8b657ebc9073).


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by fuqiliang <gi...@git.apache.org>.
Github user fuqiliang commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Hi, I am a Spark user.
    I have a question about the "JSON doesn't support partial results for corrupted records" behavior.
    In Spark 1.6, partial results were given, but after upgrading to 2.2 I lose some meaningful data from my JSON files.
    
    Could I get that data back in Spark 2+? @viirya 
    Thanks for the help.
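
One best-effort workaround (entirely hypothetical; not an official Spark API or a supported feature) is to keep the raw corrupt records by adding a string field named by ``columnNameOfCorruptRecord`` to the schema, then post-process that column to salvage whatever fields still parse. A pure-Python sketch of such a salvage step:

```python
import json

def salvage_fields(raw, wanted_fields):
    """Best-effort recovery of fields from a possibly-truncated JSON record.
    Hypothetical helper for post-processing a corrupt-record column."""
    try:
        obj = json.loads(raw)
    except ValueError:
        obj = {}
        # Crude repair for a truncated object: cut the incomplete trailing
        # pair after the last comma, close the brace, and retry.
        comma = raw.rfind(",")
        if comma > 0:
            try:
                obj = json.loads(raw[:comma] + "}")
            except ValueError:
                obj = {}
    if not isinstance(obj, dict):
        obj = {}
    return {f: obj.get(f) for f in wanted_fields}
```

This recovers, for example, `a = 1` from the truncated record `{"a": 1, "b":`, at the cost of explicit best-effort semantics the application must own.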


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87720 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87720/testReport)** for PR 20666 at commit [`fe260c9`](https://github.com/apache/spark/commit/fe260c9058125e878931fa0cdd0f5312b6e3a1ff).


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87689 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87689/testReport)** for PR 20666 at commit [`4f9b148`](https://github.com/apache/spark/commit/4f9b14803f3eff8057e52e36d13f074ec917bde6).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by fuqiliang <gi...@git.apache.org>.
Github user fuqiliang commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Hi, thanks for the help.
         Do we have any follow-up plans for the data loss in Spark 2.2?
         I tried to use `sql.read.format("my.spark.sql.execution.datasources.json").load(file_path)` by shading spark-sql_2.10:1.6.jar into my.spark.sql-shade.jar, but found that the json.DefaultSource from Spark 1.6 cannot be used in Spark 2.2; it fails with:
    
    > Exception in thread "main" org.apache.spark.sql.AnalysisException: my.spark.sql.execution.datasources.json is not a valid Spark SQL Data Source.;
    > 	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:376)
    > 	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    > 	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
    > 	at com.test.shade.SparkTest$.main(SparkTest.scala:32)
    > 	at com.test.shade.SparkTest.main(SparkTest.scala)
    
    Could I get some other suggestions? @HyukjinKwon 
    
    Thanks a lot.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87720/
    Test FAILed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87689 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87689/testReport)** for PR 20666 at commit [`4f9b148`](https://github.com/apache/spark/commit/4f9b14803f3eff8057e52e36d13f074ec917bde6).


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87721/
    Test FAILed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87711 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87711/testReport)** for PR 20666 at commit [`654a59b`](https://github.com/apache/spark/commit/654a59bc23da932cff371cd2c01c359b1b597228).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20666#discussion_r170499119
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -209,13 +209,15 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
             :param mode: allows a mode for dealing with corrupt records during parsing. If None is
                          set, it uses the default value, ``PERMISSIVE``.
     
    -                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
    -                 record, and puts the malformed string into a field configured by \
    -                 ``columnNameOfCorruptRecord``. To keep corrupt records, an user can set \
    -                 a string type field named ``columnNameOfCorruptRecord`` in an user-defined \
    -                 schema. If a schema does not have the field, it drops corrupt records during \
    -                 parsing. When inferring a schema, it implicitly adds a \
    -                 ``columnNameOfCorruptRecord`` field in an output schema.
    +                * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
    +                  into a field configured by ``columnNameOfCorruptRecord``, and sets other \
    +                  fields to ``null``. To keep corrupt records, an user can set a string type \
    +                  field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    +                  schema does not have the field, it drops corrupt records during parsing. \
    +                  When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
    +                  field in an output schema. It doesn't support partial results. Even just one \
    --- End diff --
    
    Ok.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    cc @cloud-fan @HyukjinKwon To keep the CSV reader behavior for corrupted records, we don't bother refactoring it. But we should update the document and explicitly disable partial results for corrupted records.


---



[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20666#discussion_r170847103
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -393,13 +394,16 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
             :param mode: allows a mode for dealing with corrupt records during parsing. If None is
                          set, it uses the default value, ``PERMISSIVE``.
     
    -                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
    -                  record, and puts the malformed string into a field configured by \
    -                  ``columnNameOfCorruptRecord``. To keep corrupt records, an user can set \
    -                  a string type field named ``columnNameOfCorruptRecord`` in an \
    -                  user-defined schema. If a schema does not have the field, it drops corrupt \
    -                  records during parsing. When a length of parsed CSV tokens is shorter than \
    -                  an expected length of a schema, it sets `null` for extra fields.
    +                * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
    +                  into a field configured by ``columnNameOfCorruptRecord``, and sets other \
    +                  fields to ``null``. To keep corrupt records, an user can set a string type \
    +                  field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    +                  schema does not have the field, it drops corrupt records during parsing. \
    +                  A record with less/more tokens than schema is not a corrupted record. \
    +                  It supports partial result for such records. When it meets a record having \
    +                  the length of parsed tokens shorter than the length of a schema, it sets \
    +                  ``null`` for extra fields. When a length of tokens is longer than a schema, \
    +                  it drops extra tokens.
    --- End diff --
    
    ```
    When it meets a record having fewer tokens than the length of the schema, it sets ``null`` for extra fields.
    When the record has more tokens than the length of the schema, it drops extra tokens.
    ```
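The padding/truncation rule in the suggested wording can be sketched in plain Python (illustrative only, not Spark internals; `align_tokens` is a hypothetical helper):

```python
import csv
import io

def align_tokens(tokens, schema_len):
    """CSV PERMISSIVE sketch: a row with a different token count is
    NOT corrupt. Fewer tokens than the schema -> pad the missing
    fields with None; more tokens -> drop the extras."""
    if len(tokens) < schema_len:
        return tokens + [None] * (schema_len - len(tokens))
    return tokens[:schema_len]

# Three records against a 2-column schema: exact, too long, too short.
rows = list(csv.reader(io.StringIO("a,1\nb,2,EXTRA\nc\n")))
aligned = [align_tokens(r, 2) for r in rows]
```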


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1117/
    Test PASSed.


---



[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20666#discussion_r170418454
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
    @@ -550,12 +552,14 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
        * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
        *    during parsing. It supports the following case-insensitive modes.
        *   <ul>
    -   *     <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts
    -   *     the malformed string into a field configured by `columnNameOfCorruptRecord`. To keep
    +   *     <li>`PERMISSIVE` : when it meets a corrupted record, puts the malformed string into a
    +   *     field configured by `columnNameOfCorruptRecord`, and sets other fields to `null`. To keep
        *     corrupt records, an user can set a string type field named `columnNameOfCorruptRecord`
        *     in an user-defined schema. If a schema does not have the field, it drops corrupt records
    -   *     during parsing. When a length of parsed CSV tokens is shorter than an expected length
    -   *     of a schema, it sets `null` for extra fields.</li>
    +   *     during parsing. It supports partial result for the records just with less or more tokens
    --- End diff --
    
    Yes. Will update accordingly.


---



[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20666#discussion_r170797261
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -393,13 +395,16 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
             :param mode: allows a mode for dealing with corrupt records during parsing. If None is
                          set, it uses the default value, ``PERMISSIVE``.
     
    -                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
    -                  record, and puts the malformed string into a field configured by \
    -                  ``columnNameOfCorruptRecord``. To keep corrupt records, an user can set \
    -                  a string type field named ``columnNameOfCorruptRecord`` in an \
    -                  user-defined schema. If a schema does not have the field, it drops corrupt \
    -                  records during parsing. When a length of parsed CSV tokens is shorter than \
    -                  an expected length of a schema, it sets `null` for extra fields.
    +                * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
    +                  into a field configured by ``columnNameOfCorruptRecord``, and sets other \
    +                  fields to ``null``. To keep corrupt records, an user can set a string type \
    --- End diff --
    
    I think this is talking about a corrupted record, not a record with fewer/more tokens. If the CSV parser fails to parse a record, all the other fields are set to null.



---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    retest this please.
    



---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87735 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87735/testReport)** for PR 20666 at commit [`daa326d`](https://github.com/apache/spark/commit/daa326d9973b837f2b62d28c9382fbc4b8339659).


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87641/
    Test FAILed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87735/
    Test PASSed.


---



[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20666#discussion_r170846293
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -393,13 +394,16 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
             :param mode: allows a mode for dealing with corrupt records during parsing. If None is
                          set, it uses the default value, ``PERMISSIVE``.
     
    -                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
    -                  record, and puts the malformed string into a field configured by \
    -                  ``columnNameOfCorruptRecord``. To keep corrupt records, an user can set \
    -                  a string type field named ``columnNameOfCorruptRecord`` in an \
    -                  user-defined schema. If a schema does not have the field, it drops corrupt \
    -                  records during parsing. When a length of parsed CSV tokens is shorter than \
    -                  an expected length of a schema, it sets `null` for extra fields.
    +                * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
    +                  into a field configured by ``columnNameOfCorruptRecord``, and sets other \
    +                  fields to ``null``. To keep corrupt records, an user can set a string type \
    +                  field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    +                  schema does not have the field, it drops corrupt records during parsing. \
    +                  A record with less/more tokens than schema is not a corrupted record. \
    --- End diff --
    
    `.. not a corrupted record to CSV.`


---



[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20666#discussion_r170510616
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -393,13 +395,16 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
             :param mode: allows a mode for dealing with corrupt records during parsing. If None is
                          set, it uses the default value, ``PERMISSIVE``.
     
    -                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
    -                  record, and puts the malformed string into a field configured by \
    -                  ``columnNameOfCorruptRecord``. To keep corrupt records, an user can set \
    -                  a string type field named ``columnNameOfCorruptRecord`` in an \
    -                  user-defined schema. If a schema does not have the field, it drops corrupt \
    -                  records during parsing. When a length of parsed CSV tokens is shorter than \
    -                  an expected length of a schema, it sets `null` for extra fields.
    +                * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
    +                  into a field configured by ``columnNameOfCorruptRecord``, and sets other \
    +                  fields to ``null``. To keep corrupt records, an user can set a string type \
    --- End diff --
    
    we can't say `and sets other fields to null`, as it's not the case for CSV


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87644/
    Test PASSed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87641 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87641/testReport)** for PR 20666 at commit [`4ad330b`](https://github.com/apache/spark/commit/4ad330b1def558e17dfb693d428e1bd69248e5a3).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    retest this please.


---







[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87711/
    Test PASSed.


---



[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20666#discussion_r170425193
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -209,13 +209,15 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
             :param mode: allows a mode for dealing with corrupt records during parsing. If None is
                          set, it uses the default value, ``PERMISSIVE``.
     
    -                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
    -                 record, and puts the malformed string into a field configured by \
    -                 ``columnNameOfCorruptRecord``. To keep corrupt records, an user can set \
    -                 a string type field named ``columnNameOfCorruptRecord`` in an user-defined \
    -                 schema. If a schema does not have the field, it drops corrupt records during \
    -                 parsing. When inferring a schema, it implicitly adds a \
    -                 ``columnNameOfCorruptRecord`` field in an output schema.
    +                * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
    +                  into a field configured by ``columnNameOfCorruptRecord``, and sets other \
    +                  fields to ``null``. To keep corrupt records, an user can set a string type \
    +                  field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    +                  schema does not have the field, it drops corrupt records during parsing. \
    +                  When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
    --- End diff --
    
    I think we should say `it implicitly adds ... if a corrupted record is found ` while we are here? I think it only adds `` `columnNameOfCorruptRecord` `` when it meets a corrupted record during schema inference.


---





[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20666#discussion_r170425254
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -209,13 +209,15 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
             :param mode: allows a mode for dealing with corrupt records during parsing. If None is
                          set, it uses the default value, ``PERMISSIVE``.
     
    -                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
    -                 record, and puts the malformed string into a field configured by \
    -                 ``columnNameOfCorruptRecord``. To keep corrupt records, an user can set \
    -                 a string type field named ``columnNameOfCorruptRecord`` in an user-defined \
    -                 schema. If a schema does not have the field, it drops corrupt records during \
    -                 parsing. When inferring a schema, it implicitly adds a \
    -                 ``columnNameOfCorruptRecord`` field in an output schema.
    +                * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
    +                  into a field configured by ``columnNameOfCorruptRecord``, and sets other \
    +                  fields to ``null``. To keep corrupt records, an user can set a string type \
    +                  field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    +                  schema does not have the field, it drops corrupt records during parsing. \
    +                  When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
    +                  field in an output schema. It doesn't support partial results. Even just one \
    --- End diff --
    
    It's trivial, but how about we avoid a contraction like `doesn't`? It's usually what I do for docs, although I am not sure if it actually matters.


---





[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    LGTM


---





[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by fuqiliang <gi...@git.apache.org>.
Github user fuqiliang commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    To be specific, the JSON file (Sanity4.json) is:
    
    ```json
    {"a":"a1","int":1,"other":4.4}
    {"a":"a2","int":"","other":""}
    ```
    
    Code:
    
    ```scala
    val config = new SparkConf().setMaster("local[5]").setAppName("test")
    val sc = SparkContext.getOrCreate(config)
    val sql = new SQLContext(sc)
    
    val file_path = this.getClass.getClassLoader.getResource("Sanity4.json").getFile
    val df = sql.read.schema(null).json(file_path)
    df.show(30)
    ```
    
    
    
    Then in Spark 1.6, the result is:
    +---+----+-----+
    |  a| int|other|
    +---+----+-----+
    | a1|   1|  4.4|
    | a2|null| null|
    +---+----+-----+
    
    root
     |-- a: string (nullable = true)
     |-- int: long (nullable = true)
     |-- other: double (nullable = true)
    
    But in Spark 2.2, the result is:
    +----+----+-----+
    |   a| int|other|
    +----+----+-----+
    |  a1|   1|  4.4|
    |null|null| null|
    +----+----+-----+
    
    root
     |-- a: string (nullable = true)
     |-- int: long (nullable = true)
     |-- other: double (nullable = true)
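A hedged reading of the difference above, as an illustrative pure-Python sketch (not Spark code; the field-to-converter map is made up): 1.6-style parsing kept the fields that did convert, while 2.x treats the whole record as corrupt once any field fails.

```python
import json

FIELDS = {"a": str, "int": int, "other": float}  # illustrative schema

def parse_partial(line):
    """1.6-style: keep convertible fields, null out only the bad ones."""
    obj = json.loads(line)
    out = {}
    for name, conv in FIELDS.items():
        try:
            out[name] = conv(obj[name])
        except (ValueError, TypeError, KeyError):
            out[name] = None
    return out

def parse_whole_record(line):
    """2.x-style: one bad field makes the whole record null."""
    obj = json.loads(line)
    try:
        return {name: conv(obj[name]) for name, conv in FIELDS.items()}
    except (ValueError, TypeError, KeyError):
        return {name: None for name in FIELDS}

# The second record from Sanity4.json: "" cannot convert to long/double.
line = '{"a":"a2","int":"","other":""}'
```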
    



---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1105/
    Test PASSed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1087/
    Test PASSed.


---



[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20666#discussion_r170510027
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -209,13 +209,15 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
             :param mode: allows a mode for dealing with corrupt records during parsing. If None is
                          set, it uses the default value, ``PERMISSIVE``.
     
    -                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
    -                 record, and puts the malformed string into a field configured by \
    -                 ``columnNameOfCorruptRecord``. To keep corrupt records, an user can set \
    -                 a string type field named ``columnNameOfCorruptRecord`` in an user-defined \
    -                 schema. If a schema does not have the field, it drops corrupt records during \
    -                 parsing. When inferring a schema, it implicitly adds a \
    -                 ``columnNameOfCorruptRecord`` field in an output schema.
    +                * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
    +                  into a field configured by ``columnNameOfCorruptRecord``, and sets other \
    +                  fields to ``null``. To keep corrupt records, an user can set a string type \
    +                  field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    +                  schema does not have the field, it drops corrupt records during parsing. \
    +                  When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
    --- End diff --
    
    Ah I thought this:
    
    ```
    When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` field in an output schema.
    ```
    
    describes schema inference, because it adds a `columnNameOfCorruptRecord` column if a malformed record is found during schema inference. I mean:
    
    ```scala
    scala> spark.read.json(Seq("""{"a": 1}""", """{"a":""").toDS).printSchema()
    root
     |-- _corrupt_record: string (nullable = true)
     |-- a: long (nullable = true)
    
    
    scala> spark.read.json(Seq("""{"a": 1}""").toDS).printSchema()
    root
     |-- a: long (nullable = true)
    ```
    
    but yes I think I misread it. Here we describe things mainly about malformed records already.
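The inference behavior shown in the Scala snippet above can be mimicked with a small pure-Python sketch (`infer_schema` is a hypothetical helper, for illustration only): the corrupt-record column appears in the inferred schema only when at least one record fails to parse.

```python
import json

def infer_schema(lines, corrupt_col="_corrupt_record"):
    """Sketch of JSON schema inference: collect field names from
    parseable records, and add the corrupt-record column only when
    a malformed record is encountered."""
    fields, saw_corrupt = set(), False
    for line in lines:
        try:
            fields.update(json.loads(line).keys())
        except ValueError:
            saw_corrupt = True
    return sorted(fields | ({corrupt_col} if saw_corrupt else set()))
```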


---





[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87720 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87720/testReport)** for PR 20666 at commit [`fe260c9`](https://github.com/apache/spark/commit/fe260c9058125e878931fa0cdd0f5312b6e3a1ff).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Merged to master and branch-2.3.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87699 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87699/testReport)** for PR 20666 at commit [`4f9b148`](https://github.com/apache/spark/commit/4f9b14803f3eff8057e52e36d13f074ec917bde6).


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87663 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87663/testReport)** for PR 20666 at commit [`1d03d3b`](https://github.com/apache/spark/commit/1d03d3b248821a05dfd2751eeb0c8b657ebc9073).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87721 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87721/testReport)** for PR 20666 at commit [`daa326d`](https://github.com/apache/spark/commit/daa326d9973b837f2b62d28c9382fbc4b8339659).


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87644/testReport)** for PR 20666 at commit [`4400cf2`](https://github.com/apache/spark/commit/4400cf2eb4d3b1b37c9e299e91db6e4a032e0c3a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87721 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87721/testReport)** for PR 20666 at commit [`daa326d`](https://github.com/apache/spark/commit/daa326d9973b837f2b62d28c9382fbc4b8339659).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1095/
    Test PASSed.


---



[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20666#discussion_r170847584
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -209,13 +209,14 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
             :param mode: allows a mode for dealing with corrupt records during parsing. If None is
                          set, it uses the default value, ``PERMISSIVE``.
     
    -                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
    -                 record, and puts the malformed string into a field configured by \
    -                 ``columnNameOfCorruptRecord``. To keep corrupt records, an user can set \
    -                 a string type field named ``columnNameOfCorruptRecord`` in an user-defined \
    -                 schema. If a schema does not have the field, it drops corrupt records during \
    -                 parsing. When inferring a schema, it implicitly adds a \
    -                 ``columnNameOfCorruptRecord`` field in an output schema.
    +                * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
    +                  into a field configured by ``columnNameOfCorruptRecord``, and sets other \
    +                  fields to ``null``. It does not support partial results. To keep corrupt \
    --- End diff --
    
    `It does not support partial results.` I think we don't need to mention this for json.
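
    For reference, the all-or-nothing behavior being discussed here can be sketched outside Spark. The following is a toy model in plain Python (not Spark's actual implementation; `_corrupt_record` mirrors Spark's default `columnNameOfCorruptRecord` name): when a JSON record fails to parse, every schema field becomes null and the raw text goes into the corrupt-record column, with no partial result even for fields that look parseable.

    ```python
    import json

    def parse_permissive(record, fields, corrupt_col="_corrupt_record"):
        """Toy model of JSON PERMISSIVE mode: no partial results.

        A record that fails to parse yields None for *all* schema fields,
        and the raw text is preserved in the corrupt-record column.
        """
        try:
            obj = json.loads(record)
        except json.JSONDecodeError:
            row = {f: None for f in fields}
            row[corrupt_col] = record
            return row
        row = {f: obj.get(f) for f in fields}
        row[corrupt_col] = None
        return row

    rows = [parse_permissive(r, ["a", "b"])
            for r in ['{"a": 1, "b": 2}', '{"a": 1,']]
    # The second record is malformed: even the parseable-looking "a" is
    # dropped, not kept as a partial result.
    ```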


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87735 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87735/testReport)** for PR 20666 at commit [`daa326d`](https://github.com/apache/spark/commit/daa326d9973b837f2b62d28c9382fbc4b8339659).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20666#discussion_r170417628
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
    @@ -550,12 +552,14 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
        * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
        *    during parsing. It supports the following case-insensitive modes.
        *   <ul>
    -   *     <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts
    -   *     the malformed string into a field configured by `columnNameOfCorruptRecord`. To keep
    +   *     <li>`PERMISSIVE` : when it meets a corrupted record, puts the malformed string into a
    +   *     field configured by `columnNameOfCorruptRecord`, and sets other fields to `null`. To keep
        *     corrupt records, an user can set a string type field named `columnNameOfCorruptRecord`
        *     in an user-defined schema. If a schema does not have the field, it drops corrupt records
    -   *     during parsing. When a length of parsed CSV tokens is shorter than an expected length
    -   *     of a schema, it sets `null` for extra fields.</li>
    +   *     during parsing. It supports partial result for the records just with less or more tokens
    --- End diff --
    
    I think there are the same instances to update in `DataStreamReader`, `readwriter.py` and `streaming.py` too.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87641 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87641/testReport)** for PR 20666 at commit [`4ad330b`](https://github.com/apache/spark/commit/4ad330b1def558e17dfb693d428e1bd69248e5a3).


---



[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20666#discussion_r170845040
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -393,13 +395,16 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
             :param mode: allows a mode for dealing with corrupt records during parsing. If None is
                          set, it uses the default value, ``PERMISSIVE``.
     
    -                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
    -                  record, and puts the malformed string into a field configured by \
    -                  ``columnNameOfCorruptRecord``. To keep corrupt records, an user can set \
    -                  a string type field named ``columnNameOfCorruptRecord`` in an \
    -                  user-defined schema. If a schema does not have the field, it drops corrupt \
    -                  records during parsing. When a length of parsed CSV tokens is shorter than \
    -                  an expected length of a schema, it sets `null` for extra fields.
    +                * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
    +                  into a field configured by ``columnNameOfCorruptRecord``, and sets other \
    +                  fields to ``null``. To keep corrupt records, an user can set a string type \
    --- End diff --
    
    Ok. Added.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1029/
    Test PASSed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1051/
    Test PASSed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    **[Test build #87699 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87699/testReport)** for PR 20666 at commit [`4f9b148`](https://github.com/apache/spark/commit/4f9b14803f3eff8057e52e36d13f074ec917bde6).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20666#discussion_r170499102
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -393,13 +395,16 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
             :param mode: allows a mode for dealing with corrupt records during parsing. If None is
                          set, it uses the default value, ``PERMISSIVE``.
     
    -                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
    -                  record, and puts the malformed string into a field configured by \
    -                  ``columnNameOfCorruptRecord``. To keep corrupt records, an user can set \
    -                  a string type field named ``columnNameOfCorruptRecord`` in an \
    -                  user-defined schema. If a schema does not have the field, it drops corrupt \
    -                  records during parsing. When a length of parsed CSV tokens is shorter than \
    -                  an expected length of a schema, it sets `null` for extra fields.
    +                * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
    +                  into a field configured by ``columnNameOfCorruptRecord``, and sets other \
    +                  fields to ``null``. To keep corrupt records, an user can set a string type \
    +                  field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    +                  schema does not have the field, it drops corrupt records during parsing. \
    +                  It supports partial result for the records just with less or more tokens \
    +                  than the schema. When it meets a malformed record whose parsed tokens is \
    --- End diff --
    
    Ok.
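
    The CSV-side behavior under discussion — partial results only for rows with fewer or more tokens than the schema — can be sketched as a toy in plain Python (not Spark's univocity-based parser; it assumes a naive comma split with no quoting):

    ```python
    def parse_csv_permissive(line, schema_width):
        """Toy model of CSV PERMISSIVE mode for token-count mismatches.

        Fewer tokens than the schema: missing fields become None (a
        partial result). More tokens: the extra tokens are dropped
        (also a partial result).
        """
        tokens = line.split(",")
        if len(tokens) < schema_width:
            return tokens + [None] * (schema_width - len(tokens))
        return tokens[:schema_width]

    parse_csv_permissive("1,2", 3)      # shorter row -> ['1', '2', None]
    parse_csv_permissive("1,2,3,4", 3)  # longer row  -> ['1', '2', '3']
    ```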


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1027/
    Test PASSed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser be...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20666#discussion_r170498519
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -209,13 +209,15 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
             :param mode: allows a mode for dealing with corrupt records during parsing. If None is
                          set, it uses the default value, ``PERMISSIVE``.
     
    -                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
    -                 record, and puts the malformed string into a field configured by \
    -                 ``columnNameOfCorruptRecord``. To keep corrupt records, an user can set \
    -                 a string type field named ``columnNameOfCorruptRecord`` in an user-defined \
    -                 schema. If a schema does not have the field, it drops corrupt records during \
    -                 parsing. When inferring a schema, it implicitly adds a \
    -                 ``columnNameOfCorruptRecord`` field in an output schema.
    +                * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
    +                  into a field configured by ``columnNameOfCorruptRecord``, and sets other \
    +                  fields to ``null``. To keep corrupt records, an user can set a string type \
    +                  field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    +                  schema does not have the field, it drops corrupt records during parsing. \
    +                  When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
    --- End diff --
    
    When users set a string type field named `columnNameOfCorruptRecord` in a user-defined schema, I think the field is still added even if there is no corrupted record. Or did I misread this sentence?
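
    The inference-side behavior shown in the earlier `printSchema()` example can be modeled as a toy in plain Python (not Spark's inference; `_corrupt_record` stands in for the default `columnNameOfCorruptRecord`): during schema inference the corrupt-record column is added only when a malformed record is actually found.

    ```python
    import json

    def infer_schema(records, corrupt_col="_corrupt_record"):
        """Toy model of JSON schema inference: the corrupt-record column
        is added to the inferred schema only if a malformed record is
        seen while scanning the input."""
        fields, saw_corrupt = set(), False
        for rec in records:
            try:
                fields.update(json.loads(rec))
            except json.JSONDecodeError:
                saw_corrupt = True
        schema = sorted(fields)
        return ([corrupt_col] + schema) if saw_corrupt else schema

    infer_schema(['{"a": 1}', '{"a":'])  # corrupt input present
    infer_schema(['{"a": 1}'])           # all records well-formed
    ```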


---



[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20666
  
    retest this please.


---
