Posted to commits@spark.apache.org by ma...@apache.org on 2022/04/25 17:26:14 UTC

[spark] branch master updated: [SPARK-39001][SQL][DOCS] Document which options are unsupported in CSV and JSON functions

This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 10a643c8af3 [SPARK-39001][SQL][DOCS] Document which options are unsupported in CSV and JSON functions
10a643c8af3 is described below

commit 10a643c8af368cce131ef217f6ef610bf84f8b9c
Author: Hyukjin Kwon <gu...@apache.org>
AuthorDate: Mon Apr 25 20:25:56 2022 +0300

    [SPARK-39001][SQL][DOCS] Document which options are unsupported in CSV and JSON functions
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to document which options do not work and are explicitly unsupported in CSV and JSON functions.
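    
    For context, a minimal Scala sketch of the distinction being documented (the schema, data, and path below are made up for illustration and are not part of this commit): file-based reads honor options such as `header`, while the built-in function `from_csv` parses a single column value per row, so there is no header line to consume and the option is ignored.
    
        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.from_csv
        import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
    
        val spark = SparkSession.builder().master("local[*]").appName("csv-option-scope").getOrCreate()
        import spark.implicits._
    
        val schema = new StructType().add("name", StringType).add("age", IntegerType)
    
        // File read: `header` is honored, so the first line would be treated as column names.
        // spark.read.schema(schema).option("header", "true").csv("/tmp/people.csv")
    
        // Built-in function: `header` is silently ignored; the string is parsed as data.
        val parsed = Seq("Alice,30").toDF("raw")
          .select(from_csv($"raw", schema, Map("header" -> "true")).as("person"))
        parsed.show(false)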
    
    ### Why are the changes needed?
    
    To prevent users from misunderstanding the options.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, it documents which options don't work in CSV/JSON expressions.
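    
    As a hedged example of the JSON side (sample data only, not taken from the commit): `from_json` supports only the PERMISSIVE and FAILFAST parse modes, so under the default mode a malformed record comes back as nulls instead of being dropped, whereas a file read with `mode` set to DROPMALFORMED would skip it.
    
        import org.apache.spark.sql.functions.from_json
        import org.apache.spark.sql.types.{IntegerType, StructType}
    
        // Reuses the SparkSession and implicits from the sketch above.
        val jsonSchema = new StructType().add("a", IntegerType)
        val rows = Seq("""{"a": 1}""", """{"a":""").toDF("raw")
          .select(from_json($"raw", jsonSchema).as("parsed"))
    
        // Default PERMISSIVE mode: the malformed second record yields nulls rather
        // than being dropped; DROPMALFORMED is not accepted by from_json.
        rows.show(false)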
    
    ### How was this patch tested?
    
    I manually built the docs and checked the HTML output.
    
    Closes #36339 from HyukjinKwon/SPARK-39001.
    
    Authored-by: Hyukjin Kwon <gu...@apache.org>
    Signed-off-by: Max Gekk <ma...@gmail.com>
---
 docs/sql-data-sources-csv.md  | 18 +++++++++---------
 docs/sql-data-sources-json.md | 16 ++++++++--------
 2 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/docs/sql-data-sources-csv.md b/docs/sql-data-sources-csv.md
index 1dfe8568f9a..1be1d7446e8 100644
--- a/docs/sql-data-sources-csv.md
+++ b/docs/sql-data-sources-csv.md
@@ -63,7 +63,7 @@ Data source options of CSV can be set via:
   <tr>
     <td><code>encoding</code></td>
     <td>UTF-8</td>
-    <td>For reading, decodes the CSV files by the given encoding type. For writing, specifies encoding (charset) of saved CSV files</td>
+    <td>For reading, decodes the CSV files by the given encoding type. For writing, specifies encoding (charset) of saved CSV files. CSV built-in functions ignore this option.</td>
     <td>read/write</td>
   </tr>
   <tr>
@@ -99,19 +99,19 @@ Data source options of CSV can be set via:
   <tr>
     <td><code>header</code></td>
     <td>false</td>
-    <td>For reading, uses the first line as names of columns. For writing, writes the names of columns as the first line. Note that if the given path is a RDD of Strings, this header option will remove all lines same with the header if exists.</td>
+    <td>For reading, uses the first line as names of columns. For writing, writes the names of columns as the first line. Note that if the given path is a RDD of Strings, this header option will remove all lines same with the header if exists. CSV built-in functions ignore this option.</td>
     <td>read/write</td>
   </tr>
   <tr>
     <td><code>inferSchema</code></td>
     <td>false</td>
-    <td>Infers the input schema automatically from data. It requires one extra pass over the data.</td>
+    <td>Infers the input schema automatically from data. It requires one extra pass over the data. CSV built-in functions ignore this option.</td>
     <td>read</td>
   </tr>
   <tr>
     <td><code>enforceSchema</code></td>
     <td>true</td>
-    <td>If it is set to <code>true</code>, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to <code>false</code>, the schema will be validated against all headers in CSV files in the case when the <code>header</code> option is set to <code>true</code>. Field names in the schema and column names in CSV headers are checked by their positions taking into account <code>spark.sql.caseSensitive</code> [...]
+    <td>If it is set to <code>true</code>, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to <code>false</code>, the schema will be validated against all headers in CSV files in the case when the <code>header</code> option is set to <code>true</code>. Field names in the schema and column names in CSV headers are checked by their positions taking into account <code>spark.sql.caseSensitive</code> [...]
     <td>read</td>
   </tr>
   <tr>
@@ -186,7 +186,7 @@ Data source options of CSV can be set via:
     <td>Allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes. Note that Spark tries to parse only required columns in CSV under column pruning. Therefore, corrupt records can be different based on required set of fields. This behavior can be controlled by <code>spark.sql.csv.parser.columnPruning.enabled</code> (enabled by default).<br>
     <ul>
       <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, an user can set a string type field named <code>columnNameOfCorruptRecord</code> in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. A record with less/more tokens than schema is not a corrupted record to [...]
-      <li><code>DROPMALFORMED</code>: ignores the whole corrupted records.</li>
+      <li><code>DROPMALFORMED</code>: ignores the whole corrupted records. This mode is unsupported in the CSV built-in functions.</li>
       <li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
     </ul>
     </td>
@@ -201,7 +201,7 @@ Data source options of CSV can be set via:
   <tr>
     <td><code>multiLine</code></td>
     <td>false</td>
-    <td>Parse one record, which may span multiple lines, per file.</td>
+    <td>Parse one record, which may span multiple lines, per file. CSV built-in functions ignore this option.</td>
     <td>read</td>
   </tr>
   <tr>
@@ -213,7 +213,7 @@ Data source options of CSV can be set via:
   <tr>
     <td><code>samplingRatio</code></td>
     <td>1.0</td>
-    <td>Defines fraction of rows used for schema inferring.</td>
+    <td>Defines fraction of rows used for schema inferring. CSV built-in functions ignore this option.</td>
     <td>read</td>
   </tr>
   <tr>
@@ -231,7 +231,7 @@ Data source options of CSV can be set via:
   <tr>
     <td><code>lineSep</code></td>
     <td><code>\r</code>, <code>\r\n</code> and <code>\n</code> (for reading), <code>\n</code> (for writing)</td>
-    <td>Defines the line separator that should be used for parsing/writing. Maximum length is 1 character.</td>
+    <td>Defines the line separator that should be used for parsing/writing. Maximum length is 1 character. CSV built-in functions ignore this option.</td>
     <td>read/write</td>
   </tr>
   <tr>
@@ -251,7 +251,7 @@ Data source options of CSV can be set via:
   <tr>
     <td><code>compression</code></td>
     <td>(none)</td>
-    <td>Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (<code>none</code>, <code>bzip2</code>, <code>gzip</code>, <code>lz4</code>, <code>snappy</code> and <code>deflate</code>).</td>
+    <td>Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (<code>none</code>, <code>bzip2</code>, <code>gzip</code>, <code>lz4</code>, <code>snappy</code> and <code>deflate</code>). CSV built-in functions ignore this option.</td>
     <td>write</td>
   </tr>
 </table>
diff --git a/docs/sql-data-sources-json.md b/docs/sql-data-sources-json.md
index f75efd1108a..25e9db10978 100644
--- a/docs/sql-data-sources-json.md
+++ b/docs/sql-data-sources-json.md
@@ -127,13 +127,13 @@ Data source options of JSON can be set via:
   <tr>
     <td><code>primitivesAsString</code></td>
     <td><code>false</code></td>
-    <td>Infers all primitive values as a string type.</td>
+    <td>Infers all primitive values as a string type. JSON built-in functions ignore this option.</td>
     <td>read</td>
   </tr>
   <tr>
     <td><code>prefersDecimal</code></td>
     <td><code>false</code></td>
-    <td>Infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles.</td>
+    <td>Infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles. JSON built-in functions ignore this option.</td>
     <td>read</td>
   </tr>
   <tr>
@@ -172,7 +172,7 @@ Data source options of JSON can be set via:
     <td>Allows a mode for dealing with corrupt records during parsing.<br>
     <ul>
       <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, an user can set a string type field named <code>columnNameOfCorruptRecord</code> in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a <code>columnNameOfCorrupt [...]
-      <li><code>DROPMALFORMED</code>: ignores the whole corrupted records.</li>
+      <li><code>DROPMALFORMED</code>: ignores the whole corrupted records. This mode is unsupported in the JSON built-in functions.</li>
       <li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
     </ul>
     </td>
@@ -205,7 +205,7 @@ Data source options of JSON can be set via:
   <tr>
     <td><code>multiLine</code></td>
     <td><code>false</code></td>
-    <td>Parse one record, which may span multiple lines, per file.</td>
+    <td>Parse one record, which may span multiple lines, per file. JSON built-in functions ignore this option.</td>
     <td>read</td>
   </tr>
   <tr>
@@ -217,13 +217,13 @@ Data source options of JSON can be set via:
   <tr>
     <td><code>encoding</code></td>
     <td>Detected automatically when <code>multiLine</code> is set to <code>true</code> (for reading), <code>UTF-8</code> (for writing)</td>
-    <td>For reading, allows to forcibly set one of standard basic or extended encoding for the JSON files. For example UTF-16BE, UTF-32LE. For writing, Specifies encoding (charset) of saved json files.</td>
+    <td>For reading, allows to forcibly set one of standard basic or extended encoding for the JSON files. For example UTF-16BE, UTF-32LE. For writing, Specifies encoding (charset) of saved json files. JSON built-in functions ignore this option.</td>
     <td>read/write</td>
   </tr>
   <tr>
     <td><code>lineSep</code></td>
     <td><code>\r</code>, <code>\r\n</code>, <code>\n</code> (for reading), <code>\n</code> (for writing)</td>
-    <td>Defines the line separator that should be used for parsing.</td>
+    <td>Defines the line separator that should be used for parsing. JSON built-in functions ignore this option.</td>
     <td>read/write</td>
   </tr>
   <tr>
@@ -235,7 +235,7 @@ Data source options of JSON can be set via:
   <tr>
     <td><code>dropFieldIfAllNull</code></td>
     <td><code>false</code></td>
-    <td>Whether to ignore column of all null values or empty array during schema inference.</td>
+    <td>Whether to ignore column of all null values or empty array during schema inference. JSON built-in functions ignore this option.</td>
     <td>read</td>
   </tr>
   <tr>
@@ -259,7 +259,7 @@ Data source options of JSON can be set via:
   <tr>
     <td><code>compression</code></td>
     <td>(none)</td>
-    <td>Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate).</td>
+    <td>Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate). JSON built-in functions ignore this option.</td>
     <td>write</td>
   </tr>
   <tr>

