
Posted to reviews@spark.apache.org by "MaxGekk (via GitHub)" <gi...@apache.org> on 2024/01/24 17:10:02 UTC

[PR] [WIP][SQL] Disable CSV column pruning in the multi-line mode [spark]

MaxGekk opened a new pull request, #44872:
URL: https://github.com/apache/spark/pull/44872

### What changes were proposed in this pull request?

### Why are the changes needed?
To work around an issue in the `uniVocity` parser used by the CSV datasource: https://github.com/uniVocity/univocity-parsers/issues/529

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly *CSVv1Suite"
$ build/sbt "test:testOnly *CSVv2Suite"
$ build/sbt "test:testOnly *CSVLegacyTimeParserSuite"
$ build/sbt "testOnly *.CsvFunctionsSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #44872:
URL: https://github.com/apache/spark/pull/44872#issuecomment-1910339447

   do we have a test to show the data correctness issue?



Re: [PR] [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #44872:
URL: https://github.com/apache/spark/pull/44872#discussion_r1468485189


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVPartitionReaderFactory.scala:
##########
@@ -58,7 +58,7 @@ case class CSVPartitionReaderFactory(
       actualReadDataSchema,
       options,
       filters)
-    val schema = if (options.columnPruning) actualReadDataSchema else actualDataSchema
+    val schema = if (options.isColumnPruningEnabled) actualReadDataSchema else actualDataSchema

Review Comment:
   @LuciferYang Actually, you are right. Please review this follow-up PR: https://github.com/apache/spark/pull/44910




Re: [PR] [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on code in PR #44872:
URL: https://github.com/apache/spark/pull/44872#discussion_r1467390850


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -3212,6 +3215,26 @@ abstract class CSVSuite
     assert(CSVOptions.getAlternativeOption("codec").contains("compression"))
     assert(CSVOptions.getAlternativeOption("preferDate").isEmpty)
   }
+
+  test("SPARK-46862: column pruning in the multi-line mode") {
+    val data =
+      """"jobID","Name","City","Active"
+        |"1","DE","","Yes"
+        |"5",",","",","
+        |"3","SA","","No"
+        |"10","abcd""efgh"" \ndef","",""
+        |"8","SE","","No"""".stripMargin
+
+    withTempPath { path =>
+      Files.write(path.toPath, data.getBytes(StandardCharsets.UTF_8))
+      val df = spark.read
+        .option("multiline", "true")

Review Comment:
   `.option("multiLine", "true")`?




Re: [PR] [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on code in PR #44872:
URL: https://github.com/apache/spark/pull/44872#discussion_r1467380240


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVPartitionReaderFactory.scala:
##########
@@ -58,7 +58,7 @@ case class CSVPartitionReaderFactory(
       actualReadDataSchema,
       options,
       filters)
-    val schema = if (options.columnPruning) actualReadDataSchema else actualDataSchema
+    val schema = if (options.isColumnPruningEnabled) actualReadDataSchema else actualDataSchema

Review Comment:
   https://github.com/apache/spark/blob/829e742df8251c6f5e965cb08ad454ac3ee1a389/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L103
   
   https://github.com/apache/spark/blob/829e742df8251c6f5e965cb08ad454ac3ee1a389/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L128
   
   @MaxGekk Should the check in `CSVFileFormat` be changed to `parsedOptions.isColumnPruningEnabled` too?
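
For readers following the thread, the distinction behind this question can be sketched as follows. This is a minimal illustration, not Spark's code: `CsvOptionsSketch` and `PruningSketch` are hypothetical names, and the rule that the effective flag is `columnPruning && !multiLine` is an assumption inferred from the PR title rather than something quoted in this thread.

```scala
// Hypothetical stand-in for Spark's CSVOptions; only the two flags discussed here.
final case class CsvOptionsSketch(columnPruning: Boolean, multiLine: Boolean) {
  // Assumed rule (inferred from the PR title): pruning is effective only
  // when it is requested AND the reader is not in multi-line mode.
  def isColumnPruningEnabled: Boolean = columnPruning && !multiLine
}

object PruningSketch {
  // Picks the schema a reader would hand to the parser: the pruned
  // (required) columns when pruning is effective, otherwise all columns.
  def parserSchema(
      opts: CsvOptionsSketch,
      dataSchema: Seq[String],
      requiredSchema: Seq[String]): Seq[String] =
    if (opts.isColumnPruningEnabled) requiredSchema else dataSchema
}
```

Under that assumption, a call site that still checks the raw `columnPruning` flag could pick the pruned schema in multi-line mode while other call sites pick the full one, which is the kind of inconsistency the question points at.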
   
   




Re: [PR] [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on PR #44872:
URL: https://github.com/apache/spark/pull/44872#issuecomment-1910670901

   > do we have a test to show the data correctness issue?
   
   @cloud-fan I added a test for the issue.
   



Re: [PR] [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #44872:
URL: https://github.com/apache/spark/pull/44872#discussion_r1467408202


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVPartitionReaderFactory.scala:
##########
@@ -58,7 +58,7 @@ case class CSVPartitionReaderFactory(
       actualReadDataSchema,
       options,
       filters)
-    val schema = if (options.columnPruning) actualReadDataSchema else actualDataSchema
+    val schema = if (options.isColumnPruningEnabled) actualReadDataSchema else actualDataSchema

Review Comment:
   The `schema` is used only in `CSVHeaderChecker`, which is supposed to check the column names in the CSV header against the fields of the provided schema. From my point of view, it shouldn't depend on the column pruning feature at all.
   
   ```scala
     private def checkHeaderColumnNames(columnNames: Array[String]): Unit = {
   ...
         if (headerLen == schemaSize) {
   ...
         } else {
           errorMessage = Some(
             s"""|Number of column in CSV header is not equal to number of fields in the schema:
                 | Header length: $headerLen, schema size: $schemaSize
                 |$source""".stripMargin)
         }
   ```
   
   `schemaSize` must be the size of the **full data schema** of the CSV files, not of the required schema.
   
   Let me rethink it and remove the dependency on column pruning from `CSVHeaderChecker`.
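
The point about `schemaSize` can be illustrated with a small, self-contained sketch. It is not the `CSVHeaderChecker` implementation: `HeaderCheckSketch` and `checkHeaderLength` are hypothetical names, and only the length comparison from the excerpt above is modelled (the name-by-name comparison is left out).

```scala
object HeaderCheckSketch {
  // Hypothetical helper: returns an error message when the header width
  // differs from the schema width, and None when the two widths agree.
  def checkHeaderLength(header: Seq[String], schema: Seq[String]): Option[String] =
    if (header.length == schema.length) None
    else Some(s"Header length: ${header.length}, schema size: ${schema.length}")
}
```

With a four-column header such as `jobID, Name, City, Active`, passing the full four-field schema yields `None`, whereas passing a pruned one-field schema yields an error even though the file itself is fine. That is why the check should see the full data schema regardless of pruning.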




Re: [PR] [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.

nchammas commented on PR #44872:
URL: https://github.com/apache/spark/pull/44872#issuecomment-1965720034

   I've filed SPARK-47180 to track potentially migrating off of Univocity to something else.




Re: [PR] [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on code in PR #44872:
URL: https://github.com/apache/spark/pull/44872#discussion_r1467410924


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -3212,6 +3215,26 @@ abstract class CSVSuite
     assert(CSVOptions.getAlternativeOption("codec").contains("compression"))
     assert(CSVOptions.getAlternativeOption("preferDate").isEmpty)
   }
+
+  test("SPARK-46862: column pruning in the multi-line mode") {
+    val data =
+      """"jobID","Name","City","Active"
+        |"1","DE","","Yes"
+        |"5",",","",","
+        |"3","SA","","No"
+        |"10","abcd""efgh"" \ndef","",""
+        |"8","SE","","No"""".stripMargin
+
+    withTempPath { path =>
+      Files.write(path.toPath, data.getBytes(StandardCharsets.UTF_8))
+      val df = spark.read
+        .option("multiline", "true")

Review Comment:
   Thanks




Re: [PR] [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.

nchammas commented on PR #44872:
URL: https://github.com/apache/spark/pull/44872#issuecomment-1952783726

   There are a bunch of libraries [listed here][1], but I don't have experience with any of them.
   
   [jackson-dataformats-text][2] looks interesting. I know we already use FasterXML to parse JSON. Perhaps we should use them to parse CSV as well.
   
   [1]:  https://mvnrepository.com/open-source/csv-libraries
   [2]: https://github.com/FasterXML/jackson-dataformats-text



Re: [PR] [WIP][SQL] Disable CSV column pruning in the multi-line mode [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on PR #44872:
URL: https://github.com/apache/spark/pull/44872#issuecomment-1909652504

   cc @cloud-fan 



Re: [PR] [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on PR #44872:
URL: https://github.com/apache/spark/pull/44872#issuecomment-1911546444

   @cloud-fan @HyukjinKwon FYI, since 3.5.x and 3.4.x suffer from the same issue, I am going to backport this to those branches. Thanks for the review.



Re: [PR] [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.

nchammas commented on PR #44872:
URL: https://github.com/apache/spark/pull/44872#issuecomment-1952749131

   > To work around an issue in the `uniVocity` parser used by the CSV datasource: [uniVocity/univocity-parsers#529](https://github.com/uniVocity/univocity-parsers/issues/529)
   
   A bit off-topic for this PR, but is uniVocity even maintained anymore?
   
   - The last release was [more than 3 years ago][1].
   - The last commit to `master` was [almost 3 years ago][2].
   - The website is [down][3].
   - There are [multiple][4] [open][5] [bugs][6] on the tracker with no indication that anyone cares.
   
   [1]: https://github.com/uniVocity/univocity-parsers/releases
   [2]: https://github.com/uniVocity/univocity-parsers/commits/master/
   [3]: https://github.com/uniVocity/univocity-parsers/issues/506
   [4]: https://github.com/uniVocity/univocity-parsers/issues/494
   [5]: https://github.com/uniVocity/univocity-parsers/issues/495
   [6]: https://github.com/uniVocity/univocity-parsers/issues/499



Re: [PR] [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk closed pull request #44872: [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode
URL: https://github.com/apache/spark/pull/44872




Re: [PR] [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #44872:
URL: https://github.com/apache/spark/pull/44872#issuecomment-1952764830

   Are there any other popular Java libraries for parsing CSV?



Re: [PR] [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #44872:
URL: https://github.com/apache/spark/pull/44872#discussion_r1466709296


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -3212,6 +3215,26 @@ abstract class CSVSuite
     assert(CSVOptions.getAlternativeOption("codec").contains("compression"))
     assert(CSVOptions.getAlternativeOption("preferDate").isEmpty)
   }
+
+  test("SPARK-46862: column pruning in the multi-line mode") {
+    val data =
+      """"jobID","Name","City","Active"
+        |"1","DE","","Yes"
+        |"5",",","",","
+        |"3","SA","","No"
+        |"10","abcd""efgh"" \ndef","",""
+        |"8","SE","","No"""".stripMargin
+
+    withTempPath { path =>
+      Files.write(path.toPath, data.getBytes(StandardCharsets.UTF_8))
+      val df = spark.read
+        .option("multiline", "true")
+        .option("header", "true")
+        .option("escape", "\"")
+        .csv(path.getCanonicalPath)
+      assert(df.count() === 5)

Review Comment:
   `count()` returns 4 without the changes.




Re: [PR] [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on PR #44872:
URL: https://github.com/apache/spark/pull/44872#issuecomment-1911633316

   Merging to master/3.5/3.4. Thank you, @HyukjinKwon @cloud-fan, for the review.



Re: [PR] [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.

MaxGekk commented on code in PR #44872:
URL: https://github.com/apache/spark/pull/44872#discussion_r1467393979


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -3212,6 +3215,26 @@ abstract class CSVSuite
     assert(CSVOptions.getAlternativeOption("codec").contains("compression"))
     assert(CSVOptions.getAlternativeOption("preferDate").isEmpty)
   }
+
+  test("SPARK-46862: column pruning in the multi-line mode") {
+    val data =
+      """"jobID","Name","City","Active"
+        |"1","DE","","Yes"
+        |"5",",","",","
+        |"3","SA","","No"
+        |"10","abcd""efgh"" \ndef","",""
+        |"8","SE","","No"""".stripMargin
+
+    withTempPath { path =>
+      Files.write(path.toPath, data.getBytes(StandardCharsets.UTF_8))
+      val df = spark.read
+        .option("multiline", "true")

Review Comment:
   The options are case-insensitive.
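
As a rough illustration of why `"multiline"` and `"multiLine"` resolve to the same option, here is a minimal sketch of a case-insensitive lookup. `CaseInsensitiveOptions` is a hypothetical class written for this example; Spark uses its own case-insensitive map utility, whose details are not shown in this thread.

```scala
import java.util.Locale

// Hypothetical illustration of case-insensitive option lookup.
final class CaseInsensitiveOptions(raw: Map[String, String]) {
  // Keys are normalized once, so lookups ignore the caller's casing.
  private val normalized: Map[String, String] =
    raw.map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v }

  def get(key: String): Option[String] =
    normalized.get(key.toLowerCase(Locale.ROOT))
}
```

With options built from `Map("multiline" -> "true")`, looking up `"multiLine"` or `"MULTILINE"` finds the same value, so the spelling in the test does not change its behaviour.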



