You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/28 07:16:06 UTC

[GitHub] [spark] itholic opened a new pull request, #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

itholic opened a new pull request, #39258:
URL: https://github.com/apache/spark/pull/39258

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   
   This PR proposes to assign name to _LEGACY_ERROR_TEMP_2149, "MALFORMED_CSV_RECORD".
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   We should assign proper name to _LEGACY_ERROR_TEMP_*
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   No
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   
   Fix UT & run `./build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite*`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

Posted by GitBox <gi...@apache.org>.
itholic commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1062224061


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -3138,13 +3141,54 @@ class CSVv1Suite extends CSVSuite {
     super
       .sparkConf
       .set(SQLConf.USE_V1_SOURCE_LIST, "csv")
+
+  private val carsFile = "test-data/cars.csv"
+
+  test("test for FAILFAST parsing mode on CSV v1") {
+    Seq(false, true).foreach { multiLine =>
+      val exception = intercept[SparkException] {
+        spark.read
+          .format("csv")
+          .option("multiLine", multiLine)
+          .options(Map("header" -> "true", "mode" -> "failfast"))
+          .load(testFile(carsFile)).collect()
+      }
+
+      checkError(
+        exception = exception.getCause.asInstanceOf[SparkException],
+        errorClass = "_LEGACY_ERROR_TEMP_2177",

Review Comment:
   I covered it as well in this PR because it's in the same error chain. it is found by: https://github.com/apache/spark/pull/39258#discussion_r1058263385.
   
   Maybe do we want to do it in a separate PR ??



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149
URL: https://github.com/apache/spark/pull/39258


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1059273190


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -851,6 +851,11 @@
       "Cannot name the managed table as <identifier>, as its associated location <location> already exists. Please pick a different table name, or remove the existing location first."
     ]
   },
+  "MALFORMED_CSV_RECORD" : {
+    "message" : [
+      "Malformed CSV record. The number of tokens <tokenLength> doesn't match the schema <schemaLength>."

Review Comment:
   This one process actual CSV records:
   ```
     private def convert(tokens: Array[String]): Option[InternalRow] = {
       if (tokens == null) {
         throw BadRecordException(
           () => getCurrentInput,
           () => None,
           QueryExecutionErrors.malformedCSVRecordError())
       }
   ```
   and `getCurrentInput` can get you a bad record. Can't it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #39258:
URL: https://github.com/apache/spark/pull/39258#issuecomment-1386734384

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

Posted by GitBox <gi...@apache.org>.
itholic commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1059240891


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -851,6 +851,11 @@
       "Cannot name the managed table as <identifier>, as its associated location <location> already exists. Please pick a different table name, or remove the existing location first."
     ]
   },
+  "MALFORMED_CSV_RECORD" : {
+    "message" : [
+      "Malformed CSV record. The number of tokens <tokenLength> doesn't match the schema <schemaLength>."

Review Comment:
   Seems like this exception is triggered when checking the schema before parsing the actual input row.
   
   Maybe should we fix the error class name and message more properly?
   
   For example:
   ```json
     "MALFORMED_CSV_SCHEMA" : {
       "message" : [
         "Malformed CSV schema. The number of tokens <tokenLength> doesn't match the schema <schemaLength>."
       ]
     }
   ```
   
   Or maybe we can show the malformed schema itself:
   ```json
     "MALFORMED_CSV_SCHEMA" : {
       "message" : [
         "Malformed CSV schema. <malformedSchema>."
       ]
     }
   ```
   
   This will display look like:
   ```
   [MALFORMED_CSV_RECORD] Malformed CSV schema: StructType(StructField(year,StringType,true),StructField(make,StringType,true),StructField(model,StringType,true),StructField(comment,StringType,true),StructField(blank,StringType,true))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

Posted by GitBox <gi...@apache.org>.
itholic commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1062222396


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -319,15 +319,17 @@ class UnivocityParser(
       throw BadRecordException(
         () => getCurrentInput,
         () => None,
-        QueryExecutionErrors.malformedCSVRecordError())
+        QueryExecutionErrors.malformedCSVRecordError(""))
     }
 
+    val currentInput = getCurrentInput

Review Comment:
   It seems like the `getCurrentInput` pops the current input and remove it from the queue or something ??
   
   (I guess that's way the previous commit was failed: https://github.com/itholic/spark/runs/10428900823)
   
   If we use `getCurrentInput` in errors directly, the current input is spent so the following code couldn't get the proper input, so `getCurrentInput` returns null which is not correct at:
   ```scala
           throw BadRecordException(
             () => getCurrentInput, () => requiredRow.headOption, badRecordException.get)
         } else {
           requiredRow
         }
   ```
   
   So, it eventually complains:
   ```
   == Results ==
   !== Correct Answer - 2 ==   == Spark Answer - 2 ==
   !struct<>                   struct<from_csv(value):struct<a:int,b:int,_unparsed:string>>
    [[2,12,null]]              [[2,12,null]]
   ![[null,null,"]]            [[null,null,null]]
   ```
   
   Maybe is there any way to keep the value in the memory even after calling `getCurrentInput` ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1058972310


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -851,6 +851,11 @@
       "Cannot name the managed table as <identifier>, as its associated location <location> already exists. Please pick a different table name, or remove the existing location first."
     ]
   },
+  "MALFORMED_CSV_RECORD" : {
+    "message" : [
+      "Malformed CSV record. The number of tokens <tokenLength> doesn't match the schema <schemaLength>."

Review Comment:
   Can we output the malformed CSV record? If an user process billions of CSV lines, how he/she should figure out which one is bad?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

Posted by GitBox <gi...@apache.org>.
itholic commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1058741274


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -851,6 +851,11 @@
       "Cannot name the managed table as <identifier>, as its associated location <location> already exists. Please pick a different table name, or remove the existing location first."
     ]
   },
+  "MALFORMED_CSV_RECORD" : {
+    "message" : [
+      "Malformed CSV record"

Review Comment:
   Thanks!
   I think we can provide the schema information, since it raised when schema mismatch.
   Just add some more informations to error message.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

Posted by GitBox <gi...@apache.org>.
itholic commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1062224061


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -3138,13 +3141,54 @@ class CSVv1Suite extends CSVSuite {
     super
       .sparkConf
       .set(SQLConf.USE_V1_SOURCE_LIST, "csv")
+
+  private val carsFile = "test-data/cars.csv"
+
+  test("test for FAILFAST parsing mode on CSV v1") {
+    Seq(false, true).foreach { multiLine =>
+      val exception = intercept[SparkException] {
+        spark.read
+          .format("csv")
+          .option("multiLine", multiLine)
+          .options(Map("header" -> "true", "mode" -> "failfast"))
+          .load(testFile(carsFile)).collect()
+      }
+
+      checkError(
+        exception = exception.getCause.asInstanceOf[SparkException],
+        errorClass = "_LEGACY_ERROR_TEMP_2177",

Review Comment:
   I covered it as well in this PR because it's in the same error chain. it is found by: https://github.com/apache/spark/pull/39258#discussion_r1058263385



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1059273564


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -370,8 +370,17 @@ abstract class CSVSuite
           .load(testFile(carsFile)).collect()
       }
 
-      assert(exception.getMessage.contains("Malformed CSV record"))
-      assert(ExceptionUtils.getRootCause(exception).isInstanceOf[RuntimeException])
+      checkError(
+        exception = exception.getCause.asInstanceOf[SparkException],
+        errorClass = "_LEGACY_ERROR_TEMP_2177",
+        parameters = Map("failFastMode" -> "FAILFAST")
+      )

Review Comment:
   You could move the check to those specific test suites.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

Posted by GitBox <gi...@apache.org>.
itholic commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1063124561


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -3138,13 +3141,54 @@ class CSVv1Suite extends CSVSuite {
     super
       .sparkConf
       .set(SQLConf.USE_V1_SOURCE_LIST, "csv")
+
+  private val carsFile = "test-data/cars.csv"
+
+  test("test for FAILFAST parsing mode on CSV v1") {
+    Seq(false, true).foreach { multiLine =>
+      val exception = intercept[SparkException] {
+        spark.read
+          .format("csv")
+          .option("multiLine", multiLine)
+          .options(Map("header" -> "true", "mode" -> "failfast"))
+          .load(testFile(carsFile)).collect()
+      }
+
+      checkError(
+        exception = exception.getCause.asInstanceOf[SparkException],
+        errorClass = "_LEGACY_ERROR_TEMP_2177",
+        parameters = Map("failFastMode" -> "FAILFAST")
+      )
+    }
+  }
 }
 
 class CSVv2Suite extends CSVSuite {
   override protected def sparkConf: SparkConf =
     super
       .sparkConf
       .set(SQLConf.USE_V1_SOURCE_LIST, "")
+
+  private val carsFile = "test-data/cars.csv"
+
+  test("test for FAILFAST parsing mode on CSV v2") {
+    Seq(false, true).foreach { multiLine =>
+      val exception = intercept[SparkException] {
+        spark.read
+          .format("csv")
+          .option("multiLine", multiLine)
+          .options(Map("header" -> "true", "mode" -> "failfast"))
+          .load(testFile(carsFile)).collect()
+      }
+
+      checkError(
+        exception = exception.getCause.asInstanceOf[SparkException],
+        errorClass = "_LEGACY_ERROR_TEMP_2064",

Review Comment:
   Yeah, it's because of the same reason with the above. Do we want to do it in separate PR and revert it from here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

Posted by GitBox <gi...@apache.org>.
itholic commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1060285608


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -851,6 +851,11 @@
       "Cannot name the managed table as <identifier>, as its associated location <location> already exists. Please pick a different table name, or remove the existing location first."
     ]
   },
+  "MALFORMED_CSV_RECORD" : {
+    "message" : [
+      "Malformed CSV record. The number of tokens <tokenLength> doesn't match the schema <schemaLength>."

Review Comment:
   Oh... I didn't know `getCurrentInput`. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

Posted by GitBox <gi...@apache.org>.
itholic commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1059243439


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -370,8 +370,17 @@ abstract class CSVSuite
           .load(testFile(carsFile)).collect()
       }
 
-      assert(exception.getMessage.contains("Malformed CSV record"))
-      assert(ExceptionUtils.getRootCause(exception).isInstanceOf[RuntimeException])
+      checkError(
+        exception = exception.getCause.asInstanceOf[SparkException],
+        errorClass = "_LEGACY_ERROR_TEMP_2177",
+        parameters = Map("failFastMode" -> "FAILFAST")
+      )

Review Comment:
   This fails on `CSVv2Suite` since it raises `_LEGACY_ERROR_TEMP_2064` instead of `_LEGACY_ERROR_TEMP_2177`.
   
   Is there any way we separate the tests between CSVv1 and CSVv2 ??



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1058263385


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -370,8 +370,11 @@ abstract class CSVSuite
           .load(testFile(carsFile)).collect()
       }
 
-      assert(exception.getMessage.contains("Malformed CSV record"))
-      assert(ExceptionUtils.getRootCause(exception).isInstanceOf[RuntimeException])
+      checkError(
+        exception = ExceptionUtils.getRootCause(exception).asInstanceOf[SparkRuntimeException],
+        errorClass = "MALFORMED_CSV_RECORD",

Review Comment:
   hmm... there is `_LEGACY_ERROR_TEMP_2177` on the error chain, should we handle it in this one together?
   
    



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1058438876


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -851,6 +851,11 @@
       "Cannot name the managed table as <identifier>, as its associated location <location> already exists. Please pick a different table name, or remove the existing location first."
     ]
   },
+  "MALFORMED_CSV_RECORD" : {
+    "message" : [
+      "Malformed CSV record"

Review Comment:
   Can we provide more info to users? At least which record is malformed, and where is the problem - schema mismatch, the converter failed, and so on.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

Posted by GitBox <gi...@apache.org>.
itholic commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1058741568


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -370,8 +370,11 @@ abstract class CSVSuite
           .load(testFile(carsFile)).collect()
       }
 
-      assert(exception.getMessage.contains("Malformed CSV record"))
-      assert(ExceptionUtils.getRootCause(exception).isInstanceOf[RuntimeException])
+      checkError(
+        exception = ExceptionUtils.getRootCause(exception).asInstanceOf[SparkRuntimeException],
+        errorClass = "MALFORMED_CSV_RECORD",

Review Comment:
   Just added test case for `_LEGACY_ERROR_TEMP_2177`.
   Maybe we can assign the proper name for it in the separate ticket



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -370,8 +370,11 @@ abstract class CSVSuite
           .load(testFile(carsFile)).collect()
       }
 
-      assert(exception.getMessage.contains("Malformed CSV record"))
-      assert(ExceptionUtils.getRootCause(exception).isInstanceOf[RuntimeException])
+      checkError(
+        exception = ExceptionUtils.getRootCause(exception).asInstanceOf[SparkRuntimeException],
+        errorClass = "MALFORMED_CSV_RECORD",

Review Comment:
   Thanks! Just added test case for `_LEGACY_ERROR_TEMP_2177`.
   Maybe we can assign the proper name for it in the separate ticket



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

Posted by GitBox <gi...@apache.org>.
itholic commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1059240891


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -851,6 +851,11 @@
       "Cannot name the managed table as <identifier>, as its associated location <location> already exists. Please pick a different table name, or remove the existing location first."
     ]
   },
+  "MALFORMED_CSV_RECORD" : {
+    "message" : [
+      "Malformed CSV record. The number of tokens <tokenLength> doesn't match the schema <schemaLength>."

Review Comment:
   Seems like this exception is triggered when checking the schema before parsing the actual input row.
   
   Maybe should we fix the error class name and message more properly?
   
   For example:
   ```json
     "MALFORMED_CSV_SCHEMA" : {
       "message" : [
         "Malformed CSV schema. The number of tokens <tokenLength> doesn't match the schema <schemaLength>."
       ]
     }
   ```
   
   Or maybe we can just show the malformed schema itself?
   ```json
     "MALFORMED_CSV_SCHEMA" : {
       "message" : [
         "Malformed CSV schema. <malformedSchema>."
       ]
     }
   ```
   
   This will display look like:
   ```
   [MALFORMED_CSV_RECORD] Malformed CSV schema: StructType(StructField(year,StringType,true),StructField(make,StringType,true),StructField(model,StringType,true),StructField(comment,StringType,true),StructField(blank,StringType,true))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #39258: [SPARK-41572][SQL] Assign name to _LEGACY_ERROR_TEMP_2149

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on code in PR #39258:
URL: https://github.com/apache/spark/pull/39258#discussion_r1062158983


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -3138,13 +3141,54 @@ class CSVv1Suite extends CSVSuite {
     super
       .sparkConf
       .set(SQLConf.USE_V1_SOURCE_LIST, "csv")
+
+  private val carsFile = "test-data/cars.csv"
+
+  test("test for FAILFAST parsing mode on CSV v1") {
+    Seq(false, true).foreach { multiLine =>
+      val exception = intercept[SparkException] {
+        spark.read
+          .format("csv")
+          .option("multiLine", multiLine)
+          .options(Map("header" -> "true", "mode" -> "failfast"))
+          .load(testFile(carsFile)).collect()
+      }
+
+      checkError(
+        exception = exception.getCause.asInstanceOf[SparkException],
+        errorClass = "_LEGACY_ERROR_TEMP_2177",

Review Comment:
   Could you explain why did you add the test for the error class in the PR. 



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -3138,13 +3141,54 @@ class CSVv1Suite extends CSVSuite {
     super
       .sparkConf
       .set(SQLConf.USE_V1_SOURCE_LIST, "csv")
+
+  private val carsFile = "test-data/cars.csv"

Review Comment:
   The same is defined in the parent class (just make it as `protected`). Please, remove it.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -319,15 +319,17 @@ class UnivocityParser(
       throw BadRecordException(
         () => getCurrentInput,
         () => None,
-        QueryExecutionErrors.malformedCSVRecordError())
+        QueryExecutionErrors.malformedCSVRecordError(""))
     }
 
+    val currentInput = getCurrentInput

Review Comment:
   It is not used in regular cases, correct? Don't think we should introduce additional overhead. Please, use `getCurrentInput` directly in errors.



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala:
##########
@@ -3138,13 +3141,54 @@ class CSVv1Suite extends CSVSuite {
     super
       .sparkConf
       .set(SQLConf.USE_V1_SOURCE_LIST, "csv")
+
+  private val carsFile = "test-data/cars.csv"
+
+  test("test for FAILFAST parsing mode on CSV v1") {
+    Seq(false, true).foreach { multiLine =>
+      val exception = intercept[SparkException] {
+        spark.read
+          .format("csv")
+          .option("multiLine", multiLine)
+          .options(Map("header" -> "true", "mode" -> "failfast"))
+          .load(testFile(carsFile)).collect()
+      }
+
+      checkError(
+        exception = exception.getCause.asInstanceOf[SparkException],
+        errorClass = "_LEGACY_ERROR_TEMP_2177",
+        parameters = Map("failFastMode" -> "FAILFAST")
+      )
+    }
+  }
 }
 
 class CSVv2Suite extends CSVSuite {
   override protected def sparkConf: SparkConf =
     super
       .sparkConf
       .set(SQLConf.USE_V1_SOURCE_LIST, "")
+
+  private val carsFile = "test-data/cars.csv"
+
+  test("test for FAILFAST parsing mode on CSV v2") {
+    Seq(false, true).foreach { multiLine =>
+      val exception = intercept[SparkException] {
+        spark.read
+          .format("csv")
+          .option("multiLine", multiLine)
+          .options(Map("header" -> "true", "mode" -> "failfast"))
+          .load(testFile(carsFile)).collect()
+      }
+
+      checkError(
+        exception = exception.getCause.asInstanceOf[SparkException],
+        errorClass = "_LEGACY_ERROR_TEMP_2064",

Review Comment:
   The same question like above. How this is related to assigning a name to `_LEGACY_ERROR_TEMP_2149`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org