Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/03/14 06:34:26 UTC

[GitHub] [spark] MaxGekk opened a new pull request #35844: [SPARK-38523][SQL][3.2] Fix referring to the corrupt record column from CSV

MaxGekk opened a new pull request #35844:
URL: https://github.com/apache/spark/pull/35844


   ### What changes were proposed in this pull request?
   When a user specifies the corrupt record column via the CSV option `columnNameOfCorruptRecord`:
   1. Disable the column pruning feature in the CSV parser.
   2. Don't push filters that refer to the "virtual" column `columnNameOfCorruptRecord` down to `UnivocityParser`. Since the column cannot be present in the input CSV, user queries failed while compiling predicates. After the changes, the skipped filters are applied later, in the upper layer.
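   
   The filter handling in item 2 can be sketched roughly as follows. This is a simplified, hypothetical model and not the code from this PR: `SketchFilter`, `EqualsFilter`, `NotNullFilter`, `AndFilter` and `FilterSplit` are illustrative names that only mimic the idea of a filter exposing the columns it references. Column names are compared exactly here, which is also an assumption.
   
   ```Scala
   // Hypothetical, simplified filter model (not Spark's real classes).
   sealed trait SketchFilter { def references: Set[String] }
   final case class EqualsFilter(column: String, value: Any) extends SketchFilter {
     def references: Set[String] = Set(column)
   }
   final case class NotNullFilter(column: String) extends SketchFilter {
     def references: Set[String] = Set(column)
   }
   final case class AndFilter(left: SketchFilter, right: SketchFilter) extends SketchFilter {
     def references: Set[String] = left.references ++ right.references
   }
   
   object FilterSplit {
     // Returns (pushable, deferred): filters that mention the corrupt record
     // column are kept out of the parser and left for the upper layer.
     def split(
         filters: Seq[SketchFilter],
         corruptColumn: String): (Seq[SketchFilter], Seq[SketchFilter]) =
       filters.partition(f => !f.references.contains(corruptColumn))
   
     def main(args: Array[String]): Unit = {
       val filters = Seq(EqualsFilter("a", 0), NotNullFilter("corrRec"))
       val (pushed, deferred) = split(filters, "corrRec")
       println(s"pushed=$pushed deferred=$deferred")
     }
   }
   ```
   
   A compound filter is deferred as a whole if any of its parts touches the corrupt record column, since the parser cannot evaluate that part against the raw CSV fields.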
   
   ### Why are the changes needed?
   The changes allow users to refer to the corrupt record column in their queries:
   
   ```Scala
   spark.read.format("csv")
     .option("header", "true")
     .option("columnNameOfCorruptRecord", "corrRec")
     .schema(schema)
     .load("csv_corrupt_record.csv")
     .filter($"corrRec".isNotNull)
     .show()
   ```
   for the input file "csv_corrupt_record.csv":
   ```
   0,2013-111_11 12:13:14
   1,1983-08-04 
   ```
   the query returns:
   ```
   +---+----+----------------------+
   |a  |b   |corrRec               |
   +---+----+----------------------+
   |0  |null|0,2013-111_11 12:13:14|
   +---+----+----------------------+
   ```
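   
   The example leaves `schema` undefined. A minimal sketch of one schema that would be consistent with the output above is shown below; the column types are assumptions inferred from the sample rows, not taken from the original description.
   
   ```Scala
   import org.apache.spark.sql.types.{DateType, IntegerType, StringType, StructType}
   
   // Assumed column types; the description does not show how `schema` is built.
   val schema = new StructType()
     .add("a", IntegerType)
     .add("b", DateType)          // assumption: "1983-08-04" parses as a date, "2013-111_11 12:13:14" does not
     .add("corrRec", StringType)  // the corrupt record column is a nullable string field
   ```
   
   The field name of the last column has to match the value passed to `columnNameOfCorruptRecord`.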
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. Before the changes, the query above fails with the exception:
   ```Java
   java.lang.IllegalArgumentException: _corrupt_record does not exist. Available: a, b
   	at org.apache.spark.sql.types.StructType.$anonfun$fieldIndex$1(StructType.scala:310) ~[classes/:?]
   ```
   
   ### How was this patch tested?
   By running the new CSV test in the following suites:
   ```
   $ build/sbt "sql/testOnly *.CSVv1Suite"
   $ build/sbt "sql/testOnly *.CSVv2Suite"
   $ build/sbt "sql/testOnly *.CSVLegacyTimeParserSuite"
   ```
   
   Authored-by: Max Gekk <ma...@gmail.com>
   Signed-off-by: Wenchen Fan <we...@databricks.com>
   (cherry picked from commit 959694271e30879c944d7fd5de2740571012460a)
   Signed-off-by: Max Gekk <ma...@gmail.com>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] MaxGekk commented on pull request #35844: [SPARK-38523][SQL][3.2] Fix referring to the corrupt record column from CSV

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on pull request #35844:
URL: https://github.com/apache/spark/pull/35844#issuecomment-1066553317


   Merging to 3.2. Thank you, @cloud-fan and @HyukjinKwon for review.



[GitHub] [spark] MaxGekk closed pull request #35844: [SPARK-38523][SQL][3.2] Fix referring to the corrupt record column from CSV

Posted by GitBox <gi...@apache.org>.

MaxGekk closed pull request #35844:
URL: https://github.com/apache/spark/pull/35844


   

