Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/03/23 20:32:39 UTC

[GitHub] [spark] TonyDoen opened a new pull request #35954: [WIP][SPARK-38639] Support ignoreCorruptRecord flag to ensure querying broken sequence file table smoothly

TonyDoen opened a new pull request #35954:
URL: https://github.com/apache/spark/pull/35954


   
   ### What changes were proposed in this pull request?
   This PR adds a new configuration, "spark.sql.hive.ignoreCorruptRecord", so that users can successfully query tables containing dirty data (mixed schemas in one table).
   
   
   ### Why are the changes needed?
   There are existing flags, "spark.sql.files.ignoreCorruptFiles" and "spark.sql.files.ignoreMissingFiles", that quietly ignore attempted reads from corrupted or missing files, but queries over corrupt sequence file tables can still fail.
   
   Being able to ignore corrupt records is useful when users want to query dirty data (mixed schemas in one table) successfully.
   
   We would like to add a "spark.sql.hive.ignoreCorruptRecord" configuration to fill this gap.
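   
   The relationship between the existing file-level flags and the proposed record-level flag can be sketched as below. The first two settings already exist in Spark; the third is the configuration proposed by this PR and is subject to change until it merges:
   
   ```scala
   // Existing file-level flags (already in Spark): skip whole files that
   // are corrupted or missing, instead of failing the query.
   spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
   spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")
   
   // Proposed record-level flag from this PR (not in any release yet):
   // skip individual corrupt records within a Hive sequence file table.
   spark.conf.set("spark.sql.hive.ignoreCorruptRecord", "true")
   ```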
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, it adds a new config: "spark.sql.hive.ignoreCorruptRecord".
   
   
   ### How was this patch tested?
   Manually tested locally.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #35954: [WIP][SPARK-38639] Support ignoreCorruptRecord flag to ensure querying broken sequence file table smoothly

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #35954:
URL: https://github.com/apache/spark/pull/35954#discussion_r833859115



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
##########
@@ -745,43 +745,48 @@ case class WholeStageCodegenExec(child: SparkPlan)(val codegenStageId: Int)
 
     val durationMs = longMetric("pipelineTime")
 
+    val ignoreCorruptRecord: Boolean = conf.ignoreCorruptRecord

Review comment:
       It's a Hive-specific conf. Can we avoid changing the SQL execution core?






[GitHub] [spark] TonyDoen commented on a change in pull request #35954: [WIP][SPARK-38639] Support ignoreCorruptRecord flag to ensure querying broken sequence file table smoothly

Posted by GitBox <gi...@apache.org>.
TonyDoen commented on a change in pull request #35954:
URL: https://github.com/apache/spark/pull/35954#discussion_r833899230



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
##########
@@ -745,43 +745,48 @@ case class WholeStageCodegenExec(child: SparkPlan)(val codegenStageId: Int)
 
     val durationMs = longMetric("pipelineTime")
 
+    val ignoreCorruptRecord: Boolean = conf.ignoreCorruptRecord

Review comment:
       Thanks for reviewing. In my case, we need it to catch `java.lang.ArrayIndexOutOfBoundsException` and `java.lang.NullPointerException` when querying a sequence file table (mixed schemas in one table).
   
   It confused me too, but this solution worked.
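   
   The skip-on-corruption idea described above can be sketched, hypothetically, as a wrapper that drops any record whose decode throws one of the observed exceptions. The names below are illustrative only and are not the PR's actual code:
   
   ```scala
   // Hypothetical sketch: wrap a stream of lazy record reads so that records
   // throwing the exceptions seen on mixed-schema sequence files
   // (ArrayIndexOutOfBoundsException, NullPointerException) are dropped
   // instead of failing the whole task.
   def skipCorrupt[T](records: Iterator[() => T]): Iterator[T] =
     records.flatMap { read =>
       try Some(read())
       catch {
         case _: ArrayIndexOutOfBoundsException | _: NullPointerException =>
           None // ignore the corrupt record and keep scanning
       }
     }
   
   // Usage sketch: the second record "decodes" badly and is skipped.
   val raw = Iterator[() => Int](
     () => 1,
     () => throw new NullPointerException("corrupt record"),
     () => 3)
   val survivors = skipCorrupt(raw).toList // keeps 1 and 3
   ```
   
   Whether this kind of catch belongs in WholeStageCodegenExec or in the Hive-specific read path is exactly the review question above.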



