
Posted to reviews@spark.apache.org by "MaxGekk (via GitHub)" <gi...@apache.org> on 2024/02/01 18:41:06 UTC

Re: [PR] [SPARK-46890][SQL] Fix CSV parsing bug with existence default values and column pruning [spark]

MaxGekk commented on code in PR #44939:
URL: https://github.com/apache/spark/pull/44939#discussion_r1474938302


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala:
##########
@@ -105,8 +105,6 @@ class CSVFileFormat extends TextBasedFileFormat with DataSourceRegister {
       sparkSession.sessionState.conf.csvColumnPruning,
       sparkSession.sessionState.conf.sessionLocalTimeZone,
       sparkSession.sessionState.conf.columnNameOfCorruptRecord)
-    val isColumnPruningEnabled = parsedOptions.isColumnPruningEnabled

Review Comment:
   This value is stored in a `val` so that it can be transferred (in serializable form) from the driver to the executors, because `parameters` cannot be accessed on executors.
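
To illustrate the point about capturing the flag on the driver, here is a minimal sketch that does not use Spark at all. `PruningSnapshot`, `buildParser`, and the `required` index set are hypothetical names invented for this example, and the comma split is only a stand-in for the real CSV parser. The idea is that the flag is read once into a local `val` while `parameters` is still available, so the function that later runs elsewhere never touches `parameters`.

```scala
object PruningSnapshot {
  // Hypothetical helper, not part of Spark: builds a per-line parser on the "driver" side.
  def buildParser(parameters: Map[String, String], required: Set[Int]): String => Seq[String] = {
    // Snapshot the flag while `parameters` is still populated.
    val isColumnPruningEnabled = parameters.getOrElse("columnPruning", "true").toBoolean

    // The returned closure captures only the Boolean and the index set, never `parameters`.
    line => {
      val tokens = line.split(",", -1).toSeq
      if (isColumnPruningEnabled) tokens.zipWithIndex.collect { case (t, i) if required(i) => t }
      else tokens
    }
  }
}
```

With this shape, anything shipped to another process only needs the captured Boolean, which is trivially serializable.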



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala:
##########
@@ -278,13 +280,28 @@ class CSVOptions(
     .getOrElse(UNESCAPED_QUOTE_HANDLING, "STOP_AT_DELIMITER").toUpperCase(Locale.ROOT))
 
   /**
+   * Returns true if column pruning is enabled and there are no existence column default values in
+   * the [[schema]].
+   *
    * The column pruning feature can be enabled either via the CSV option `columnPruning` or
    * in non-multiline mode via initialization of CSV options by the SQL config:
    * `spark.sql.csv.parser.columnPruning.enabled`.
    * The feature is disabled in the `multiLine` mode because of the issue:
    * https://github.com/uniVocity/univocity-parsers/issues/529
+   *
+   * We disable column pruning when there are any column defaults, instead preferring to reach in
+   * each row and then post-process it to substitute the default values after.
    */
-  val isColumnPruningEnabled: Boolean = getBool(COLUMN_PRUNING, !multiLine && columnPruning)
+  def isColumnPruningEnabled(schema: StructType): Boolean = {
+    var result = !multiLine && columnPruning
+    if (parameters != null) {

Review Comment:
   We shouldn't even try to access `parameters` when it is `null` because:
   1. We never pass `null` as `parameters`.
   2. `parameters` is `@transient`, so it is not serialized and therefore not transferred to executors.
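
The effect of `@transient` described above can be reproduced with plain Java serialization, which is a rough stand-in for shipping an object from the driver to an executor. This is only a sketch: `DemoOptions` and `TransientDemo` are hypothetical names, not the real `CSVOptions`, and the option key is chosen just for illustration.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for an options class; not the real CSVOptions.
class DemoOptions(@transient val parameters: Map[String, String]) extends Serializable {
  // Evaluated eagerly in the constructor, so the result is an ordinary serialized field.
  val columnPruning: Boolean = parameters.getOrElse("columnPruning", "true").toBoolean
}

object TransientDemo {
  // Serializes and deserializes a value in memory, mimicking a driver-to-executor hop.
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

After `TransientDemo.roundTrip(new DemoOptions(Map("columnPruning" -> "false")))`, the copy's `parameters` field is `null` while `columnPruning` keeps its value, so any method that reads `parameters` lazily would only work on the original instance.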



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

