Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/02/22 19:51:54 UTC

[GitHub] [iceberg] shardulm94 commented on a change in pull request #2248: Spark: Fix vectorization flags

shardulm94 commented on a change in pull request #2248:
URL: https://github.com/apache/iceberg/pull/2248#discussion_r580536744



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/Spark3Util.java
##########
@@ -474,15 +475,31 @@ public static boolean isLocalityEnabled(FileIO io, String location, CaseInsensit
     return false;
   }
 
-  public static boolean isVectorizationEnabled(Map<String, String> properties, CaseInsensitiveStringMap readOptions) {
+  public static boolean isVectorizationEnabled(FileFormat fileFormat,
+                                               Map<String, String> properties,
+                                               CaseInsensitiveStringMap readOptions) {
     String batchReadsSessionConf = SparkSession.active().conf()
         .get("spark.sql.iceberg.vectorization.enabled", null);
     if (batchReadsSessionConf != null) {
       return Boolean.valueOf(batchReadsSessionConf);
     }
-    return readOptions.getBoolean(SparkReadOptions.VECTORIZATION_ENABLED,

Review comment:
      I see tradeoffs either way. I agree that the most specific value should ideally win, i.e. the read options explicitly passed to the table read. But a session conf taking higher precedence is also convenient in production: it lets you turn off vectorization for an application with a pure config change, with no code changes needed.
   
   Another option is to use a boolean `AND` between the session conf and the read option, as is already done in https://github.com/apache/iceberg/blob/91ac42174e4c535ece4e36db2cb587a23babced9/spark2/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java#L182
   It can be a little confusing here that the default of the session conf (true) is different from the default of the read option (false), but it is worth considering.
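
The two precedence schemes discussed above can be sketched side by side. This is a minimal illustration, not the actual `Spark3Util`/`IcebergSource` code; the class, method, and parameter names are made up for the example.

```
// Sketch of two ways to combine a session conf with a per-read option.
// Names are illustrative; the real code lives in Spark3Util and IcebergSource.
class VectorizationFlagSketch {

  // Scheme A (this PR): a set session conf always wins; otherwise fall
  // back to the read option, then to the read option's default.
  static boolean sessionConfWins(String sessionConf, Boolean readOption, boolean readOptionDefault) {
    if (sessionConf != null) {
      return Boolean.parseBoolean(sessionConf);
    }
    return readOption != null ? readOption : readOptionDefault;
  }

  // Scheme B (boolean AND, as in spark2's IcebergSource): both sources must
  // agree, each falling back to its own default. The confusion noted above
  // comes from the session conf defaulting to true while the read option
  // defaults to false.
  static boolean andSemantics(String sessionConf, Boolean readOption, boolean readOptionDefault) {
    boolean sessionEnabled = sessionConf == null || Boolean.parseBoolean(sessionConf);
    boolean optionEnabled = readOption != null ? readOption : readOptionDefault;
    return sessionEnabled && optionEnabled;
  }
}
```

With `sessionConf = "false"` and an explicit `readOption = true`, both schemes return false; the schemes differ in that scheme A lets a set session conf force vectorization on even when the read option says false, while scheme B never can.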

##########
File path: spark2/src/main/java/org/apache/iceberg/spark/source/Reader.java
##########
@@ -338,12 +330,50 @@ public boolean enableBatchRead() {
 
       boolean hasNoDeleteFiles = tasks().stream().noneMatch(TableScanUtil::hasDeletes);
 
+      boolean batchReadsEnabled = batchReadsEnabled(allParquetFileScanTasks, allOrcFileScanTasks);
+
       this.readUsingBatch = batchReadsEnabled && hasNoDeleteFiles && (allOrcFileScanTasks ||
           (allParquetFileScanTasks && atLeastOneColumn && onlyPrimitives));
     }
     return readUsingBatch;
   }
 
+  private boolean batchReadsEnabled(boolean isParquetOnly, boolean isOrcOnly) {
+    if (isParquetOnly) {
+      return isVectorizationEnabled(FileFormat.PARQUET);
+    } else if (isOrcOnly) {
+      return isVectorizationEnabled(FileFormat.ORC);
+    } else {
+      return false;
+    }
+  }
+
+  public boolean isVectorizationEnabled(FileFormat fileFormat) {

Review comment:
      Agreed, it may also be good to factor the Iceberg session confs out into a class of their own, along with their defaults. We have three right now:
   ```
   spark.sql.iceberg.vectorization.enabled
   spark.sql.iceberg.check-ordering
   spark.sql.iceberg.check-nullability
   ```
   
   It is probably also worth adding these configs to the documentation.
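
A factored-out confs class along the lines suggested might look like the sketch below. The class name, the `resolve` helper, and the defaults are assumptions for illustration; only the three conf keys come from the comment above.

```
import java.util.Map;

// Hypothetical holder for Iceberg's Spark session confs. Key strings are
// from the review comment; everything else here is an assumption.
class SparkSessionConfs {
  private SparkSessionConfs() {
  }

  static final String VECTORIZATION_ENABLED = "spark.sql.iceberg.vectorization.enabled";
  static final String CHECK_ORDERING = "spark.sql.iceberg.check-ordering";
  static final String CHECK_NULLABILITY = "spark.sql.iceberg.check-nullability";

  // Resolve a boolean conf from a key/value view of the session conf,
  // falling back to the given default when the key is unset.
  static boolean resolve(Map<String, String> conf, String key, boolean defaultValue) {
    String value = conf.get(key);
    return value != null ? Boolean.parseBoolean(value) : defaultValue;
  }
}
```

Centralizing the keys and their defaults in one place would also make them easy to enumerate when documenting them.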




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org