Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/02/18 00:01:36 UTC

[GitHub] [iceberg] shardulm94 commented on a change in pull request #2248: Spark: Fix isVectorizationEnabled in Spark3Util

shardulm94 commented on a change in pull request #2248:
URL: https://github.com/apache/iceberg/pull/2248#discussion_r578029296



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/Spark3Util.java
##########
@@ -474,15 +475,28 @@ public static boolean isLocalityEnabled(FileIO io, String location, CaseInsensit
     return false;
   }
 
-  public static boolean isVectorizationEnabled(Map<String, String> properties, CaseInsensitiveStringMap readOptions) {
+  public static boolean isVectorizationEnabled(FileFormat fileFormat,

Review comment:
       Seems internal to me too. I would be +1 on changing it. I miss Scala's package-scope access modifiers in such cases.
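
For illustration, a minimal sketch of the visibility change being discussed, assuming the only callers live in `org.apache.iceberg.spark`. Dropping `public` gives Java's package-private access, the closest analogue to Scala's `private[spark]`. The body is stubbed so the sketch compiles; this is not the PR's actual change:

```java
package org.apache.iceberg.spark;

import java.util.Map;

import org.apache.iceberg.FileFormat;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

public class Spark3Util {
  // No access modifier = package-private: only classes in
  // org.apache.iceberg.spark (e.g. the Spark reader builder) can call this,
  // which is roughly what Scala's private[spark] would express.
  static boolean isVectorizationEnabled(FileFormat fileFormat,
                                        Map<String, String> properties,
                                        CaseInsensitiveStringMap readOptions) {
    // Body unchanged from the diff above; stubbed here so the sketch compiles.
    return false;
  }
}
```

One consequence to keep in mind: callers in other packages, such as tests in a different module, would then need to move into the same package or go through a public entry point.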

##########
File path: spark3/src/main/java/org/apache/iceberg/spark/Spark3Util.java
##########
@@ -474,15 +475,28 @@ public static boolean isLocalityEnabled(FileIO io, String location, CaseInsensit
     return false;
   }
 
-  public static boolean isVectorizationEnabled(Map<String, String> properties, CaseInsensitiveStringMap readOptions) {
+  public static boolean isVectorizationEnabled(FileFormat fileFormat,
+                                               Map<String, String> properties,
+                                               CaseInsensitiveStringMap readOptions) {
     String batchReadsSessionConf = SparkSession.active().conf()
         .get("spark.sql.iceberg.vectorization.enabled", null);
     if (batchReadsSessionConf != null) {
       return Boolean.valueOf(batchReadsSessionConf);
     }
-    return readOptions.getBoolean(SparkReadOptions.VECTORIZATION_ENABLED,
-        PropertyUtil.propertyAsBoolean(properties,
-            TableProperties.PARQUET_VECTORIZATION_ENABLED, TableProperties.PARQUET_VECTORIZATION_ENABLED_DEFAULT));
+
+    switch (fileFormat) {
+      case PARQUET:
+        boolean defaultValue = PropertyUtil.propertyAsBoolean(
+            properties,
+            TableProperties.PARQUET_VECTORIZATION_ENABLED,
+            TableProperties.PARQUET_VECTORIZATION_ENABLED_DEFAULT);
+        return readOptions.getBoolean(SparkReadOptions.VECTORIZATION_ENABLED, defaultValue);
+      case ORC:
+        // TODO: support a table property to enable/disable vectorized reads in ORC
+        return readOptions.getBoolean(SparkReadOptions.VECTORIZATION_ENABLED, true);

Review comment:
       We can add one. My initial plan was to remove the Parquet-specific table property and have a single generic table property for vectorization instead, but I forgot about it. We pass it as a datasource option because we have another layer above Iceberg where we set that.
   
   The ORC vectorized reader supports all data types, so I don't see an issue with vectorization being enabled by default. It won't work with delete files, but I think we have checks for that elsewhere. Maybe we can keep it disabled by default for backwards compatibility?
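
For illustration only, a sketch of what an ORC table property could look like if one is added, keeping vectorization disabled by default for backwards compatibility as suggested. The property name, its default, and the helper class are assumptions, not the PR's actual change; the resolution order mirrors the Parquet branch in the diff (read option wins, then table property), and the session conf check in the existing method would still take precedence:

```java
package org.apache.iceberg.spark;

import java.util.Map;

import org.apache.iceberg.util.PropertyUtil;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

class OrcVectorizationDefaults {
  // Hypothetical property mirroring the Parquet one; the name and default are
  // illustrative only. `false` keeps ORC vectorization off unless opted into,
  // matching the backwards-compatibility suggestion above.
  static final String ORC_VECTORIZATION_ENABLED = "read.orc.vectorization.enabled";
  static final boolean ORC_VECTORIZATION_ENABLED_DEFAULT = false;

  // Same resolution order as the Parquet branch: the datasource read option
  // wins, otherwise fall back to the table property (or its default).
  static boolean isOrcVectorizationEnabled(Map<String, String> properties,
                                           CaseInsensitiveStringMap readOptions) {
    boolean tableDefault = PropertyUtil.propertyAsBoolean(
        properties, ORC_VECTORIZATION_ENABLED, ORC_VECTORIZATION_ENABLED_DEFAULT);
    return readOptions.getBoolean(SparkReadOptions.VECTORIZATION_ENABLED, tableDefault);
  }
}
```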



