You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "johanl-db (via GitHub)" <gi...@apache.org> on 2023/12/27 17:08:17 UTC

[PR] [SPARK-40876][SQL] Widening type promotion for decimals with larger scale in Parquet readers [spark]

johanl-db opened a new pull request, #44513:
URL: https://github.com/apache/spark/pull/44513

   ### What changes were proposed in this pull request?
   This is a follow-up from https://github.com/apache/spark/pull/44368 implementing an additional type promotion to decimals with larger precision and scale.
   As long as the precision increases by at least as much as the scale, the decimal values can be promoted without loss of precision: Decimal(6, 2) -> Decimal(8, 4):  1234.56 -> 1234.5600.
   
   The non-vectorized reader (parquet-mr) is already able to do this type promotion, this PR implements it for the vectorized reader.
   
   ### Why are the changes needed?
   This allows reading multiple parquet files that contain decimal with different precision/scales
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, the following now succeeds when using the vectorized Parquet reader:
   ```
     Seq(20).toDF($"a".cast(DecimalType(4, 2))).write.parquet(path)
     spark.read.schema("a decimal(6, 4)").parquet(path).collect()
   ```
   It failed before with the vectorized reader and succeeded with the non-vectorized reader.
   
   ### How was this patch tested?
   - Tests added to `ParquetWideningTypeSuite` to cover decimal promotion between decimals with different physical types: INT32, INT64, FIXED_LEN_BYTE_ARRAY.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-40876][SQL] Widening type promotion for decimals with larger scale in Parquet readers [spark]

Posted by "johanl-db (via GitHub)" <gi...@apache.org>.

johanl-db commented on PR #44513:
URL: https://github.com/apache/spark/pull/44513#issuecomment-1871326044

   > ```
   >  - SPARK-34212 Parquet should read decimals correctly *** FAILED *** (364 milliseconds)
   > [info]   Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown (ParquetQuerySuite.scala:1055)
   > [info]   org.scalatest.exceptions.TestFailedException:
   > [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
   > [info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
   > [info]   at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1564)
   > [info]   at org.scalatest.Assertions.intercept(Assertions.scala:766)
   > [info]   at org.scalatest.Assertions.intercept$(Assertions.scala:746)
   > [info]   at org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1564)
   > [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.$anonfun$new$213(ParquetQuerySuite.scala:1055)
   > [info]   at scala.collection.immutable.List.foreach(List.scala:333)
   > [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.$anonfun$new$211(ParquetQuerySuite.scala:1054)
   > [info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
   > [info]   at org.apache.spark.sql.catalyst.SQLConfHelper.withSQLConf(SQLConfHelper.scala:56)
   > [info]   at org.apache.spark.sql.catalyst.SQLConfHelper.withSQLConf$(SQLConfHelper.scala:38)
   > [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(ParquetQuerySuite.scala:47)
   > ```
   > 
   > The failed test case seems to be related to this PR, could you check it? @johanl-db
   
   I fixed it, the last check timed out on another unrelated test though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-40876][SQL] Widening type promotion for decimals with larger scale in Parquet readers [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #44513:
URL: https://github.com/apache/spark/pull/44513#discussion_r1444301954


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java:
##########
@@ -1444,14 +1641,29 @@ private static boolean isDateTypeMatched(ColumnDescriptor descriptor) {
   }
 
   private static boolean isDecimalTypeMatched(ColumnDescriptor descriptor, DataType dt) {

Review Comment:
   I think the name should be `isDecimalTypeCompatible`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-40876][SQL] Widening type promotion for decimals with larger scale in Parquet readers [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on PR #44513:
URL: https://github.com/apache/spark/pull/44513#issuecomment-1870922213

   ```
    - SPARK-34212 Parquet should read decimals correctly *** FAILED *** (364 milliseconds)
   [info]   Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown (ParquetQuerySuite.scala:1055)
   [info]   org.scalatest.exceptions.TestFailedException:
   [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
   [info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
   [info]   at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1564)
   [info]   at org.scalatest.Assertions.intercept(Assertions.scala:766)
   [info]   at org.scalatest.Assertions.intercept$(Assertions.scala:746)
   [info]   at org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1564)
   [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.$anonfun$new$213(ParquetQuerySuite.scala:1055)
   [info]   at scala.collection.immutable.List.foreach(List.scala:333)
   [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.$anonfun$new$211(ParquetQuerySuite.scala:1054)
   [info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
   [info]   at org.apache.spark.sql.catalyst.SQLConfHelper.withSQLConf(SQLConfHelper.scala:56)
   [info]   at org.apache.spark.sql.catalyst.SQLConfHelper.withSQLConf$(SQLConfHelper.scala:38)
   [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(ParquetQuerySuite.scala:47)
   ```
   
   The failed test case seems to be related to this PR, could you check it? @johanl-db 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-40876][SQL] Widening type promotion for decimals with larger scale in Parquet readers [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #44513:
URL: https://github.com/apache/spark/pull/44513#discussion_r1444301267


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java:
##########
@@ -1418,16 +1610,21 @@ private SchemaColumnConvertNotSupportedException constructConvertNotSupportedExc
 
   private static boolean canReadAsIntDecimal(ColumnDescriptor descriptor, DataType dt) {
     if (!DecimalType.is32BitDecimalType(dt)) return false;
-    return isDecimalTypeMatched(descriptor, dt);
+    return isDecimalTypeMatched(descriptor, dt) && isSameDecimalScale(descriptor, dt);

Review Comment:
   is this extra restriction a bug fix?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-40876][SQL] Widening type promotion for decimals with larger scale in Parquet readers [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #44513:
URL: https://github.com/apache/spark/pull/44513#discussion_r1444305951


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java:
##########
@@ -1358,6 +1368,188 @@ public void decodeSingleDictionaryId(
     }
   }
 
+  private abstract static class DecimalUpdater implements ParquetVectorUpdater {
+
+    private final DecimalType sparkType;
+
+    DecimalUpdater(DecimalType sparkType) {
+      this.sparkType = sparkType;
+    }
+
+    @Override
+    public void readValues(
+        int total,
+        int offset,
+        WritableColumnVector values,
+        VectorizedValuesReader valuesReader) {
+      for (int i = 0; i < total; i++) {
+        readValue(offset + i, values, valuesReader);
+      }
+    }
+
+    protected void writeDecimal(int offset, WritableColumnVector values, BigDecimal decimal) {
+      BigDecimal scaledDecimal = decimal.setScale(sparkType.scale(), RoundingMode.UNNECESSARY);
+      if (DecimalType.is32BitDecimalType(sparkType)) {
+        values.putInt(offset, scaledDecimal.unscaledValue().intValue());
+      } else if (DecimalType.is64BitDecimalType(sparkType)) {
+        values.putLong(offset, scaledDecimal.unscaledValue().longValue());
+      } else {
+        values.putByteArray(offset, scaledDecimal.unscaledValue().toByteArray());
+      }
+    }
+  }
+
+  private static class IntegerToDecimalUpdater extends DecimalUpdater {
+    private final int parquetScale;
+
+    IntegerToDecimalUpdater(ColumnDescriptor descriptor, DecimalType sparkType) {
+      super(sparkType);
+      LogicalTypeAnnotation typeAnnotation =
+        descriptor.getPrimitiveType().getLogicalTypeAnnotation();
+      this.parquetScale = ((DecimalLogicalTypeAnnotation) typeAnnotation).getScale();
+    }
+
+    @Override
+    public void skipValues(int total, VectorizedValuesReader valuesReader) {
+        valuesReader.skipIntegers(total);

Review Comment:
   ```suggestion
         valuesReader.skipIntegers(total);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-40876][SQL] Widening type promotion for decimals with larger scale in Parquet readers [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #44513:
URL: https://github.com/apache/spark/pull/44513#issuecomment-1880598516

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-40876][SQL] Widening type promotion for decimals with larger scale in Parquet readers [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan closed pull request #44513: [SPARK-40876][SQL] Widening type promotion for decimals with larger scale in Parquet readers
URL: https://github.com/apache/spark/pull/44513


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-40876][SQL] Widening type promotion for decimals with larger scale in Parquet readers [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #44513:
URL: https://github.com/apache/spark/pull/44513#discussion_r1444301267


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java:
##########
@@ -1418,16 +1610,21 @@ private SchemaColumnConvertNotSupportedException constructConvertNotSupportedExc
 
   private static boolean canReadAsIntDecimal(ColumnDescriptor descriptor, DataType dt) {
     if (!DecimalType.is32BitDecimalType(dt)) return false;
-    return isDecimalTypeMatched(descriptor, dt);
+    return isDecimalTypeMatched(descriptor, dt) && isSameDecimalScale(descriptor, dt);

Review Comment:
   is this extra restriction a bug fix?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-40876][SQL] Widening type promotion for decimals with larger scale in Parquet readers [spark]

Posted by "johanl-db (via GitHub)" <gi...@apache.org>.

johanl-db commented on PR #44513:
URL: https://github.com/apache/spark/pull/44513#issuecomment-1876775193

   @LuciferYang or @cloud-fan since you approved the previous similar change https://github.com/apache/spark/pull/44368, could you take a look at this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-40876][SQL] Widening type promotion for decimals with larger scale in Parquet readers [spark]

Posted by "johanl-db (via GitHub)" <gi...@apache.org>.

johanl-db commented on code in PR #44513:
URL: https://github.com/apache/spark/pull/44513#discussion_r1437160457


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java:
##########
@@ -152,32 +152,50 @@ private boolean isLazyDecodingSupported(
     switch (typeName) {
       case INT32: {
         boolean isDate = logicalTypeAnnotation instanceof DateLogicalTypeAnnotation;
-        boolean needsUpcast = sparkType == LongType || (isDate && sparkType == TimestampNTZType) ||
-          !DecimalType.is32BitDecimalType(sparkType);
+        boolean isDecimal = logicalTypeAnnotation instanceof DecimalLogicalTypeAnnotation;
+        boolean needsUpcast = sparkType == LongType || sparkType == DoubleType ||
+          (isDate && sparkType == TimestampNTZType) ||
+          (isDecimal && !DecimalType.is32BitDecimalType(sparkType));

Review Comment:
   This fixes an issue from https://github.com/apache/spark/pull/44368, we were incorrectly disabling lazy dictionary decoding for any non-decimal (INT32) type



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org