You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/06/30 15:12:43 UTC

[GitHub] [iceberg] bryanck opened a new pull request, #5168: Arrow: Pad decimal bytes before passing to decimal vector

bryanck opened a new pull request, #5168:
URL: https://github.com/apache/iceberg/pull/5168

   The vectorized reader benchmarks showed that the Iceberg Parquet vectorized reader falls behind the one in Spark when reading decimal types. When profiling the code, a bottleneck was discovered in a method in Arrow that pads the byte buffer when setting a value in the DecimalVector, specifically [this operation](https://github.com/apache/arrow/blob/fb6f200278ecdf65394fc293de8d35edfcda8bde/java/vector/src/main/java/org/apache/arrow/vector/DecimalVector.java#L241).
   
   Runs of [this benchmark](https://gist.github.com/anonymous/1c5ceca58222430d7dfb85bc5cc0e6a1) showed that calling `Unsafe.setMemory()` can be slower than Java array operations. Results of a run are [here](https://gist.github.com/bryanck/9e293b8b10d7f29810ec8fdbb0582ae8).
   
   This PR adds a workaround that pads the byte buffer before calling `setBigEndian()` to avoid `Unsafe.setMemory()` from being called.
   
   Here are the results of a run of the `VectorizedReadDictionaryEncodedFlatParquetDataBenchmark` benchmark without this change:
   ```
   Benchmark                                                                                  Mode  Cnt   Score   Error  Units
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDatesIcebergVectorized5k         ss    5   2.016 ± 0.069   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDatesSparkVectorized5k           ss    5   2.083 ± 0.076   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDecimalsIcebergVectorized5k      ss    5  14.451 ± 0.273   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDecimalsSparkVectorized5k        ss    5   6.886 ± 0.163   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDoublesIcebergVectorized5k       ss    5   2.058 ± 0.108   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDoublesSparkVectorized5k         ss    5   1.731 ± 0.117   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readFloatsIcebergVectorized5k        ss    5   1.905 ± 0.016   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readFloatsSparkVectorized5k          ss    5   2.436 ± 0.178   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readIntegersIcebergVectorized5k      ss    5   2.975 ± 0.053   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readIntegersSparkVectorized5k        ss    5   2.461 ± 0.951   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readLongsIcebergVectorized5k         ss    5   2.713 ± 0.075   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readLongsSparkVectorized5k           ss    5   2.321 ± 0.953   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readStringsIcebergVectorized5k       ss    5   3.154 ± 0.062   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readStringsSparkVectorized5k         ss    5   4.567 ± 1.864   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readTimestampsIcebergVectorized5k    ss    5   2.674 ± 0.085   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readTimestampsSparkVectorized5k      ss    5   2.634 ± 0.089   s/op
   ```
   
   Here are the results of a run with this change:
   ```
   Benchmark                                                                                  Mode  Cnt  Score   Error  Units
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDatesIcebergVectorized5k         ss    5  2.339 ± 1.092   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDatesSparkVectorized5k           ss    5  2.204 ± 0.085   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDecimalsIcebergVectorized5k      ss    5  8.501 ± 0.129   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDecimalsSparkVectorized5k        ss    5  7.130 ± 0.111   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDoublesIcebergVectorized5k       ss    5  2.677 ± 0.083   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readDoublesSparkVectorized5k         ss    5  2.251 ± 0.142   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readFloatsIcebergVectorized5k        ss    5  2.616 ± 0.090   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readFloatsSparkVectorized5k          ss    5  2.438 ± 0.074   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readIntegersIcebergVectorized5k      ss    5  2.620 ± 0.171   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readIntegersSparkVectorized5k        ss    5  2.242 ± 0.140   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readLongsIcebergVectorized5k         ss    5  2.679 ± 0.084   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readLongsSparkVectorized5k           ss    5  2.504 ± 0.173   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readStringsIcebergVectorized5k       ss    5  3.804 ± 0.215   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readStringsSparkVectorized5k         ss    5  4.864 ± 0.163   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readTimestampsIcebergVectorized5k    ss    5  2.544 ± 0.086   s/op
   VectorizedReadDictionaryEncodedFlatParquetDataBenchmark.readTimestampsSparkVectorized5k      ss    5  2.524 ± 0.193   s/op
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] bryanck commented on a diff in pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

bryanck commented on code in PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#discussion_r912546668


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedDictionaryEncodedParquetValuesReader.java:
##########
@@ -128,12 +128,11 @@ protected void nextVal(FieldVector vector, Dictionary dict, int idx, int current
   class FixedLengthDecimalDictEncodedReader extends BaseDictEncodedReader {
     @Override
     protected void nextVal(FieldVector vector, Dictionary dict, int idx, int currentVal, int typeWidth) {
-      byte[] decimalBytes = dict.decodeToBinary(currentVal).getBytesUnsafe();
-      byte[] vectorBytes = new byte[typeWidth];
-      System.arraycopy(decimalBytes, 0, vectorBytes, 0, typeWidth);
+      byte[] vectorBytes =
+          DecimalVectorUtil.padBigEndianBytes(
+              dict.decodeToBinary(currentVal).getBytesUnsafe(),
+              DecimalVector.TYPE_WIDTH);

Review Comment:
   I believe so, for Decimal it looked like it was always set to 16.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] bryanck commented on a diff in pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

bryanck commented on code in PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#discussion_r913086603


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java:
##########
@@ -358,7 +358,8 @@ class FixedLengthDecimalReader extends BaseReader {
     protected void nextVal(
         FieldVector vector, int idx, ValuesAsBytesReader valuesReader, int typeWidth, byte[] byteArray) {
       valuesReader.getBuffer(typeWidth).get(byteArray, 0, typeWidth);
-      ((DecimalVector) vector).setBigEndian(idx, byteArray);
+      byte[] vectorBytes = DecimalVectorUtil.padBigEndianBytes(byteArray, DecimalVector.TYPE_WIDTH);

Review Comment:
   On second thought about reusing the buffer, we could create a buffer per value reader so the width of the value is the same, then skip the array fill (if you have 2 buffers, one for negative and one for positive values)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

rdblue commented on code in PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#discussion_r912544384


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/DecimalVectorUtil.java:
##########
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.arrow.vectorized.parquet;
+
+import java.util.Arrays;
+
+public class DecimalVectorUtil {
+
+  private DecimalVectorUtil() {
+  }
+
+  /**
+   * Parquet stores decimal values in big-endian byte order, and Arrow stores them in native byte order.
+   * When setting the value in Arrow, we call setBigEndian(), and the byte order is reversed if needed.
+   * Also, the byte array is padded to fill 16 bytes in length by calling Unsafe.setMemory(). The padding
+   * operation can be slow, so by using this utility method, we can pad before calling setBigEndian() and
+   * avoid the call to Unsafe.setMemory().
+   *
+   * @param bigEndianBytes The big endian bytes
+   * @param newLength      The length of the byte array to return
+   * @return The new byte array
+   */
+  public static byte[] padBigEndianBytes(byte[] bigEndianBytes, int newLength) {
+    if (bigEndianBytes.length == newLength) {
+      return bigEndianBytes;
+    } else if (bigEndianBytes.length < newLength) {
+      byte[] result = new byte[newLength];
+      if (bigEndianBytes.length == 0) {
+        return result;

Review Comment:
   @bryanck, is this hit? It looks like an invalid case, but we're choosing to return 0 for it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

rdblue commented on code in PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#discussion_r912544290


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedDictionaryEncodedParquetValuesReader.java:
##########
@@ -128,12 +128,11 @@ protected void nextVal(FieldVector vector, Dictionary dict, int idx, int current
   class FixedLengthDecimalDictEncodedReader extends BaseDictEncodedReader {
     @Override
     protected void nextVal(FieldVector vector, Dictionary dict, int idx, int currentVal, int typeWidth) {
-      byte[] decimalBytes = dict.decodeToBinary(currentVal).getBytesUnsafe();
-      byte[] vectorBytes = new byte[typeWidth];
-      System.arraycopy(decimalBytes, 0, vectorBytes, 0, typeWidth);
+      byte[] vectorBytes =
+          DecimalVectorUtil.padBigEndianBytes(
+              dict.decodeToBinary(currentVal).getBytesUnsafe(),
+              DecimalVector.TYPE_WIDTH);
       ((DecimalVector) vector).setBigEndian(idx, vectorBytes);
-      ByteBuffer buffer = dict.decodeToBinary(currentVal).toByteBuffer();
-      vector.getDataBuffer().setBytes(idx, buffer);

Review Comment:
   @bryanck, was this really setting the value twice? It looks like it was calling `setBigEndian` on the vector and then `setBytes` on the backing buffer. That could explain a lot of the slowness as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] bryanck commented on a diff in pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

bryanck commented on code in PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#discussion_r913077062


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedDictionaryEncodedParquetValuesReader.java:
##########
@@ -128,12 +128,11 @@ protected void nextVal(FieldVector vector, Dictionary dict, int idx, int current
   class FixedLengthDecimalDictEncodedReader extends BaseDictEncodedReader {
     @Override
     protected void nextVal(FieldVector vector, Dictionary dict, int idx, int currentVal, int typeWidth) {
-      byte[] decimalBytes = dict.decodeToBinary(currentVal).getBytesUnsafe();
-      byte[] vectorBytes = new byte[typeWidth];
-      System.arraycopy(decimalBytes, 0, vectorBytes, 0, typeWidth);
+      byte[] vectorBytes =
+          DecimalVectorUtil.padBigEndianBytes(
+              dict.decodeToBinary(currentVal).getBytesUnsafe(),
+              DecimalVector.TYPE_WIDTH);

Review Comment:
   `typeWidth` is the Parquet width, I believe, which is variable depending on the scale of the decimal, but the Arrow width is always 16.



##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedDictionaryEncodedParquetValuesReader.java:
##########
@@ -128,12 +128,11 @@ protected void nextVal(FieldVector vector, Dictionary dict, int idx, int current
   class FixedLengthDecimalDictEncodedReader extends BaseDictEncodedReader {
     @Override
     protected void nextVal(FieldVector vector, Dictionary dict, int idx, int currentVal, int typeWidth) {
-      byte[] decimalBytes = dict.decodeToBinary(currentVal).getBytesUnsafe();
-      byte[] vectorBytes = new byte[typeWidth];
-      System.arraycopy(decimalBytes, 0, vectorBytes, 0, typeWidth);
+      byte[] vectorBytes =
+          DecimalVectorUtil.padBigEndianBytes(
+              dict.decodeToBinary(currentVal).getBytesUnsafe(),
+              DecimalVector.TYPE_WIDTH);

Review Comment:
   I believe so, for Decimal it looked like it was always set to 16.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] bryanck commented on a diff in pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

bryanck commented on code in PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#discussion_r912546998


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java:
##########
@@ -369,9 +370,10 @@ protected void nextDictEncodedVal(
         reader.fixedLengthDecimalDictEncodedReader()
             .nextBatch(vector, idx, numValuesToRead, dict, nullabilityHolder, typeWidth);
       } else if (Mode.PACKED.equals(mode)) {
-        ByteBuffer decimalBytes = dict.decodeToBinary(reader.readInteger()).toByteBuffer();
-        byte[] vectorBytes = new byte[typeWidth];
-        System.arraycopy(decimalBytes, 0, vectorBytes, 0, typeWidth);

Review Comment:
   I believe this would have thrown an `ArrayStoreException`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] bryanck commented on a diff in pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

bryanck commented on code in PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#discussion_r913086603


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java:
##########
@@ -358,7 +358,8 @@ class FixedLengthDecimalReader extends BaseReader {
     protected void nextVal(
         FieldVector vector, int idx, ValuesAsBytesReader valuesReader, int typeWidth, byte[] byteArray) {
       valuesReader.getBuffer(typeWidth).get(byteArray, 0, typeWidth);
-      ((DecimalVector) vector).setBigEndian(idx, byteArray);
+      byte[] vectorBytes = DecimalVectorUtil.padBigEndianBytes(byteArray, DecimalVector.TYPE_WIDTH);

Review Comment:
   On second thought about reusing the buffer, we could create a buffer per value reader so the width of the value is the same, then skip the array fill.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

rdblue commented on code in PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#discussion_r912544560


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedDictionaryEncodedParquetValuesReader.java:
##########
@@ -128,12 +128,11 @@ protected void nextVal(FieldVector vector, Dictionary dict, int idx, int current
   class FixedLengthDecimalDictEncodedReader extends BaseDictEncodedReader {
     @Override
     protected void nextVal(FieldVector vector, Dictionary dict, int idx, int currentVal, int typeWidth) {
-      byte[] decimalBytes = dict.decodeToBinary(currentVal).getBytesUnsafe();
-      byte[] vectorBytes = new byte[typeWidth];
-      System.arraycopy(decimalBytes, 0, vectorBytes, 0, typeWidth);
+      byte[] vectorBytes =
+          DecimalVectorUtil.padBigEndianBytes(
+              dict.decodeToBinary(currentVal).getBytesUnsafe(),
+              DecimalVector.TYPE_WIDTH);

Review Comment:
   Is `typeWidth` going to be the same as `DecimalVector.TYPE_WIDTH`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

rdblue commented on code in PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#discussion_r912544834


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java:
##########
@@ -369,9 +370,10 @@ protected void nextDictEncodedVal(
         reader.fixedLengthDecimalDictEncodedReader()
             .nextBatch(vector, idx, numValuesToRead, dict, nullabilityHolder, typeWidth);
       } else if (Mode.PACKED.equals(mode)) {
-        ByteBuffer decimalBytes = dict.decodeToBinary(reader.readInteger()).toByteBuffer();
-        byte[] vectorBytes = new byte[typeWidth];
-        System.arraycopy(decimalBytes, 0, vectorBytes, 0, typeWidth);

Review Comment:
   Was this correct before? It looks like it was trying to use `System.arraycopy` with a `ByteBuffer`!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] bryanck commented on a diff in pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

bryanck commented on code in PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#discussion_r912546374


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedDictionaryEncodedParquetValuesReader.java:
##########
@@ -128,12 +128,11 @@ protected void nextVal(FieldVector vector, Dictionary dict, int idx, int current
   class FixedLengthDecimalDictEncodedReader extends BaseDictEncodedReader {
     @Override
     protected void nextVal(FieldVector vector, Dictionary dict, int idx, int currentVal, int typeWidth) {
-      byte[] decimalBytes = dict.decodeToBinary(currentVal).getBytesUnsafe();
-      byte[] vectorBytes = new byte[typeWidth];
-      System.arraycopy(decimalBytes, 0, vectorBytes, 0, typeWidth);
+      byte[] vectorBytes =
+          DecimalVectorUtil.padBigEndianBytes(
+              dict.decodeToBinary(currentVal).getBytesUnsafe(),
+              DecimalVector.TYPE_WIDTH);
       ((DecimalVector) vector).setBigEndian(idx, vectorBytes);
-      ByteBuffer buffer = dict.decodeToBinary(currentVal).toByteBuffer();
-      vector.getDataBuffer().setBytes(idx, buffer);

Review Comment:
   It looks like that's what it was doing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] bryanck commented on a diff in pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

bryanck commented on code in PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#discussion_r912549586


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java:
##########
@@ -358,7 +358,8 @@ class FixedLengthDecimalReader extends BaseReader {
     protected void nextVal(
         FieldVector vector, int idx, ValuesAsBytesReader valuesReader, int typeWidth, byte[] byteArray) {
       valuesReader.getBuffer(typeWidth).get(byteArray, 0, typeWidth);
-      ((DecimalVector) vector).setBigEndian(idx, byteArray);
+      byte[] vectorBytes = DecimalVectorUtil.padBigEndianBytes(byteArray, DecimalVector.TYPE_WIDTH);

Review Comment:
   (It looks like dictionary encoding for fixed length byte arrays wouldn't work correctly anyway, I may follow up with a fix for that)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] bryanck commented on a diff in pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

bryanck commented on code in PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#discussion_r913181715


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java:
##########
@@ -358,7 +358,8 @@ class FixedLengthDecimalReader extends BaseReader {
     protected void nextVal(
         FieldVector vector, int idx, ValuesAsBytesReader valuesReader, int typeWidth, byte[] byteArray) {
       valuesReader.getBuffer(typeWidth).get(byteArray, 0, typeWidth);
-      ((DecimalVector) vector).setBigEndian(idx, byteArray);
+      byte[] vectorBytes = DecimalVectorUtil.padBigEndianBytes(byteArray, DecimalVector.TYPE_WIDTH);

Review Comment:
   Here's the [PR](https://github.com/apache/iceberg/pull/5198) that has a fix for the dictionary encoding



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] bryanck commented on pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

bryanck commented on PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#issuecomment-1171349495

   I'd be interested to see what results others are seeing with the benchmarks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] bryanck commented on a diff in pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

bryanck commented on code in PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#discussion_r913077062


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedDictionaryEncodedParquetValuesReader.java:
##########
@@ -128,12 +128,11 @@ protected void nextVal(FieldVector vector, Dictionary dict, int idx, int current
   class FixedLengthDecimalDictEncodedReader extends BaseDictEncodedReader {
     @Override
     protected void nextVal(FieldVector vector, Dictionary dict, int idx, int currentVal, int typeWidth) {
-      byte[] decimalBytes = dict.decodeToBinary(currentVal).getBytesUnsafe();
-      byte[] vectorBytes = new byte[typeWidth];
-      System.arraycopy(decimalBytes, 0, vectorBytes, 0, typeWidth);
+      byte[] vectorBytes =
+          DecimalVectorUtil.padBigEndianBytes(
+              dict.decodeToBinary(currentVal).getBytesUnsafe(),
+              DecimalVector.TYPE_WIDTH);

Review Comment:
   `typeWidth` is the Parquet width, I believe, which is variable depending on the precision of the decimal, but the Arrow width is always 16.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] danielcweeks merged pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

danielcweeks merged PR #5168:
URL: https://github.com/apache/iceberg/pull/5168


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] bryanck commented on a diff in pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

bryanck commented on code in PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#discussion_r912546490


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/DecimalVectorUtil.java:
##########
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.arrow.vectorized.parquet;
+
+import java.util.Arrays;
+
+public class DecimalVectorUtil {
+
+  private DecimalVectorUtil() {
+  }
+
+  /**
+   * Parquet stores decimal values in big-endian byte order, and Arrow stores them in native byte order.
+   * When setting the value in Arrow, we call setBigEndian(), and the byte order is reversed if needed.
+   * Also, the byte array is padded to fill 16 bytes in length by calling Unsafe.setMemory(). The padding
+   * operation can be slow, so by using this utility method, we can pad before calling setBigEndian() and
+   * avoid the call to Unsafe.setMemory().
+   *
+   * @param bigEndianBytes The big endian bytes
+   * @param newLength      The length of the byte array to return
+   * @return The new byte array
+   */
+  public static byte[] padBigEndianBytes(byte[] bigEndianBytes, int newLength) {
+    if (bigEndianBytes.length == newLength) {
+      return bigEndianBytes;
+    } else if (bigEndianBytes.length < newLength) {
+      byte[] result = new byte[newLength];
+      if (bigEndianBytes.length == 0) {
+        return result;

Review Comment:
   Probably not, I was mimicking the behavior in `DecimalVector.setBigEndian()` to be on the safe side.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] bryanck commented on a diff in pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

bryanck commented on code in PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#discussion_r912547730


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java:
##########
@@ -358,7 +358,8 @@ class FixedLengthDecimalReader extends BaseReader {
     protected void nextVal(
         FieldVector vector, int idx, ValuesAsBytesReader valuesReader, int typeWidth, byte[] byteArray) {
       valuesReader.getBuffer(typeWidth).get(byteArray, 0, typeWidth);
-      ((DecimalVector) vector).setBigEndian(idx, byteArray);
+      byte[] vectorBytes = DecimalVectorUtil.padBigEndianBytes(byteArray, DecimalVector.TYPE_WIDTH);

Review Comment:
   One thing that was a little bit faster was to bypass `DecimalVector.setBigEndian()`, convert to little endian (if needed) and copy the bytes directly to the value buffer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] bryanck commented on a diff in pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

bryanck commented on code in PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#discussion_r912547916


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java:
##########
@@ -358,7 +358,8 @@ class FixedLengthDecimalReader extends BaseReader {
     protected void nextVal(
         FieldVector vector, int idx, ValuesAsBytesReader valuesReader, int typeWidth, byte[] byteArray) {
       valuesReader.getBuffer(typeWidth).get(byteArray, 0, typeWidth);
-      ((DecimalVector) vector).setBigEndian(idx, byteArray);
+      byte[] vectorBytes = DecimalVectorUtil.padBigEndianBytes(byteArray, DecimalVector.TYPE_WIDTH);

Review Comment:
   Also one thing to note is that the benchmark isn't quite right. Decimal(20,5) will end up taking 9 bytes and will thus use a fixed length byte array instead of long or int encoding. And fixed length byte arrays aren't dictionary encoded in Parquet v1. That explains why the decimal benchmark is much slower than the other data types (which are dictionary encoded).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] bryanck commented on a diff in pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

bryanck commented on code in PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#discussion_r912546780


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java:
##########
@@ -358,7 +358,8 @@ class FixedLengthDecimalReader extends BaseReader {
     protected void nextVal(
         FieldVector vector, int idx, ValuesAsBytesReader valuesReader, int typeWidth, byte[] byteArray) {
       valuesReader.getBuffer(typeWidth).get(byteArray, 0, typeWidth);
-      ((DecimalVector) vector).setBigEndian(idx, byteArray);
+      byte[] vectorBytes = DecimalVectorUtil.padBigEndianBytes(byteArray, DecimalVector.TYPE_WIDTH);

Review Comment:
   I did some testing, and reusing the buffer was a little bit slower, partly because we need to always to fill the buffer to zero out the last value.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

rdblue commented on code in PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#discussion_r912544384


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/DecimalVectorUtil.java:
##########
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.arrow.vectorized.parquet;
+
+import java.util.Arrays;
+
+public class DecimalVectorUtil {
+
+  private DecimalVectorUtil() {
+  }
+
+  /**
+   * Parquet stores decimal values in big-endian byte order, and Arrow stores them in native byte order.
+   * When setting the value in Arrow, we call setBigEndian(), and the byte order is reversed if needed.
+   * Also, the byte array is padded to fill 16 bytes in length by calling Unsafe.setMemory(). The padding
+   * operation can be slow, so by using this utility method, we can pad before calling setBigEndian() and
+   * avoid the call to Unsafe.setMemory().
+   *
+   * @param bigEndianBytes The big endian bytes
+   * @param newLength      The length of the byte array to return
+   * @return The new byte array
+   */
+  public static byte[] padBigEndianBytes(byte[] bigEndianBytes, int newLength) {
+    if (bigEndianBytes.length == newLength) {
+      return bigEndianBytes;
+    } else if (bigEndianBytes.length < newLength) {
+      byte[] result = new byte[newLength];
+      if (bigEndianBytes.length == 0) {
+        return result;

Review Comment:
   @bryanck, is this hit? It looks like an invalid case because the decimal precision would need to be 0, but we're choosing to return 0 for it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #5168: Arrow: Pad decimal bytes before passing to decimal vector

Posted by GitBox <gi...@apache.org>.

rdblue commented on code in PR #5168:
URL: https://github.com/apache/iceberg/pull/5168#discussion_r912544742


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java:
##########
@@ -358,7 +358,8 @@ class FixedLengthDecimalReader extends BaseReader {
     protected void nextVal(
         FieldVector vector, int idx, ValuesAsBytesReader valuesReader, int typeWidth, byte[] byteArray) {
       valuesReader.getBuffer(typeWidth).get(byteArray, 0, typeWidth);
-      ((DecimalVector) vector).setBigEndian(idx, byteArray);
+      byte[] vectorBytes = DecimalVectorUtil.padBigEndianBytes(byteArray, DecimalVector.TYPE_WIDTH);

Review Comment:
   This looks like a place where we could reuse a buffer rather than allocating in `padBigEndianBytes` every time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org