Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/30 09:41:25 UTC

[GitHub] [iceberg] openinx opened a new pull request #1271: Align the records written by GenericOrcWriter and SparkOrcWriter

openinx opened a new pull request #1271:
URL: https://github.com/apache/iceberg/pull/1271


   This PR addresses the bug in https://github.com/apache/iceberg/issues/1269. It mainly fixes two sub-issues (illustrative sketches follow this list):
   
   1.  When writing a decimal (precision <= 18) into a Hive ORC file, the ORC writer scales the decimal down. For example, for the value 10.100 of type `Decimal(10, 3)`, Hive ORC removes the trailing zeros and stores it as 101*10^(-1), i.e. with precision 3 and scale 1. The scale of the decimal read back from the Hive ORC file is therefore not strictly equal to 3, so both the Spark ORC reader and the generic ORC reader need to transform it back to the expected scale of 3. Otherwise, the unit tests break.
   
   2. The long value of a zoned timestamp can be negative, but the Spark ORC reader/writer did not consider this case and simply used `/` and `%` for the arithmetic; it should use `Math.floorDiv` and `Math.floorMod` instead.
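   
   The two sketches below are standalone illustrations, not code from this PR; the class names and literal values in them are made up for demonstration and only `java.math`/`java.lang` are used.
   
   A minimal sketch of issue 1: Hive ORC keeps 10.100 of `Decimal(10, 3)` with trailing zeros stripped, so the reader has to rescale it back to the declared scale.
   
   ```java
   import java.math.BigDecimal;
   import java.math.RoundingMode;
   
   public class DecimalScaleSketch {
     public static void main(String[] args) {
       // Hive ORC effectively keeps 10.100 as unscaled value 101 with scale 1 (i.e. 10.1).
       BigDecimal stored = BigDecimal.valueOf(101, 1);
   
       // On read, rescale back to the declared scale (3) so it compares equal to what
       // was written. Going from scale 1 to 3 needs no rounding, so UNNECESSARY is safe.
       BigDecimal normalized = stored.setScale(3, RoundingMode.UNNECESSARY);
   
       System.out.println(stored + " -> " + normalized);   // prints: 10.1 -> 10.100
     }
   }
   ```
   
   A minimal sketch of issue 2: for timestamps before the epoch the long value is negative, and truncating `/` and `%` split it differently from `Math.floorDiv`/`Math.floorMod`.
   
   ```java
   public class FloorDivSketch {
     public static void main(String[] args) {
       // A timestamp before 1970 has a negative microsecond value.
       long micros = -1_500_001L;
   
       // Truncating division/remainder split the value incorrectly for negatives.
       long truncSeconds = micros / 1_000_000;                  // -1
       long truncRemainder = micros % 1_000_000;                // -500001
   
       // Floor-based arithmetic keeps the remainder in [0, 1_000_000).
       long seconds = Math.floorDiv(micros, 1_000_000);         // -2
       long nanos = Math.floorMod(micros, 1_000_000) * 1_000L;  // 499_999_000
   
       System.out.println(truncSeconds + "," + truncRemainder + " vs " + seconds + "," + nanos);
     }
   }
   ```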
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1271: Align the records written by GenericOrcWriter and SparkOrcWriter

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1271:
URL: https://github.com/apache/iceberg/pull/1271#issuecomment-669522605


   This looks good to me, except that we need to update the test to validate against the original in-memory records, not against a set that was read from a file. It would also be good to have a test that specifically exercises the decimal path, or to increase the number of random records until we are confident that at least one decimal will have one or more trailing 0s.




[GitHub] [iceberg] openinx commented on pull request #1271: Align the records written by GenericOrcWriter and SparkOrcWriter

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #1271:
URL: https://github.com/apache/iceberg/pull/1271#issuecomment-670276468


   OK, let me resolve the conflicts. 




[GitHub] [iceberg] openinx commented on a change in pull request #1271: Align the records written by GenericOrcWriter and SparkOrcWriter

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #1271:
URL: https://github.com/apache/iceberg/pull/1271#discussion_r464835177



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -195,7 +196,12 @@ public Long nonNullRead(ColumnVector vector, int row) {
     @Override
     public Decimal nonNullRead(ColumnVector vector, int row) {
       HiveDecimalWritable value = ((DecimalColumnVector) vector).vector[row];
-      return new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
+      BigDecimal decimal = new BigDecimal(BigInteger.valueOf(value.serialize64(value.scale())), value.scale());
+
+      Preconditions.checkArgument(value.precision() <= precision && precision <= 18,

Review comment:
       ditto.

##########
File path: data/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java
##########
@@ -288,8 +289,10 @@ public void nonNullWrite(int rowId, LocalDate data, ColumnVector output) {
     @Override
     public void nonNullWrite(int rowId, OffsetDateTime data, ColumnVector output) {
       TimestampColumnVector cv = (TimestampColumnVector) output;
-      cv.time[rowId] = data.toInstant().toEpochMilli(); // millis
-      cv.nanos[rowId] = (data.getNano() / 1_000) * 1_000; // truncate nanos to only keep microsecond precision
+      // millis
+      cv.time[rowId] = data.toInstant().toEpochMilli();
+      // truncate nanos to only keep microsecond precision
+      cv.nanos[rowId] = Math.floorDiv(data.getNano(), 1_000) * 1_000;

Review comment:
       OK, I see that the javadoc says the nano-of-second ranges from 0 to 999,999,999. You're right, we don't need the `floorDiv` here.

##########
File path: data/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java
##########
@@ -324,14 +329,24 @@ public void nonNullWrite(int rowId, LocalDateTime data, ColumnVector output) {
 
     @Override
     public void nonNullWrite(int rowId, BigDecimal data, ColumnVector output) {
-      // TODO: validate precision and scale from schema
+      Preconditions.checkArgument(data.scale() == scale,
+          "Cannot write value as decimal(%s,%s), wrong scale: %s", precision, scale, data);
+      Preconditions.checkArgument(data.precision() <= precision && precision <= 18,

Review comment:
       The `precision <= 18` check can be removed now, because we've already checked it [here](https://github.com/apache/iceberg/pull/1271/files#diff-b1b07b15f036000a3f2bed76fdd9f961R108).

##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -212,6 +218,10 @@ public Decimal nonNullRead(ColumnVector vector, int row) {
     public Decimal nonNullRead(ColumnVector vector, int row) {
       BigDecimal value = ((DecimalColumnVector) vector).vector[row]
           .getHiveDecimal().bigDecimalValue();
+
+      Preconditions.checkArgument(value.precision() <= precision && precision <= 38,

Review comment:
       ditto.

##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -195,7 +196,12 @@ public Long nonNullRead(ColumnVector vector, int row) {
     @Override
     public Decimal nonNullRead(ColumnVector vector, int row) {
       HiveDecimalWritable value = ((DecimalColumnVector) vector).vector[row];
-      return new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
+      BigDecimal decimal = new BigDecimal(BigInteger.valueOf(value.serialize64(value.scale())), value.scale());

Review comment:
       Sounds great. The essential purpose here is to construct a `Decimal` with the correct `precision` and `scale` (instead of `value.precision()` and `value.scale()`).

##########
File path: data/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java
##########
@@ -340,7 +355,11 @@ public void nonNullWrite(int rowId, BigDecimal data, ColumnVector output) {
 
     @Override
     public void nonNullWrite(int rowId, BigDecimal data, ColumnVector output) {
-      // TODO: validate precision and scale from schema
+      Preconditions.checkArgument(data.scale() == scale,
+          "Cannot write value as decimal(%s,%s), wrong scale: %s", precision, scale, data);
+      Preconditions.checkArgument(data.precision() <= precision && precision <= 38,

Review comment:
       ditto.

##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcWriter.java
##########
@@ -237,9 +239,14 @@ public void addValue(int rowId, int column, SpecializedGetters data,
         output.noNulls = false;
         output.isNull[rowId] = true;
       } else {
+        Decimal decimal = data.getDecimal(column, precision, scale);
+        Preconditions.checkArgument(scale == decimal.scale(),
+            "Cannot write value as decimal(%s,%s), wrong scale: %s", precision, scale, decimal);
+        Preconditions.checkArgument(decimal.precision() <= precision && precision <= 18,
+            "Cannot write value as decimal(%s,%s), invalid precision: %s", decimal);

Review comment:
       Makes sense. They can be removed now.






[GitHub] [iceberg] shardulm94 commented on a change in pull request #1271: Align the records written by GenericOrcWriter and SparkOrcWriter

Posted by GitBox <gi...@apache.org>.
shardulm94 commented on a change in pull request #1271:
URL: https://github.com/apache/iceberg/pull/1271#discussion_r464774002



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -195,7 +196,12 @@ public Long nonNullRead(ColumnVector vector, int row) {
     @Override
     public Decimal nonNullRead(ColumnVector vector, int row) {
       HiveDecimalWritable value = ((DecimalColumnVector) vector).vector[row];
-      return new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
+      BigDecimal decimal = new BigDecimal(BigInteger.valueOf(value.serialize64(value.scale())), value.scale());

Review comment:
       `value.serialize64()` takes an expected scale as a parameter, so I think the only change required to the original code is to pass our expected reader scale into `value.serialize64()` instead of `value.scale()`, and to pass the expected precision and scale to `Decimal.set`.
   
   So this would look like `return new Decimal().set(value.serialize64(scale), precision, scale);`






[GitHub] [iceberg] openinx commented on a change in pull request #1271: Align the records written by GenericOrcWriter and SparkOrcWriter

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #1271:
URL: https://github.com/apache/iceberg/pull/1271#discussion_r465426358



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -195,7 +196,12 @@ public Long nonNullRead(ColumnVector vector, int row) {
     @Override
     public Decimal nonNullRead(ColumnVector vector, int row) {
       HiveDecimalWritable value = ((DecimalColumnVector) vector).vector[row];
-      return new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
+      BigDecimal decimal = new BigDecimal(BigInteger.valueOf(value.serialize64(value.scale())), value.scale());

Review comment:
       Checked this again. I had wrongly used `return new Decimal().set(value.serialize64(value.scale()), precision, scale)` to construct the decimal before, which broke the unit tests. You are right, the long value is not tied to any precision. Sorry for the noise.






[GitHub] [iceberg] openinx commented on a change in pull request #1271: Align the records written by GenericOrcWriter and SparkOrcWriter

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #1271:
URL: https://github.com/apache/iceberg/pull/1271#discussion_r466198908



##########
File path: spark/src/test/java/org/apache/iceberg/spark/data/TestSparkRecordOrcReaderWriter.java
##########
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.Iterator;
+import java.util.List;
+import org.apache.iceberg.Files;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.data.RandomGenericData;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.data.orc.GenericOrcReader;
+import org.apache.iceberg.data.orc.GenericOrcWriter;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.orc.ORC;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.junit.Assert;
+
+public class TestSparkRecordOrcReaderWriter extends AvroDataTest {
+  private static final int NUM_RECORDS = 200;
+
+  @Override
+  protected void writeAndValidate(Schema schema) throws IOException {
+    List<Record> records = RandomGenericData.generate(schema, NUM_RECORDS, 1992L);

Review comment:
       That makes sense.






[GitHub] [iceberg] rdblue commented on a change in pull request #1271: Align the records written by GenericOrcWriter and SparkOrcWriter

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1271:
URL: https://github.com/apache/iceberg/pull/1271#discussion_r466014650



##########
File path: spark/src/test/java/org/apache/iceberg/spark/data/TestSparkRecordOrcReaderWriter.java
##########
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.Iterator;
+import java.util.List;
+import org.apache.iceberg.Files;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.data.RandomGenericData;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.data.orc.GenericOrcReader;
+import org.apache.iceberg.data.orc.GenericOrcWriter;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.orc.ORC;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.junit.Assert;
+
+public class TestSparkRecordOrcReaderWriter extends AvroDataTest {
+  private static final int NUM_RECORDS = 200;
+
+  @Override
+  protected void writeAndValidate(Schema schema) throws IOException {
+    List<Record> records = RandomGenericData.generate(schema, NUM_RECORDS, 1992L);

Review comment:
       Validation should be done against this data, not data that has been read from a file. That way the test won't be broken by a problem with the reader or writer that produces the expected rows. To validate against these, use the `GenericsHelpers.assertEqualsUnsafe` methods.
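   
   A rough sketch of what that restructuring could look like (only a fragment: the ORC write/read round-trip is elided, and the exact `GenericsHelpers.assertEqualsUnsafe` signature is assumed here to take the struct type, the expected `Record`, and the actual `InternalRow`):
   
   ```java
   List<Record> expected = RandomGenericData.generate(schema, NUM_RECORDS, 1992L);
   
   // ... write `expected` to a temporary ORC file and read it back as Spark
   // InternalRow (see the round-trip in this PR's test for the real code) ...
   List<InternalRow> actual = Lists.newArrayList(/* rows read back */);
   
   // Validate the rows that were read against the original in-memory records,
   // never against a second set of rows that was itself read from a file.
   Assert.assertEquals("Row counts should match", expected.size(), actual.size());
   for (int i = 0; i < expected.size(); i += 1) {
     GenericsHelpers.assertEqualsUnsafe(schema.asStruct(), expected.get(i), actual.get(i));
   }
   ```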






[GitHub] [iceberg] openinx commented on a change in pull request #1271: Align the records written by GenericOrcWriter and SparkOrcWriter

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #1271:
URL: https://github.com/apache/iceberg/pull/1271#discussion_r466203001



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -195,7 +197,15 @@ public Long nonNullRead(ColumnVector vector, int row) {
     @Override
     public Decimal nonNullRead(ColumnVector vector, int row) {
       HiveDecimalWritable value = ((DecimalColumnVector) vector).vector[row];
-      return new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
+
+      // The scale of decimal read from hive ORC file may be not equals to the expected scale. For data type
+      // decimal(10,3) and the value 10.100, the hive ORC writer will remove its trailing zero and store it
+      // as 101*10^(-1), its scale will adjust from 3 to 1. So here we could not assert that value.scale() == scale.
+      // we also need to convert the hive orc decimal to a decimal with expected precision and scale.
+      Preconditions.checkArgument(value.precision() <= precision,
+          "Cannot read value as decimal(%s,%s), too large: %s", precision, scale, value);

Review comment:
       It is necessary to do this check: we need to make sure that there's no bug when writing a decimal into ORC. For example, if for the decimal(3, 0) data type we encounter a Hive decimal `10000` (whose precision is 5), something upstream must have gone wrong, and throwing an exception is the correct behavior in that case.
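   
   As a small, self-contained illustration of that argument (the class name and values here are made up, not code from the PR), the precision check would fire like this:
   
   ```java
   import java.math.BigDecimal;
   
   public class PrecisionCheckSketch {
     public static void main(String[] args) {
       int precision = 3;   // declared type: decimal(3, 0)
       int scale = 0;
       BigDecimal value = new BigDecimal("10000");   // precision 5: something upstream went wrong
   
       // Mirrors the Preconditions.checkArgument in the diff above.
       if (value.precision() > precision) {
         throw new IllegalArgumentException(String.format(
             "Cannot read value as decimal(%d,%d), too large: %s", precision, scale, value));
       }
     }
   }
   ```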






[GitHub] [iceberg] shardulm94 commented on a change in pull request #1271: Align the records written by GenericOrcWriter and SparkOrcWriter

Posted by GitBox <gi...@apache.org>.
shardulm94 commented on a change in pull request #1271:
URL: https://github.com/apache/iceberg/pull/1271#discussion_r465199882



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -195,7 +196,12 @@ public Long nonNullRead(ColumnVector vector, int row) {
     @Override
     public Decimal nonNullRead(ColumnVector vector, int row) {
       HiveDecimalWritable value = ((DecimalColumnVector) vector).vector[row];
-      return new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
+      BigDecimal decimal = new BigDecimal(BigInteger.valueOf(value.serialize64(value.scale())), value.scale());

Review comment:
       I believe `value.serialize64` returns the unscaled long value (since precision <= 18, it always fits in a long); I don't think it is tied to any precision. Can you give an example of the case you are referring to?
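   
   As an aside, a tiny `BigDecimal` analogy (it does not use `HiveDecimalWritable`, so it only approximates what `serialize64` does) of why the unscaled long at a requested scale is independent of the stored precision:
   
   ```java
   import java.math.BigDecimal;
   import java.math.RoundingMode;
   
   public class UnscaledValueSketch {
     public static void main(String[] args) {
       // 10.100 stored with trailing zeros stripped: unscaled 101, scale 1.
       BigDecimal stored = BigDecimal.valueOf(101, 1);
   
       // The unscaled long at the expected scale (3) is 10100 regardless of the
       // precision the value was stored with; precision <= 18 keeps it in a long.
       long unscaled = stored.setScale(3, RoundingMode.UNNECESSARY)
           .unscaledValue().longValueExact();
       System.out.println(unscaled);   // prints: 10100
     }
   }
   ```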






[GitHub] [iceberg] openinx commented on pull request #1271: Align the records written by GenericOrcWriter and SparkOrcWriter

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #1271:
URL: https://github.com/apache/iceberg/pull/1271#issuecomment-669216486


   Ping @shardulm94 @rdsr @rdblue, any other concerns? Thanks.




[GitHub] [iceberg] rdblue merged pull request #1271: Align the records written by GenericOrcWriter and SparkOrcWriter

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #1271:
URL: https://github.com/apache/iceberg/pull/1271


   




[GitHub] [iceberg] shardulm94 commented on a change in pull request #1271: Align the records written by GenericOrcWriter and SparkOrcWriter

Posted by GitBox <gi...@apache.org>.
shardulm94 commented on a change in pull request #1271:
URL: https://github.com/apache/iceberg/pull/1271#discussion_r465199882



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -195,7 +196,12 @@ public Long nonNullRead(ColumnVector vector, int row) {
     @Override
     public Decimal nonNullRead(ColumnVector vector, int row) {
       HiveDecimalWritable value = ((DecimalColumnVector) vector).vector[row];
-      return new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
+      BigDecimal decimal = new BigDecimal(BigInteger.valueOf(value.serialize64(value.scale())), value.scale());

Review comment:
       I believe `value.serialize64` returns the raw long value adjusted for the requested scale (and since precision <= 18, it always fits in a long); I don't think it is tied to any precision. That being said, I am not very familiar with using decimals, so maybe I am missing something. Can you give an example of the case you are referring to?






[GitHub] [iceberg] openinx commented on pull request #1271: Align the records written by GenericOrcWriter and SparkOrcWriter

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #1271:
URL: https://github.com/apache/iceberg/pull/1271#issuecomment-670281703


   Rebased the patch; let's wait for the Travis test results.




[GitHub] [iceberg] openinx commented on a change in pull request #1271: Align the records written by GenericOrcWriter and SparkOrcWriter

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #1271:
URL: https://github.com/apache/iceberg/pull/1271#discussion_r464849520



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -195,7 +196,12 @@ public Long nonNullRead(ColumnVector vector, int row) {
     @Override
     public Decimal nonNullRead(ColumnVector vector, int row) {
       HiveDecimalWritable value = ((DecimalColumnVector) vector).vector[row];
-      return new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
+      BigDecimal decimal = new BigDecimal(BigInteger.valueOf(value.serialize64(value.scale())), value.scale());

Review comment:
       Oh, it seems this is still incorrect, because `value.serialize64(scale)` is still encoded with `value.precision()` and `value.scale()`. If we use the given `precision` and `scale` to parse this long value, it will be messed up. Note that `value.precision()` is not equal to `precision`, and the same goes for the scale.
   
   The correct way should be: 
   ```java
   Decimal decimal = new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
   decimal.changePrecision(precision, scale);
   ```
   
   






[GitHub] [iceberg] rdblue commented on a change in pull request #1271: Align the records written by GenericOrcWriter and SparkOrcWriter

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1271:
URL: https://github.com/apache/iceberg/pull/1271#discussion_r466007312



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -195,7 +197,15 @@ public Long nonNullRead(ColumnVector vector, int row) {
     @Override
     public Decimal nonNullRead(ColumnVector vector, int row) {
       HiveDecimalWritable value = ((DecimalColumnVector) vector).vector[row];
-      return new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
+
+      // The scale of decimal read from hive ORC file may be not equals to the expected scale. For data type
+      // decimal(10,3) and the value 10.100, the hive ORC writer will remove its trailing zero and store it
+      // as 101*10^(-1), its scale will adjust from 3 to 1. So here we could not assert that value.scale() == scale.
+      // we also need to convert the hive orc decimal to a decimal with expected precision and scale.
+      Preconditions.checkArgument(value.precision() <= precision,
+          "Cannot read value as decimal(%s,%s), too large: %s", precision, scale, value);

Review comment:
       I'm not sure we need to check the precision either. If we read a value, then we should return it, right?






[GitHub] [iceberg] shardulm94 commented on a change in pull request #1271: Align the records written by GenericOrcWriter and SparkOrcWriter

Posted by GitBox <gi...@apache.org>.
shardulm94 commented on a change in pull request #1271:
URL: https://github.com/apache/iceberg/pull/1271#discussion_r464729558



##########
File path: data/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java
##########
@@ -288,8 +289,10 @@ public void nonNullWrite(int rowId, LocalDate data, ColumnVector output) {
     @Override
     public void nonNullWrite(int rowId, OffsetDateTime data, ColumnVector output) {
       TimestampColumnVector cv = (TimestampColumnVector) output;
-      cv.time[rowId] = data.toInstant().toEpochMilli(); // millis
-      cv.nanos[rowId] = (data.getNano() / 1_000) * 1_000; // truncate nanos to only keep microsecond precision
+      // millis
+      cv.time[rowId] = data.toInstant().toEpochMilli();
+      // truncate nanos to only keep microsecond precision
+      cv.nanos[rowId] = Math.floorDiv(data.getNano(), 1_000) * 1_000;

Review comment:
       `data.getNano()` always returns a non-negative integer, so is this change required?

##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -195,7 +196,12 @@ public Long nonNullRead(ColumnVector vector, int row) {
     @Override
     public Decimal nonNullRead(ColumnVector vector, int row) {
       HiveDecimalWritable value = ((DecimalColumnVector) vector).vector[row];
-      return new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
+      BigDecimal decimal = new BigDecimal(BigInteger.valueOf(value.serialize64(value.scale())), value.scale());

Review comment:
       `value.serialize64()` will take in an expected scale as a parameter, so I think the only change required to the original code is to pass our expected reader scale into `value.serialize64()` instead of passing `value.scale()`.

##########
File path: data/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java
##########
@@ -324,14 +329,24 @@ public void nonNullWrite(int rowId, LocalDateTime data, ColumnVector output) {
 
     @Override
     public void nonNullWrite(int rowId, BigDecimal data, ColumnVector output) {
-      // TODO: validate precision and scale from schema
+      Preconditions.checkArgument(data.scale() == scale,
+          "Cannot write value as decimal(%s,%s), wrong scale: %s", precision, scale, data);
+      Preconditions.checkArgument(data.precision() <= precision && precision <= 18,

Review comment:
       Nit: `precision <= 18` check can be moved into the constructor

##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcWriter.java
##########
@@ -237,9 +239,14 @@ public void addValue(int rowId, int column, SpecializedGetters data,
         output.noNulls = false;
         output.isNull[rowId] = true;
       } else {
+        Decimal decimal = data.getDecimal(column, precision, scale);
+        Preconditions.checkArgument(scale == decimal.scale(),
+            "Cannot write value as decimal(%s,%s), wrong scale: %s", precision, scale, decimal);
+        Preconditions.checkArgument(decimal.precision() <= precision && precision <= 18,
+            "Cannot write value as decimal(%s,%s), invalid precision: %s", decimal);

Review comment:
       This check seems redundant to me. If we are already passing our expected precision and scale to `data.getDecimal()`, won't the scale and precision of the returned decimal always match?

##########
File path: data/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java
##########
@@ -288,8 +289,10 @@ public void nonNullWrite(int rowId, LocalDate data, ColumnVector output) {
     @Override
     public void nonNullWrite(int rowId, OffsetDateTime data, ColumnVector output) {
       TimestampColumnVector cv = (TimestampColumnVector) output;
-      cv.time[rowId] = data.toInstant().toEpochMilli(); // millis
-      cv.nanos[rowId] = (data.getNano() / 1_000) * 1_000; // truncate nanos to only keep microsecond precision
+      // millis
+      cv.time[rowId] = data.toInstant().toEpochMilli();
+      // truncate nanos to only keep microsecond precision
+      cv.nanos[rowId] = Math.floorDiv(data.getNano(), 1_000) * 1_000;

Review comment:
       If it is indeed needed, we should also update `TimestampWriter`.

##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcWriter.java
##########
@@ -261,9 +268,14 @@ public void addValue(int rowId, int column, SpecializedGetters data,
         output.isNull[rowId] = true;
       } else {
         output.isNull[rowId] = false;
-        ((DecimalColumnVector) output).vector[rowId].set(
-            HiveDecimal.create(data.getDecimal(column, precision, scale)
-                .toJavaBigDecimal()));
+
+        Decimal decimal = data.getDecimal(column, precision, scale);
+        Preconditions.checkArgument(scale == decimal.scale(),
+            "Cannot write value as decimal(%s,%s), wrong scale: %s", precision, scale, decimal);
+        Preconditions.checkArgument(decimal.precision() <= precision && precision <= 38,
+            "Cannot write value as decimal(%s,%s), invalid precision: %s", precision, scale, decimal);

Review comment:
       This check seems redundant to me. If we are already passing our expected precision and scale to `data.getDecimal()`, won't the scale and precision of the returned decimal always match?

##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -212,6 +218,10 @@ public Decimal nonNullRead(ColumnVector vector, int row) {
     public Decimal nonNullRead(ColumnVector vector, int row) {
       BigDecimal value = ((DecimalColumnVector) vector).vector[row]
           .getHiveDecimal().bigDecimalValue();
+
+      Preconditions.checkArgument(value.precision() <= precision && precision <= 38,

Review comment:
       Nit: `precision <= 38` check can be moved into the constructor

##########
File path: data/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java
##########
@@ -340,7 +355,11 @@ public void nonNullWrite(int rowId, BigDecimal data, ColumnVector output) {
 
     @Override
     public void nonNullWrite(int rowId, BigDecimal data, ColumnVector output) {
-      // TODO: validate precision and scale from schema
+      Preconditions.checkArgument(data.scale() == scale,
+          "Cannot write value as decimal(%s,%s), wrong scale: %s", precision, scale, data);
+      Preconditions.checkArgument(data.precision() <= precision && precision <= 38,

Review comment:
       Nit: `precision <= 38` check can be moved into the constructor

##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -195,7 +196,12 @@ public Long nonNullRead(ColumnVector vector, int row) {
     @Override
     public Decimal nonNullRead(ColumnVector vector, int row) {
       HiveDecimalWritable value = ((DecimalColumnVector) vector).vector[row];
-      return new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
+      BigDecimal decimal = new BigDecimal(BigInteger.valueOf(value.serialize64(value.scale())), value.scale());
+
+      Preconditions.checkArgument(value.precision() <= precision && precision <= 18,

Review comment:
       Nit: `precision <= 18` check can be moved into the constructor






[GitHub] [iceberg] rdblue commented on pull request #1271: Align the records written by GenericOrcWriter and SparkOrcWriter

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1271:
URL: https://github.com/apache/iceberg/pull/1271#issuecomment-670265634


   @openinx, I'm ready to merge this. Thanks for updating the tests! The only blocker is that the conflicts need to be fixed. Thank you!

