You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/04/04 20:23:15 UTC

[GitHub] [incubator-iceberg] rdblue opened a new pull request #892: Spark: Add ORC to parameterized tests

rdblue opened a new pull request #892: Spark: Add ORC to parameterized tests
URL: https://github.com/apache/incubator-iceberg/pull/892
 
 
   This adds ORC to Spark tests that run for each file format.
   
   This also updates `TestSparkReadProjection` and `TestFilteredScan` to use Iceberg generics instead of Avro generics for the test cases because ORC doesn't support Avro records. It's good to remove the use of Avro in favor of Iceberg generics as well.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on issue #892: Spark: Add ORC to parameterized tests

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #892: Spark: Add ORC to parameterized tests
URL: https://github.com/apache/incubator-iceberg/pull/892#issuecomment-610680239
 
 
   To follow up on my comment about a possible memory leak: I profiled the tests and didn't find a memory leak. Everything was correctly cleaned up after the test method. The solution was to avoid keeping so many copies of the test data in memory. ORC probably allocated more memory and that's why we hit the problem, but I don't think this is a problem with an ORC memory leak.
   
   What I did find surprising was that ORC is spending a lot more time allocating than Parquet or Avro. ORC was spending about 10x the amount of time for these tests than Parquet when writing, mostly in allocation.
   
   Here's a flame graph:
   ![orc-allocation](https://user-images.githubusercontent.com/87915/78730567-0c2e4880-78f2-11ea-994b-0fd3700ba45c.png)
   
   This is probably just initial allocations that are amortized over a large write. This test only uses 100 records.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #892: Spark: Add ORC to parameterized tests

Posted by GitBox <gi...@apache.org>.
rdsr commented on a change in pull request #892: Spark: Add ORC to parameterized tests
URL: https://github.com/apache/incubator-iceberg/pull/892#discussion_r404347638
 
 

 ##########
 File path: spark/src/test/java/org/apache/iceberg/spark/data/GenericsHelpers.java
 ##########
 @@ -0,0 +1,293 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import com.google.common.collect.Lists;
+import java.math.BigDecimal;
+import java.nio.ByteBuffer;
+import java.sql.Timestamp;
+import java.time.Instant;
+import java.time.LocalDate;
+import java.time.OffsetDateTime;
+import java.time.ZoneOffset;
+import java.time.temporal.ChronoUnit;
+import java.util.Collection;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import java.util.UUID;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.junit.Assert;
+import scala.collection.Seq;
+
+import static org.apache.iceberg.spark.SparkSchemaUtil.convert;
+import static scala.collection.JavaConverters.mapAsJavaMapConverter;
+import static scala.collection.JavaConverters.seqAsJavaListConverter;
+
+public class GenericsHelpers {
+  private GenericsHelpers() {
+  }
+
+  private static final OffsetDateTime EPOCH = Instant.ofEpochMilli(0L).atOffset(ZoneOffset.UTC);
+  private static final LocalDate EPOCH_DAY = EPOCH.toLocalDate();
+
+  public static void assertEqualsSafe(Types.StructType struct, Record rec, Row row) {
+    List<Types.NestedField> fields = struct.fields();
+    for (int i = 0; i < fields.size(); i += 1) {
+      Type fieldType = fields.get(i).type();
+
+      Object expectedValue = rec.get(i);
+      Object actualValue = row.get(i);
+
+      assertEqualsSafe(fieldType, expectedValue, actualValue);
+    }
+  }
+
+  private static void assertEqualsSafe(Types.ListType list, Collection<?> expected, List<?> actual) {
+    Type elementType = list.elementType();
+    List<?> expectedElements = Lists.newArrayList(expected);
+    for (int i = 0; i < expectedElements.size(); i += 1) {
+      Object expectedValue = expectedElements.get(i);
+      Object actualValue = actual.get(i);
+
+      assertEqualsSafe(elementType, expectedValue, actualValue);
+    }
+  }
+
+  private static void assertEqualsSafe(Types.MapType map,
+                                       Map<?, ?> expected, Map<?, ?> actual) {
 
 Review comment:
   seems like actual can have more elements. Is that ok?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #892: Spark: Add ORC to parameterized tests

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #892: Spark: Add ORC to parameterized tests
URL: https://github.com/apache/incubator-iceberg/pull/892#discussion_r404401235
 
 

 ##########
 File path: spark/src/test/java/org/apache/iceberg/spark/data/GenericsHelpers.java
 ##########
 @@ -0,0 +1,293 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import com.google.common.collect.Lists;
+import java.math.BigDecimal;
+import java.nio.ByteBuffer;
+import java.sql.Timestamp;
+import java.time.Instant;
+import java.time.LocalDate;
+import java.time.OffsetDateTime;
+import java.time.ZoneOffset;
+import java.time.temporal.ChronoUnit;
+import java.util.Collection;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import java.util.UUID;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.junit.Assert;
+import scala.collection.Seq;
+
+import static org.apache.iceberg.spark.SparkSchemaUtil.convert;
+import static scala.collection.JavaConverters.mapAsJavaMapConverter;
+import static scala.collection.JavaConverters.seqAsJavaListConverter;
+
+public class GenericsHelpers {
+  private GenericsHelpers() {
+  }
+
+  private static final OffsetDateTime EPOCH = Instant.ofEpochMilli(0L).atOffset(ZoneOffset.UTC);
+  private static final LocalDate EPOCH_DAY = EPOCH.toLocalDate();
+
+  public static void assertEqualsSafe(Types.StructType struct, Record rec, Row row) {
+    List<Types.NestedField> fields = struct.fields();
+    for (int i = 0; i < fields.size(); i += 1) {
+      Type fieldType = fields.get(i).type();
+
+      Object expectedValue = rec.get(i);
+      Object actualValue = row.get(i);
+
+      assertEqualsSafe(fieldType, expectedValue, actualValue);
+    }
+  }
+
+  private static void assertEqualsSafe(Types.ListType list, Collection<?> expected, List<?> actual) {
+    Type elementType = list.elementType();
+    List<?> expectedElements = Lists.newArrayList(expected);
+    for (int i = 0; i < expectedElements.size(); i += 1) {
+      Object expectedValue = expectedElements.get(i);
+      Object actualValue = actual.get(i);
+
+      assertEqualsSafe(elementType, expectedValue, actualValue);
+    }
+  }
+
+  private static void assertEqualsSafe(Types.MapType map,
+                                       Map<?, ?> expected, Map<?, ?> actual) {
 
 Review comment:
   Added an assertion for this.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #892: Spark: Add ORC to parameterized tests

Posted by GitBox <gi...@apache.org>.
rdsr commented on a change in pull request #892: Spark: Add ORC to parameterized tests
URL: https://github.com/apache/incubator-iceberg/pull/892#discussion_r404338412
 
 

 ##########
 File path: spark/src/test/java/org/apache/iceberg/spark/data/GenericsHelpers.java
 ##########
 @@ -0,0 +1,293 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data;
+
+import com.google.common.collect.Lists;
+import java.math.BigDecimal;
+import java.nio.ByteBuffer;
+import java.sql.Timestamp;
+import java.time.Instant;
+import java.time.LocalDate;
+import java.time.OffsetDateTime;
+import java.time.ZoneOffset;
+import java.time.temporal.ChronoUnit;
+import java.util.Collection;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import java.util.UUID;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.junit.Assert;
+import scala.collection.Seq;
+
+import static org.apache.iceberg.spark.SparkSchemaUtil.convert;
+import static scala.collection.JavaConverters.mapAsJavaMapConverter;
+import static scala.collection.JavaConverters.seqAsJavaListConverter;
+
+public class GenericsHelpers {
+  private GenericsHelpers() {
+  }
+
+  private static final OffsetDateTime EPOCH = Instant.ofEpochMilli(0L).atOffset(ZoneOffset.UTC);
+  private static final LocalDate EPOCH_DAY = EPOCH.toLocalDate();
+
+  public static void assertEqualsSafe(Types.StructType struct, Record rec, Row row) {
 
 Review comment:
   nit: might help to know which is the expected and which is the actual record for comparison

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue opened a new pull request #892: Spark: Add ORC to parameterized tests

Posted by GitBox <gi...@apache.org>.
rdblue opened a new pull request #892: Spark: Add ORC to parameterized tests
URL: https://github.com/apache/incubator-iceberg/pull/892
 
 
   This adds ORC to Spark tests that run for each file format.
   
   This also updates `TestSparkReadProjection` and `TestFilteredScan` to use Iceberg generics instead of Avro generics for the test cases because ORC doesn't support Avro records. It's good to remove the use of Avro in favor of Iceberg generics as well.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] danielcweeks merged pull request #892: Spark: Add ORC to parameterized tests

Posted by GitBox <gi...@apache.org>.
danielcweeks merged pull request #892: Spark: Add ORC to parameterized tests
URL: https://github.com/apache/incubator-iceberg/pull/892
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdsr commented on issue #892: Spark: Add ORC to parameterized tests

Posted by GitBox <gi...@apache.org>.
rdsr commented on issue #892: Spark: Add ORC to parameterized tests
URL: https://github.com/apache/incubator-iceberg/pull/892#issuecomment-610059565
 
 
   @shardulm94  you might be interested to have a look at this PR

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue closed pull request #892: Spark: Add ORC to parameterized tests

Posted by GitBox <gi...@apache.org>.
rdblue closed pull request #892: Spark: Add ORC to parameterized tests
URL: https://github.com/apache/incubator-iceberg/pull/892
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue closed pull request #892: Spark: Add ORC to parameterized tests

Posted by GitBox <gi...@apache.org>.
rdblue closed pull request #892: Spark: Add ORC to parameterized tests
URL: https://github.com/apache/incubator-iceberg/pull/892
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue opened a new pull request #892: Spark: Add ORC to parameterized tests

Posted by GitBox <gi...@apache.org>.
rdblue opened a new pull request #892: Spark: Add ORC to parameterized tests
URL: https://github.com/apache/incubator-iceberg/pull/892
 
 
   This adds ORC to Spark tests that run for each file format.
   
   This also updates `TestSparkReadProjection` and `TestFilteredScan` to use Iceberg generics instead of Avro generics for the test cases because ORC doesn't support Avro records. It's good to remove the use of Avro in favor of Iceberg generics as well.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on issue #892: Spark: Add ORC to parameterized tests

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #892: Spark: Add ORC to parameterized tests
URL: https://github.com/apache/incubator-iceberg/pull/892#issuecomment-610011554
 
 
   FYI, tests are passing locally, but not in Travis (OOM, then timeouts). I'm trying to fix it by increasing the heap size. This may signal that there is an ORC memory leak?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] danielcweeks commented on issue #892: Spark: Add ORC to parameterized tests

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on issue #892: Spark: Add ORC to parameterized tests
URL: https://github.com/apache/incubator-iceberg/pull/892#issuecomment-611614127
 
 
   +1

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org