You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/02 01:59:18 UTC

[GitHub] [iceberg] szehon-ho opened a new pull request #2777: Core : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

szehon-ho opened a new pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777


   When running add_files on partitioned source table, noticed that a significant time is spent in the Hive listPartitions call.
   
   It might be good to push down the filter to Hive (faster database query on the partitions in question, and less traffic getting serialized/deserialized across the wire).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#discussion_r665733226



##########
File path: spark3-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java
##########
@@ -318,6 +318,26 @@ public void addFilteredPartitionsToPartitioned() {
         sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void addFilteredPartitionsToPartitioned2() {
+    createCompositePartitionedTable("parquet");

Review comment:
       I think we need a SparkTableUtil test for the new getPartitions code as well, unless that's a pain




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#discussion_r668394463



##########
File path: spark2/src/test/java/org/apache/iceberg/spark/source/TestSparkTableUtil.java
##########
@@ -369,6 +376,84 @@ public void testImportUnpartitionedWithWhitespace() throws Exception {
     }
   }
 
+  public static class GetPartitions {
+
+    @Rule
+    public TemporaryFolder temp = new TemporaryFolder();
+
+    // This logic does not really depend on format
+    private final FileFormat format = FileFormat.PARQUET;
+
+    @Test
+    public void testPartitionScan() throws Exception {
+
+      List<ThreeColumnRecord> records = Lists.newArrayList(
+          new ThreeColumnRecord(1, "ab", "data"),
+          new ThreeColumnRecord(2, "b c", "data"),
+          new ThreeColumnRecord(1, "b c", "data"),
+          new ThreeColumnRecord(2, "ab", "data"));
+
+      File location = temp.newFolder("partitioned_table");
+      String tableName = "external_table";
+
+      spark.createDataFrame(records, ThreeColumnRecord.class)
+          .write().mode("overwrite").format(format.toString())
+          .partitionBy("c1", "c2").saveAsTable(tableName);
+
+      TableIdentifier source = spark.sessionState().sqlParser()
+          .parseTableIdentifier(tableName);
+
+      Map<String, String> partition1 = Stream.of(

Review comment:
       That said if you are looking for a better method here we do have all of guava, so you can use
   ImmutableMaps.of(....)
   
   Or
   
   ImmutableMaps.Builder
    .put
    .put
    .build
    
    ```java
        Map<String, String> newProperties = ImmutableMap.of("hello", "world");
    ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on a change in pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

kbendick commented on a change in pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#discussion_r665802993



##########
File path: versions.props
##########
@@ -18,6 +18,7 @@ org.apache.arrow:arrow-memory-netty = 2.0.0
 com.github.stephenc.findbugs:findbugs-annotations = 1.3.9-1
 software.amazon.awssdk:* = 2.15.7
 org.scala-lang:scala-library = 2.12.10
+org.scala-lang.modules:scala-java8-compat_2.12 = 0.8.0

Review comment:
       It would be nice to not have an additional explicit scala 2.12 dependency. I believe Spark2 can still be compiled for scala 2.11. Would this prohibit that?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a change in pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

szehon-ho commented on a change in pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#discussion_r668410968



##########
File path: spark2/src/test/java/org/apache/iceberg/spark/source/TestSparkTableUtil.java
##########
@@ -369,6 +376,84 @@ public void testImportUnpartitionedWithWhitespace() throws Exception {
     }
   }
 
+  public static class GetPartitions {
+
+    @Rule
+    public TemporaryFolder temp = new TemporaryFolder();
+
+    // This logic does not really depend on format
+    private final FileFormat format = FileFormat.PARQUET;
+
+    @Test
+    public void testPartitionScan() throws Exception {
+
+      List<ThreeColumnRecord> records = Lists.newArrayList(
+          new ThreeColumnRecord(1, "ab", "data"),
+          new ThreeColumnRecord(2, "b c", "data"),
+          new ThreeColumnRecord(1, "b c", "data"),
+          new ThreeColumnRecord(2, "ab", "data"));
+
+      File location = temp.newFolder("partitioned_table");
+      String tableName = "external_table";
+
+      spark.createDataFrame(records, ThreeColumnRecord.class)
+          .write().mode("overwrite").format(format.toString())
+          .partitionBy("c1", "c2").saveAsTable(tableName);
+
+      TableIdentifier source = spark.sessionState().sqlParser()
+          .parseTableIdentifier(tableName);
+
+      Map<String, String> partition1 = Stream.of(

Review comment:
       Oh right forgot we have guava, thanks, changed, its much cleaner now.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#discussion_r665728903



##########
File path: versions.props
##########
@@ -18,6 +18,7 @@ org.apache.arrow:arrow-memory-netty = 2.0.0
 com.github.stephenc.findbugs:findbugs-annotations = 1.3.9-1
 software.amazon.awssdk:* = 2.15.7
 org.scala-lang:scala-library = 2.12.10
+org.scala-lang.modules:scala-java8-compat_2.12 = 0.8.0

Review comment:
       I think we can just use Optional.apply() instead of using the option converter here




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a change in pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

szehon-ho commented on a change in pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#discussion_r668411024



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -151,15 +154,21 @@ private SparkTableUtil() {
    *
    * @param spark a Spark session
    * @param tableIdent a table identifier
+   * @param partitionFilter the partition filter
    * @return all table's partitions
    */
-  public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent) {
+  public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent,

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#discussion_r665726596



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -151,15 +154,21 @@ private SparkTableUtil() {
    *
    * @param spark a Spark session
    * @param tableIdent a table identifier
+   * @param partitionFilter the partition filter
    * @return all table's partitions
    */
-  public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent) {
+  public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent,
+                                                   Optional<Map<String, String>> partitionFilter) {
     try {
       SessionCatalog catalog = spark.sessionState().catalog();
       CatalogTable catalogTable = catalog.getTableMetadata(tableIdent);
 
-      Seq<CatalogTablePartition> partitions = catalog.listPartitions(tableIdent, Option.empty());
+      Option<scala.collection.immutable.Map<String, String>> partSpec =

Review comment:
       I am not smart enough to hold this whole block in my head. Can we break this down into multiple lines?
   
   One line to make the scala Map
   One line to wrap it in optional?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#discussion_r668394463



##########
File path: spark2/src/test/java/org/apache/iceberg/spark/source/TestSparkTableUtil.java
##########
@@ -369,6 +376,84 @@ public void testImportUnpartitionedWithWhitespace() throws Exception {
     }
   }
 
+  public static class GetPartitions {
+
+    @Rule
+    public TemporaryFolder temp = new TemporaryFolder();
+
+    // This logic does not really depend on format
+    private final FileFormat format = FileFormat.PARQUET;
+
+    @Test
+    public void testPartitionScan() throws Exception {
+
+      List<ThreeColumnRecord> records = Lists.newArrayList(
+          new ThreeColumnRecord(1, "ab", "data"),
+          new ThreeColumnRecord(2, "b c", "data"),
+          new ThreeColumnRecord(1, "b c", "data"),
+          new ThreeColumnRecord(2, "ab", "data"));
+
+      File location = temp.newFolder("partitioned_table");
+      String tableName = "external_table";
+
+      spark.createDataFrame(records, ThreeColumnRecord.class)
+          .write().mode("overwrite").format(format.toString())
+          .partitionBy("c1", "c2").saveAsTable(tableName);
+
+      TableIdentifier source = spark.sessionState().sqlParser()
+          .parseTableIdentifier(tableName);
+
+      Map<String, String> partition1 = Stream.of(

Review comment:
       That said if you are looking for a better method here we do have all of guava, so you can use
   Maps.of( k1, v1, k2,v2, ....)
   
   Or 
   ImmutableMaps.of(....)
   
   Or
   
   ImmutableMaps.Builder




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

szehon-ho commented on pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#issuecomment-875946902


   @RussellSpitzer  could you take a look if you have some time?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#discussion_r665730907



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -151,15 +154,21 @@ private SparkTableUtil() {
    *
    * @param spark a Spark session
    * @param tableIdent a table identifier
+   * @param partitionFilter the partition filter
    * @return all table's partitions
    */
-  public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent) {
+  public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent,
+                                                   Optional<Map<String, String>> partitionFilter) {
     try {
       SessionCatalog catalog = spark.sessionState().catalog();
       CatalogTable catalogTable = catalog.getTableMetadata(tableIdent);
 
-      Seq<CatalogTablePartition> partitions = catalog.listPartitions(tableIdent, Option.empty());
+      Option<scala.collection.immutable.Map<String, String>> partSpec =

Review comment:
       Or could just do something like
   ```java      
   Map<String, String> pkfilter = Maps.newHashMap();
         Option<scala.collection.mutable.Map<String, String>> scalaFilter =
             Option.apply(scala.collection.JavaConverters.mapAsScalaMapConverter(pkfilter).asScala());
    ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#discussion_r665727186



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -151,15 +154,21 @@ private SparkTableUtil() {
    *
    * @param spark a Spark session
    * @param tableIdent a table identifier
+   * @param partitionFilter the partition filter
    * @return all table's partitions
    */
-  public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent) {
+  public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent,

Review comment:
       Do we benefit from having this as a Java Optional? Since we have to immediately convert it to Scala maybe we should just pass a normal Map?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#discussion_r668394463



##########
File path: spark2/src/test/java/org/apache/iceberg/spark/source/TestSparkTableUtil.java
##########
@@ -369,6 +376,84 @@ public void testImportUnpartitionedWithWhitespace() throws Exception {
     }
   }
 
+  public static class GetPartitions {
+
+    @Rule
+    public TemporaryFolder temp = new TemporaryFolder();
+
+    // This logic does not really depend on format
+    private final FileFormat format = FileFormat.PARQUET;
+
+    @Test
+    public void testPartitionScan() throws Exception {
+
+      List<ThreeColumnRecord> records = Lists.newArrayList(
+          new ThreeColumnRecord(1, "ab", "data"),
+          new ThreeColumnRecord(2, "b c", "data"),
+          new ThreeColumnRecord(1, "b c", "data"),
+          new ThreeColumnRecord(2, "ab", "data"));
+
+      File location = temp.newFolder("partitioned_table");
+      String tableName = "external_table";
+
+      spark.createDataFrame(records, ThreeColumnRecord.class)
+          .write().mode("overwrite").format(format.toString())
+          .partitionBy("c1", "c2").saveAsTable(tableName);
+
+      TableIdentifier source = spark.sessionState().sqlParser()
+          .parseTableIdentifier(tableName);
+
+      Map<String, String> partition1 = Stream.of(

Review comment:
       That said if you are looking for a better method here we do have all of guava, so you can use
   ImmutableMaps.of(....)
   
   Or
   
   ImmutableMaps.Builder
    .put
    .put
    .build




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#issuecomment-879161879


   Thanks @szehon-ho !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#discussion_r665731756



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -151,15 +154,21 @@ private SparkTableUtil() {
    *
    * @param spark a Spark session
    * @param tableIdent a table identifier
+   * @param partitionFilter the partition filter
    * @return all table's partitions
    */
-  public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent) {
+  public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent,
+                                                   Optional<Map<String, String>> partitionFilter) {
     try {
       SessionCatalog catalog = spark.sessionState().catalog();
       CatalogTable catalogTable = catalog.getTableMetadata(tableIdent);
 
-      Seq<CatalogTablePartition> partitions = catalog.listPartitions(tableIdent, Option.empty());
+      Option<scala.collection.immutable.Map<String, String>> partSpec =

Review comment:
       But you can static import JavaConverters here too to make it shorter




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#discussion_r665730907



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -151,15 +154,21 @@ private SparkTableUtil() {
    *
    * @param spark a Spark session
    * @param tableIdent a table identifier
+   * @param partitionFilter the partition filter
    * @return all table's partitions
    */
-  public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent) {
+  public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent,
+                                                   Optional<Map<String, String>> partitionFilter) {
     try {
       SessionCatalog catalog = spark.sessionState().catalog();
       CatalogTable catalogTable = catalog.getTableMetadata(tableIdent);
 
-      Seq<CatalogTablePartition> partitions = catalog.listPartitions(tableIdent, Option.empty());
+      Option<scala.collection.immutable.Map<String, String>> partSpec =

Review comment:
       Or could just do something like
   ```java      
   Option<scala.collection.mutable.Map<String, String>> scalaFilter =
             Option.apply(scala.collection.JavaConverters.mapAsScalaMapConverter(pkfilter).asScala());
    ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer merged pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

RussellSpitzer merged pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a change in pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

szehon-ho commented on a change in pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#discussion_r668411132



##########
File path: versions.props
##########
@@ -18,6 +18,7 @@ org.apache.arrow:arrow-memory-netty = 2.0.0
 com.github.stephenc.findbugs:findbugs-annotations = 1.3.9-1
 software.amazon.awssdk:* = 2.15.7
 org.scala-lang:scala-library = 2.12.10
+org.scala-lang.modules:scala-java8-compat_2.12 = 0.8.0

Review comment:
       Yep you guys convinced me, got rid of it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#discussion_r668393897



##########
File path: spark2/src/test/java/org/apache/iceberg/spark/source/TestSparkTableUtil.java
##########
@@ -369,6 +376,84 @@ public void testImportUnpartitionedWithWhitespace() throws Exception {
     }
   }
 
+  public static class GetPartitions {
+
+    @Rule
+    public TemporaryFolder temp = new TemporaryFolder();
+
+    // This logic does not really depend on format
+    private final FileFormat format = FileFormat.PARQUET;
+
+    @Test
+    public void testPartitionScan() throws Exception {
+
+      List<ThreeColumnRecord> records = Lists.newArrayList(
+          new ThreeColumnRecord(1, "ab", "data"),
+          new ThreeColumnRecord(2, "b c", "data"),
+          new ThreeColumnRecord(1, "b c", "data"),
+          new ThreeColumnRecord(2, "ab", "data"));
+
+      File location = temp.newFolder("partitioned_table");
+      String tableName = "external_table";
+
+      spark.createDataFrame(records, ThreeColumnRecord.class)
+          .write().mode("overwrite").format(format.toString())
+          .partitionBy("c1", "c2").saveAsTable(tableName);
+
+      TableIdentifier source = spark.sessionState().sqlParser()
+          .parseTableIdentifier(tableName);
+
+      Map<String, String> partition1 = Stream.of(

Review comment:
       :( I think we still are running against java 8




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#discussion_r665727927



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -151,15 +154,21 @@ private SparkTableUtil() {
    *
    * @param spark a Spark session
    * @param tableIdent a table identifier
+   * @param partitionFilter the partition filter
    * @return all table's partitions
    */
-  public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent) {
+  public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent,

Review comment:
       I think the "filterPartitions" function would just use an empty map as "no filter"?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a change in pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

szehon-ho commented on a change in pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#discussion_r668360368



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java
##########
@@ -151,15 +152,23 @@ private SparkTableUtil() {
    *
    * @param spark a Spark session
    * @param tableIdent a table identifier
+   * @param partitionFilter partition filter, or null if no filter
    * @return all table's partitions
    */
-  public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent) {
+  public static List<SparkPartition> getPartitions(SparkSession spark, TableIdentifier tableIdent,
+                                                   Map<String, String> partitionFilter) {
     try {
       SessionCatalog catalog = spark.sessionState().catalog();
       CatalogTable catalogTable = catalog.getTableMetadata(tableIdent);
 
-      Seq<CatalogTablePartition> partitions = catalog.listPartitions(tableIdent, Option.empty());
-
+      Option<scala.collection.immutable.Map<String, String>> scalaPartitionFilter;
+      if (partitionFilter != null && !partitionFilter.isEmpty()) {
+        scalaPartitionFilter = Option.apply(JavaConverters.mapAsScalaMapConverter(partitionFilter).asScala()
+            .toMap(Predef.conforms()));

Review comment:
       The scala api requires immuable map, hence this extra step




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a change in pull request #2777: Spark : Add Files Perf improvement by push down partition filter to Spark/Hive catalog

Posted by GitBox <gi...@apache.org>.

szehon-ho commented on a change in pull request #2777:
URL: https://github.com/apache/iceberg/pull/2777#discussion_r668360043



##########
File path: spark2/src/test/java/org/apache/iceberg/spark/source/TestSparkTableUtil.java
##########
@@ -369,6 +376,84 @@ public void testImportUnpartitionedWithWhitespace() throws Exception {
     }
   }
 
+  public static class GetPartitions {
+
+    @Rule
+    public TemporaryFolder temp = new TemporaryFolder();
+
+    // This logic does not really depend on format
+    private final FileFormat format = FileFormat.PARQUET;
+
+    @Test
+    public void testPartitionScan() throws Exception {
+
+      List<ThreeColumnRecord> records = Lists.newArrayList(
+          new ThreeColumnRecord(1, "ab", "data"),
+          new ThreeColumnRecord(2, "b c", "data"),
+          new ThreeColumnRecord(1, "b c", "data"),
+          new ThreeColumnRecord(2, "ab", "data"));
+
+      File location = temp.newFolder("partitioned_table");
+      String tableName = "external_table";
+
+      spark.createDataFrame(records, ThreeColumnRecord.class)
+          .write().mode("overwrite").format(format.toString())
+          .partitionBy("c1", "c2").saveAsTable(tableName);
+
+      TableIdentifier source = spark.sessionState().sqlParser()
+          .parseTableIdentifier(tableName);
+
+      Map<String, String> partition1 = Stream.of(

Review comment:
       I suppose we cannot use Java9 factory methods in Spark2 project




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org