You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/05/11 13:03:25 UTC

[GitHub] [hive] lcspinter opened a new pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

lcspinter opened a new pull request #2261:
URL: https://github.com/apache/hive/pull/2261


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://cwiki.apache.org/confluence/display/Hive/HowToContribute
     2. Ensure that you have created an issue on the Hive project JIRA: https://issues.apache.org/jira/projects/HIVE/summary
     3. Ensure you have added or run the appropriate tests for your PR: 
     4. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]HIVE-XXXXX:  Your PR title ...'.
     5. Be sure to keep the PR description updated to reflect all changes.
     6. Please write your PR title to summarize what this PR proposes.
     7. If possible, provide a concise example to reproduce the issue for a faster review.
   
   -->
   
   ### What changes were proposed in this pull request?
   Introduce a caching layer to store iceberg table objects used within the same query. The entries are saved with the following key format `query_id.catalog_name.table_identifier`.
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   Catalogs.loadTable(configuration, props) is called numerous times within the same query which can be expensive. 
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description, screenshot and/or a reproducable example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Hive versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   Manual test, unit test
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636196743



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.Properties;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.ql.QueryState;
+import org.apache.hadoop.hive.ql.session.SessionState;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.mr.Catalogs;
+
+public class IcebergTableUtil {
+
+  private IcebergTableUtil() {
+
+  }
+
+  /**
+   * Load the iceberg table either from the {@link QueryState} or through the configured catalog.
+   * @param configuration a Hadoop configuration
+   * @param properties controlling properties
+   * @return
+   */
+  static Table getTable(Configuration configuration, Properties properties) {
+    // look for the table object stored in the query state. If it's null, it means the table was not loaded yet
+    // within the same query therefore we claim it through the Catalogs API and then store it in query state.
+    QueryState queryState = SessionState.get()
+        .getQueryState(configuration.get(HiveConf.ConfVars.HIVEQUERYID.varname));
+    Table table = null;
+    if (queryState != null) {

Review comment:
       Do we expect the query state to be null at any point during compilation? If so, that would result in constant reloading of the table. So maybe we should either remove the null checks to simplify the code, or at least log it if it's not there? What do you think?

##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.Properties;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.ql.QueryState;
+import org.apache.hadoop.hive.ql.session.SessionState;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.mr.Catalogs;
+
+public class IcebergTableUtil {
+
+  private IcebergTableUtil() {
+
+  }
+
+  /**
+   * Load the iceberg table either from the {@link QueryState} or through the configured catalog.
+   * @param configuration a Hadoop configuration
+   * @param properties controlling properties
+   * @return
+   */
+  static Table getTable(Configuration configuration, Properties properties) {
+    // look for the table object stored in the query state. If it's null, it means the table was not loaded yet

Review comment:
       nit: I think this comment belongs to the javadoc more naturally, but it's up to you, I don't feel strongly on that




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636874439



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.Properties;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.ql.QueryState;
+import org.apache.hadoop.hive.ql.session.SessionState;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.mr.Catalogs;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class IcebergTableUtil {
+
+  private static final Logger LOG = LoggerFactory.getLogger(IcebergTableUtil.class);
+
+  private IcebergTableUtil() {
+
+  }
+
+  /**
+   * Load the iceberg table either from the {@link QueryState} or through the configured catalog. Look for the table
+   * object stored in the query state. If it's null, it means the table was not loaded yet within the same query
+   * therefore we claim it through the Catalogs API and then store it in query state.
+   * @param configuration a Hadoop configuration
+   * @param properties controlling properties
+   * @return
+   */
+  static Table getTable(Configuration configuration, Properties properties) {
+    QueryState queryState = SessionState.get()

Review comment:
       Is it possible for `SessionState.get()` to return null? What happens on TezAM side?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r630926018



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -284,4 +333,40 @@ private static String getCatalogType(Configuration conf, String catalogName) {
       }
     }
   }
+
+  public static class TableCache {
+    private static final Cache<String, Table> tableCache = Caffeine.newBuilder()
+        .expireAfterAccess(12, TimeUnit.HOURS).build();

Review comment:
       Shall we decrease this a bit? Perhaps 30 minutes would also suffice? Not sure what a "long" compilation time means for Hive usually. Decreasing it could give us some extra safety against cache pollution in case the hook-based eviction doesn't always go smoothly. Worst case in a very long compilation scenario, we might reload the table again eventually, but I'm assuming compilations that take 30+ minutes should be pretty rare?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r630915112



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -284,4 +333,40 @@ private static String getCatalogType(Configuration conf, String catalogName) {
       }
     }
   }
+
+  public static class TableCache {
+    private static final Cache<String, Table> tableCache = Caffeine.newBuilder()
+        .expireAfterAccess(12, TimeUnit.HOURS).build();
+    private static final Set<String> keys = new TreeSet<>();
+
+    public static void removeTable(Configuration conf) {
+      String queryId = conf.get(HiveConf.ConfVars.HIVEQUERYID.varname);
+      if (queryId != null && !queryId.isEmpty()) {
+        Set<String> queryKeys = keys.stream().filter(k -> k.startsWith(queryId)).collect(Collectors.toSet());
+        tableCache.invalidateAll(queryKeys);
+        keys.removeAll(queryKeys);
+      }

Review comment:
       shall we add an else branch here with some warn logging? It might be a problem here worth noting if the query id is not present in the config, because that will cause cache pollution over time. In the other methods, if the query id is not present we just reload the table, so that's not as big of a deal




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r634346018



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -284,4 +334,42 @@ private static String getCatalogType(Configuration conf, String catalogName) {
       }
     }
   }
+
+  public static class TableCache {
+    private static final Cache<String, Table> tableCache = Caffeine.newBuilder()
+        .expireAfterAccess(12, TimeUnit.HOURS).build();
+    private static final Logger LOG = LoggerFactory.getLogger(TableCache.class);
+
+    public static void removeTables(Configuration conf) {
+      String queryId = conf.get(HiveConf.ConfVars.HIVEQUERYID.varname);
+      if (queryId != null && !queryId.isEmpty()) {
+        Set<String> queryKeys = tableCache.asMap().keySet().stream()
+            .filter(k -> k.startsWith(queryId)).collect(Collectors.toSet());
+        tableCache.invalidateAll(queryKeys);
+      } else {
+        LOG.warn("Query id is not present in config, therefore no Iceberg table object is removed " +
+            "from the table cache.");
+      }
+    }
+
+    public static Table getTable(Configuration conf, String catalogName, String tableIdentifier) {
+      String queryId = conf.get(HiveConf.ConfVars.HIVEQUERYID.varname);
+      if (queryId != null  && !queryId.isEmpty()) {
+        return tableCache.getIfPresent(getKey(queryId, catalogName, tableIdentifier));
+      }
+      return null;
+    }
+
+    public static void addTable(Configuration conf, String catalogName, String tableIdentifier, Table table) {

Review comment:
       The `tableCache` doesn't have a put method that accepts functional interfaces. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r632404816



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -284,4 +334,42 @@ private static String getCatalogType(Configuration conf, String catalogName) {
       }
     }
   }
+
+  public static class TableCache {
+    private static final Cache<String, Table> tableCache = Caffeine.newBuilder()
+        .expireAfterAccess(12, TimeUnit.HOURS).build();
+    private static final Logger LOG = LoggerFactory.getLogger(TableCache.class);
+
+    public static void removeTables(Configuration conf) {
+      String queryId = conf.get(HiveConf.ConfVars.HIVEQUERYID.varname);
+      if (queryId != null && !queryId.isEmpty()) {
+        Set<String> queryKeys = tableCache.asMap().keySet().stream()
+            .filter(k -> k.startsWith(queryId)).collect(Collectors.toSet());
+        tableCache.invalidateAll(queryKeys);
+      } else {
+        LOG.warn("Query id is not present in config, therefore no Iceberg table object is removed " +

Review comment:
       Is this really a `warn`? Do we expect at least one table it the cache every time?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r632409952



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java
##########
@@ -307,7 +307,7 @@ private void commitTable(FileIO io, ExecutorService executor, JobContext jobCont
     if (catalogName != null) {
       catalogProperties.put(InputFormatConfig.CATALOG_NAME, catalogName);
     }
-    Table table = Catalogs.loadTable(conf, catalogProperties);
+    Table table = Catalogs.loadTableSkipCache(conf, catalogProperties);

Review comment:
       Should it be `Catalogs.invalidate` if the catalog implements `CachingCatalog` interface?
   This way we break the compatibility with other Catalogs.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter merged pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter merged pull request #2261:
URL: https://github.com/apache/hive/pull/2261


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636875464



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.Properties;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.ql.QueryState;
+import org.apache.hadoop.hive.ql.session.SessionState;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.mr.Catalogs;
+
+public class IcebergTableUtil {
+
+  private IcebergTableUtil() {
+
+  }
+
+  /**
+   * Load the iceberg table either from the {@link QueryState} or through the configured catalog.
+   * @param configuration a Hadoop configuration
+   * @param properties controlling properties
+   * @return
+   */
+  static Table getTable(Configuration configuration, Properties properties) {
+    // look for the table object stored in the query state. If it's null, it means the table was not loaded yet
+    // within the same query therefore we claim it through the Catalogs API and then store it in query state.
+    QueryState queryState = SessionState.get()
+        .getQueryState(configuration.get(HiveConf.ConfVars.HIVEQUERYID.varname));
+    Table table = null;
+    if (queryState != null) {

Review comment:
       Maybe Optional<QueryState> ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636874752



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.Properties;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.ql.QueryState;
+import org.apache.hadoop.hive.ql.session.SessionState;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.mr.Catalogs;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class IcebergTableUtil {
+
+  private static final Logger LOG = LoggerFactory.getLogger(IcebergTableUtil.class);
+
+  private IcebergTableUtil() {
+
+  }
+
+  /**
+   * Load the iceberg table either from the {@link QueryState} or through the configured catalog. Look for the table
+   * object stored in the query state. If it's null, it means the table was not loaded yet within the same query
+   * therefore we claim it through the Catalogs API and then store it in query state.
+   * @param configuration a Hadoop configuration
+   * @param properties controlling properties
+   * @return
+   */
+  static Table getTable(Configuration configuration, Properties properties) {
+    QueryState queryState = SessionState.get()
+        .getQueryState(configuration.get(HiveConf.ConfVars.HIVEQUERYID.varname));
+    Table table = null;
+    String tableIdentifier = properties.getProperty(Catalogs.NAME);
+    if (queryState != null) {
+      table = (Table) queryState.getResource(tableIdentifier);
+    } else {
+      LOG.debug("QueryState is not available in SessionState. Loading {} from configured catalog.", tableIdentifier);
+    }
+    if (table == null) {

Review comment:
       nit: new lines

##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.Properties;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.ql.QueryState;
+import org.apache.hadoop.hive.ql.session.SessionState;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.mr.Catalogs;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class IcebergTableUtil {
+
+  private static final Logger LOG = LoggerFactory.getLogger(IcebergTableUtil.class);
+
+  private IcebergTableUtil() {
+
+  }
+
+  /**
+   * Load the iceberg table either from the {@link QueryState} or through the configured catalog. Look for the table
+   * object stored in the query state. If it's null, it means the table was not loaded yet within the same query
+   * therefore we claim it through the Catalogs API and then store it in query state.
+   * @param configuration a Hadoop configuration
+   * @param properties controlling properties
+   * @return
+   */
+  static Table getTable(Configuration configuration, Properties properties) {
+    QueryState queryState = SessionState.get()
+        .getQueryState(configuration.get(HiveConf.ConfVars.HIVEQUERYID.varname));
+    Table table = null;
+    String tableIdentifier = properties.getProperty(Catalogs.NAME);
+    if (queryState != null) {
+      table = (Table) queryState.getResource(tableIdentifier);
+    } else {
+      LOG.debug("QueryState is not available in SessionState. Loading {} from configured catalog.", tableIdentifier);
+    }
+    if (table == null) {

Review comment:
       nit: new line




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r631066543



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -284,4 +333,40 @@ private static String getCatalogType(Configuration conf, String catalogName) {
       }
     }
   }
+
+  public static class TableCache {
+    private static final Cache<String, Table> tableCache = Caffeine.newBuilder()
+        .expireAfterAccess(12, TimeUnit.HOURS).build();

Review comment:
       I kept the 12 hours threshold since we want to keep the cache entries until the execution is finished.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r634500584



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -284,4 +334,42 @@ private static String getCatalogType(Configuration conf, String catalogName) {
       }
     }
   }
+
+  public static class TableCache {
+    private static final Cache<String, Table> tableCache = Caffeine.newBuilder()
+        .expireAfterAccess(12, TimeUnit.HOURS).build();
+    private static final Logger LOG = LoggerFactory.getLogger(TableCache.class);
+
+    public static void removeTables(Configuration conf) {
+      String queryId = conf.get(HiveConf.ConfVars.HIVEQUERYID.varname);
+      if (queryId != null && !queryId.isEmpty()) {
+        Set<String> queryKeys = tableCache.asMap().keySet().stream()
+            .filter(k -> k.startsWith(queryId)).collect(Collectors.toSet());
+        tableCache.invalidateAll(queryKeys);
+      } else {
+        LOG.warn("Query id is not present in config, therefore no Iceberg table object is removed " +

Review comment:
       I think what we do expect is that the `hive.query.id` is always populated in the config. If it's absent, that's a warning sign and something which could prevent us from evicting from the cache. This could be a useful log line in case we see that the cache gets overly large and we're looking for the root cause why the iceberg tables are not evicted.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636169311



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -97,13 +98,15 @@ public static Table loadTable(Configuration conf, Properties props) {
             props.getProperty(InputFormatConfig.CATALOG_NAME));
   }
 
-  private static Table loadTable(Configuration conf, String tableIdentifier, String tableLocation,
-                                 String catalogName) {
+  private static Table loadTable(Configuration conf, String tableIdentifier, String tableLocation, String catalogName) {
     Optional<Catalog> catalog = loadCatalog(conf, catalogName);
 
+    Table cachedTable = null;

Review comment:
       Not at all. The revert was not 100% successful. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r639094461



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -87,8 +87,8 @@ public static Table loadTable(Configuration conf) {
    * The table identifier ({@link Catalogs#NAME}) and the catalog name ({@link InputFormatConfig#CATALOG_NAME}),
    * or table path ({@link Catalogs#LOCATION}) should be specified by the controlling properties.
    * <p>
-   * Used by HiveIcebergSerDe and HiveIcebergStorageHandler
-   * @param conf a Hadoop
+   * Used by HiveIcebergSerDe and HiveIcebergStorageHandler.
+   * @param conf a Hadoop configuration

Review comment:
       I would rather keep it, since it corrects the Javadoc. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r632407545



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -284,4 +334,42 @@ private static String getCatalogType(Configuration conf, String catalogName) {
       }
     }
   }
+
+  public static class TableCache {
+    private static final Cache<String, Table> tableCache = Caffeine.newBuilder()
+        .expireAfterAccess(12, TimeUnit.HOURS).build();
+    private static final Logger LOG = LoggerFactory.getLogger(TableCache.class);
+
+    public static void removeTables(Configuration conf) {

Review comment:
       Do we need the whole `conf` here? Shouldn't we separate out the concerns and use only an `executionId` instead?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r639572145



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/Driver.java
##########
@@ -213,7 +216,14 @@ private void runInternal(String command, boolean alreadyCompiled) throws Command
         releaseResources();
       }
 
+      if (SessionState.get() != null) {
+        // Clean up every resource object stored in the query state
+        driverContext.getQueryState().removeResources();

Review comment:
       The QueryState in the DriverContext should be removed as well when the query finishes. Otherwise we still would have a leaked QueryState which is a memory leak itself. So this only makes the leak smaller, but does not fix it (if we have any)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r638615019



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/Driver.java
##########
@@ -213,7 +216,14 @@ private void runInternal(String command, boolean alreadyCompiled) throws Command
         releaseResources();
       }
 
+      if (SessionState.get() != null) {
+        // Clean up every resource object stored in the query state
+        driverContext.getQueryState().removeResources();

Review comment:
       Do we need this one?
   We remove the `QueryState` anyway...
   And then we do not need `removeResources` either




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636151988



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/QueryState.java
##########
@@ -59,6 +60,11 @@
 
   static public final String USERID_TAG = "userid";
 
+  /**
+   * map of tables involved in the query.
+   */
+  private final Map<String, Object> tableMap = new HashMap<>();

Review comment:
       Can we make this a generic map so we can store things other than tables here? We will need this container also for other info that - in the absence of this new query state feature - we previously had to force into the conf, such as write commit info (jobID, vertexID, taskNum), CTAS info (is the query ctas? what's the ctas target table name), etc.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636211834



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.Properties;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.ql.QueryState;
+import org.apache.hadoop.hive.ql.session.SessionState;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.mr.Catalogs;
+
+public class IcebergTableUtil {
+
+  private IcebergTableUtil() {
+
+  }
+
+  /**
+   * Load the iceberg table either from the {@link QueryState} or through the configured catalog.
+   * @param configuration a Hadoop configuration
+   * @param properties controlling properties
+   * @return
+   */
+  static Table getTable(Configuration configuration, Properties properties) {
+    // look for the table object stored in the query state. If it's null, it means the table was not loaded yet
+    // within the same query therefore we claim it through the Catalogs API and then store it in query state.
+    QueryState queryState = SessionState.get()
+        .getQueryState(configuration.get(HiveConf.ConfVars.HIVEQUERYID.varname));
+    Table table = null;
+    if (queryState != null) {

Review comment:
       Right now we don't expect it to be null, but I'm not sure about the availability of SessionState object. I noticed this comment in the class:  `SessionState is not available in runtime and Hive.get().getConf() is not safe to call`.
   So I put there this safeguard, to be more robust in case the SessionState is lost, but I agree some additional logging doesn't hurt. 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r638514043



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.Properties;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.ql.QueryState;
+import org.apache.hadoop.hive.ql.session.SessionState;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.mr.Catalogs;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class IcebergTableUtil {
+
+  private static final Logger LOG = LoggerFactory.getLogger(IcebergTableUtil.class);
+
+  private IcebergTableUtil() {
+
+  }
+
+  /**
+   * Load the iceberg table either from the {@link QueryState} or through the configured catalog. Look for the table
+   * object stored in the query state. If it's null, it means the table was not loaded yet within the same query
+   * therefore we claim it through the Catalogs API and then store it in query state.
+   * @param configuration a Hadoop configuration
+   * @param properties controlling properties
+   * @return
+   */
+  static Table getTable(Configuration configuration, Properties properties) {
+    QueryState queryState = SessionState.get()

Review comment:
       I added an additional check for null SessionState. This method is called only from HS2 side.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r634341370



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -284,4 +334,42 @@ private static String getCatalogType(Configuration conf, String catalogName) {
       }
     }
   }
+
+  public static class TableCache {
+    private static final Cache<String, Table> tableCache = Caffeine.newBuilder()
+        .expireAfterAccess(12, TimeUnit.HOURS).build();
+    private static final Logger LOG = LoggerFactory.getLogger(TableCache.class);
+
+    public static void removeTables(Configuration conf) {
+      String queryId = conf.get(HiveConf.ConfVars.HIVEQUERYID.varname);
+      if (queryId != null && !queryId.isEmpty()) {
+        Set<String> queryKeys = tableCache.asMap().keySet().stream()
+            .filter(k -> k.startsWith(queryId)).collect(Collectors.toSet());
+        tableCache.invalidateAll(queryKeys);
+      } else {
+        LOG.warn("Query id is not present in config, therefore no Iceberg table object is removed " +

Review comment:
       No, we don't except, but I guess we should signal that the query id is not present config, which could lead to cache pollution. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r630918129



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergQueryLifeTimeHook.java
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import org.apache.hadoop.hive.ql.hooks.QueryLifeTimeHook;
+import org.apache.hadoop.hive.ql.hooks.QueryLifeTimeHookContext;
+import org.apache.iceberg.mr.Catalogs;
+
+public class HiveIcebergQueryLifeTimeHook implements QueryLifeTimeHook {
+  @Override
+  public void beforeCompile(QueryLifeTimeHookContext ctx) {
+
+  }
+
+  @Override
+  public void afterCompile(QueryLifeTimeHookContext ctx, boolean hasError) {
+  }
+
+  @Override
+  public void beforeExecution(QueryLifeTimeHookContext ctx) {
+    Catalogs.TableCache.removeTable(ctx.getHiveConf());

Review comment:
       Shouldn't we call this in `afterCompile`? If there's a compilation error at some point, we won't even get to this method, no?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r634359149



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java
##########
@@ -307,7 +307,7 @@ private void commitTable(FileIO io, ExecutorService executor, JobContext jobCont
     if (catalogName != null) {
       catalogProperties.put(InputFormatConfig.CATALOG_NAME, catalogName);
     }
-    Table table = Catalogs.loadTable(conf, catalogProperties);
+    Table table = Catalogs.loadTableSkipCache(conf, catalogProperties);

Review comment:
       I'm not sure I understand your point here. Could you please add a bit more details? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r632405632



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -284,4 +334,42 @@ private static String getCatalogType(Configuration conf, String catalogName) {
       }
     }
   }
+
+  public static class TableCache {
+    private static final Cache<String, Table> tableCache = Caffeine.newBuilder()
+        .expireAfterAccess(12, TimeUnit.HOURS).build();
+    private static final Logger LOG = LoggerFactory.getLogger(TableCache.class);
+
+    public static void removeTables(Configuration conf) {
+      String queryId = conf.get(HiveConf.ConfVars.HIVEQUERYID.varname);
+      if (queryId != null && !queryId.isEmpty()) {
+        Set<String> queryKeys = tableCache.asMap().keySet().stream()
+            .filter(k -> k.startsWith(queryId)).collect(Collectors.toSet());
+        tableCache.invalidateAll(queryKeys);
+      } else {
+        LOG.warn("Query id is not present in config, therefore no Iceberg table object is removed " +
+            "from the table cache.");
+      }
+    }
+
+    public static Table getTable(Configuration conf, String catalogName, String tableIdentifier) {
+      String queryId = conf.get(HiveConf.ConfVars.HIVEQUERYID.varname);
+      if (queryId != null  && !queryId.isEmpty()) {
+        return tableCache.getIfPresent(getKey(queryId, catalogName, tableIdentifier));
+      }
+      return null;
+    }
+
+    public static void addTable(Configuration conf, String catalogName, String tableIdentifier, Table table) {

Review comment:
       Why not use lambda to generate the table object if not present? Like it is done with `putIfAbsent`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on pull request #2261:
URL: https://github.com/apache/hive/pull/2261#issuecomment-849410797


   LGTM, pending tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r630909697



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -284,4 +333,40 @@ private static String getCatalogType(Configuration conf, String catalogName) {
       }
     }
   }
+
+  public static class TableCache {
+    private static final Cache<String, Table> tableCache = Caffeine.newBuilder()
+        .expireAfterAccess(12, TimeUnit.HOURS).build();
+    private static final Set<String> keys = new TreeSet<>();

Review comment:
       Instead of maintaining the key set manually, you might be able to get the keys in the hook method by using `tableCache.getNativeCache().asMap().keySet()`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r631063506



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -284,4 +333,40 @@ private static String getCatalogType(Configuration conf, String catalogName) {
       }
     }
   }
+
+  public static class TableCache {
+    private static final Cache<String, Table> tableCache = Caffeine.newBuilder()
+        .expireAfterAccess(12, TimeUnit.HOURS).build();
+    private static final Set<String> keys = new TreeSet<>();

Review comment:
       Right, I don't know how I missed that. I was looking for a map representation of the cache entries. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on pull request #2261:
URL: https://github.com/apache/hive/pull/2261#issuecomment-849630956


   @lcspinter Can you please replace the Catalogs.loadTable call in the `HiveIcebergStorageHandler#overlayTableProperties` as well?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r630910187



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -284,4 +333,40 @@ private static String getCatalogType(Configuration conf, String catalogName) {
       }
     }
   }
+
+  public static class TableCache {
+    private static final Cache<String, Table> tableCache = Caffeine.newBuilder()
+        .expireAfterAccess(12, TimeUnit.HOURS).build();
+    private static final Set<String> keys = new TreeSet<>();
+
+    public static void removeTable(Configuration conf) {
+      String queryId = conf.get(HiveConf.ConfVars.HIVEQUERYID.varname);
+      if (queryId != null && !queryId.isEmpty()) {
+        Set<String> queryKeys = keys.stream().filter(k -> k.startsWith(queryId)).collect(Collectors.toSet());
+        tableCache.invalidateAll(queryKeys);
+        keys.removeAll(queryKeys);
+      }
+    }
+
+    public static Table getTable(Configuration conf, String catalogName, String tableIdentifier) {
+      String queryId = conf.get(HiveConf.ConfVars.HIVEQUERYID.varname);
+      if (queryId != null  && !queryId.isEmpty()) {

Review comment:
       nit: double spaces after the queryId != null part, here and below

##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergQueryLifeTimeHook.java
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import org.apache.hadoop.hive.ql.hooks.QueryLifeTimeHook;
+import org.apache.hadoop.hive.ql.hooks.QueryLifeTimeHookContext;
+import org.apache.iceberg.mr.Catalogs;
+
+public class HiveIcebergQueryLifeTimeHook implements QueryLifeTimeHook {
+  @Override
+  public void beforeCompile(QueryLifeTimeHookContext ctx) {
+
+  }
+
+  @Override
+  public void afterCompile(QueryLifeTimeHookContext ctx, boolean hasError) {
+  }

Review comment:
       nit: missing new line, like for beforeCompile and afterExecution




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r634355038



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -284,4 +334,42 @@ private static String getCatalogType(Configuration conf, String catalogName) {
       }
     }
   }
+
+  public static class TableCache {
+    private static final Cache<String, Table> tableCache = Caffeine.newBuilder()
+        .expireAfterAccess(12, TimeUnit.HOURS).build();
+    private static final Logger LOG = LoggerFactory.getLogger(TableCache.class);
+
+    public static void removeTables(Configuration conf) {

Review comment:
       Yes, it makes sense. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on pull request #2261:
URL: https://github.com/apache/hive/pull/2261#issuecomment-838467872


   @marton-bod @pvary @szlta Could you please review this PR? Thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636872986



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -87,8 +87,8 @@ public static Table loadTable(Configuration conf) {
    * The table identifier ({@link Catalogs#NAME}) and the catalog name ({@link InputFormatConfig#CATALOG_NAME}),
    * or table path ({@link Catalogs#LOCATION}) should be specified by the controlling properties.
    * <p>
-   * Used by HiveIcebergSerDe and HiveIcebergStorageHandler
-   * @param conf a Hadoop
+   * Used by HiveIcebergSerDe and HiveIcebergStorageHandler.
+   * @param conf a Hadoop configuration

Review comment:
       MAybe we can revert all of the changes in this file?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r630909854



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -284,4 +333,40 @@ private static String getCatalogType(Configuration conf, String catalogName) {
       }
     }
   }
+
+  public static class TableCache {
+    private static final Cache<String, Table> tableCache = Caffeine.newBuilder()
+        .expireAfterAccess(12, TimeUnit.HOURS).build();
+    private static final Set<String> keys = new TreeSet<>();
+
+    public static void removeTable(Configuration conf) {

Review comment:
       Can we rename this to make it clear that it removes _all_ tables for a given query? `removeTables`, `removeTablesForQuery`, or something similar that you'd prefer. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r632407679



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -284,4 +334,42 @@ private static String getCatalogType(Configuration conf, String catalogName) {
       }
     }
   }
+
+  public static class TableCache {
+    private static final Cache<String, Table> tableCache = Caffeine.newBuilder()
+        .expireAfterAccess(12, TimeUnit.HOURS).build();
+    private static final Logger LOG = LoggerFactory.getLogger(TableCache.class);
+
+    public static void removeTables(Configuration conf) {
+      String queryId = conf.get(HiveConf.ConfVars.HIVEQUERYID.varname);
+      if (queryId != null && !queryId.isEmpty()) {
+        Set<String> queryKeys = tableCache.asMap().keySet().stream()
+            .filter(k -> k.startsWith(queryId)).collect(Collectors.toSet());
+        tableCache.invalidateAll(queryKeys);
+      } else {
+        LOG.warn("Query id is not present in config, therefore no Iceberg table object is removed " +
+            "from the table cache.");
+      }
+    }
+
+    public static Table getTable(Configuration conf, String catalogName, String tableIdentifier) {
+      String queryId = conf.get(HiveConf.ConfVars.HIVEQUERYID.varname);

Review comment:
       Do we need the whole `conf` here? Shouldn't we separate out the concerns and use only an `executionId` instead?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636149696



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java
##########
@@ -349,6 +349,20 @@
 
   private final AtomicLong sparkSessionId = new AtomicLong();
 
+  private final Map<String, Object> queryStateMap = new HashMap<>();
+
+  public Object getQueryState(String queryId) {
+    return queryStateMap.get(queryId);
+  }
+
+  public void addQueryState(String queryId, Object queryState) {

Review comment:
       Shouldn't we only accept only QueryState objects here?
   i.e. `public void addQueryState(String queryId, QueryState queryState) {`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r631063850



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -284,4 +333,40 @@ private static String getCatalogType(Configuration conf, String catalogName) {
       }
     }
   }
+
+  public static class TableCache {
+    private static final Cache<String, Table> tableCache = Caffeine.newBuilder()
+        .expireAfterAccess(12, TimeUnit.HOURS).build();
+    private static final Set<String> keys = new TreeSet<>();
+
+    public static void removeTable(Configuration conf) {
+      String queryId = conf.get(HiveConf.ConfVars.HIVEQUERYID.varname);
+      if (queryId != null && !queryId.isEmpty()) {
+        Set<String> queryKeys = keys.stream().filter(k -> k.startsWith(queryId)).collect(Collectors.toSet());
+        tableCache.invalidateAll(queryKeys);
+        keys.removeAll(queryKeys);
+      }

Review comment:
       Correct, added logging.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636177564



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java
##########
@@ -349,6 +350,20 @@
 
   private final AtomicLong sparkSessionId = new AtomicLong();
 
+  private final Map<String, QueryState> queryStateMap = new HashMap<>();
+
+  public Object getQueryState(String queryId) {

Review comment:
       Sure!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r639676968



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/Driver.java
##########
@@ -213,7 +216,12 @@ private void runInternal(String command, boolean alreadyCompiled) throws Command
         releaseResources();
       }
 
+      if (SessionState.get() != null) {
+        // Remove any query state reference from the session state
+        SessionState.get().removeQueryState(getConf().get(HiveConf.ConfVars.HIVEQUERYID.varname));
+      }
       driverState.executionFinishedWithLocking(isFinishedWithError);
+
     }

Review comment:
       nit of the nit for the newlines:
   ```suggestion
         }
   
         if (SessionState.get() != null) {
           // Remove any query state reference from the session state
           SessionState.get().removeQueryState(getConf().get(HiveConf.ConfVars.HIVEQUERYID.varname));
         }
   
         driverState.executionFinishedWithLocking(isFinishedWithError);
       }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r632408645



##########
File path: iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandlerTestUtils.java
##########
@@ -68,6 +68,7 @@ static TestHiveShell shell(Map<String, String> configs) {
     TestHiveShell shell = new TestHiveShell();
     shell.setHiveConfValue("hive.notification.event.poll.interval", "-1");
     shell.setHiveConfValue("hive.tez.exec.print.summary", "true");
+    shell.setHiveConfValue("hive.query.lifetime.hooks", HiveIcebergQueryLifeTimeHook.class.getName());

Review comment:
       Maybe not here, but maybe a single big `hive.iceberg.enable` config should add all the required iceberg configs when the HS2 is started/or a query is executed...




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636154183



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -97,13 +98,15 @@ public static Table loadTable(Configuration conf, Properties props) {
             props.getProperty(InputFormatConfig.CATALOG_NAME));
   }
 
-  private static Table loadTable(Configuration conf, String tableIdentifier, String tableLocation,
-                                 String catalogName) {
+  private static Table loadTable(Configuration conf, String tableIdentifier, String tableLocation, String catalogName) {
     Optional<Catalog> catalog = loadCatalog(conf, catalogName);
 
+    Table cachedTable = null;

Review comment:
       Are these changes still necessary?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r630909697



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -284,4 +333,40 @@ private static String getCatalogType(Configuration conf, String catalogName) {
       }
     }
   }
+
+  public static class TableCache {
+    private static final Cache<String, Table> tableCache = Caffeine.newBuilder()
+        .expireAfterAccess(12, TimeUnit.HOURS).build();
+    private static final Set<String> keys = new TreeSet<>();

Review comment:
       Instead of maintaining the key set manually, you might be able to get the keys in the hook method by using `tableCache.asMap().keySet()`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r631064410



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergQueryLifeTimeHook.java
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import org.apache.hadoop.hive.ql.hooks.QueryLifeTimeHook;
+import org.apache.hadoop.hive.ql.hooks.QueryLifeTimeHookContext;
+import org.apache.iceberg.mr.Catalogs;
+
+public class HiveIcebergQueryLifeTimeHook implements QueryLifeTimeHook {
+  @Override
+  public void beforeCompile(QueryLifeTimeHookContext ctx) {
+
+  }
+
+  @Override
+  public void afterCompile(QueryLifeTimeHookContext ctx, boolean hasError) {
+  }
+
+  @Override
+  public void beforeExecution(QueryLifeTimeHookContext ctx) {
+    Catalogs.TableCache.removeTable(ctx.getHiveConf());

Review comment:
       I added it to `afterCompile` and `afterExecution` as well.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636872733



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -97,13 +97,13 @@ public static Table loadTable(Configuration conf, Properties props) {
             props.getProperty(InputFormatConfig.CATALOG_NAME));
   }
 
-  private static Table loadTable(Configuration conf, String tableIdentifier, String tableLocation,
-                                 String catalogName) {
+  private static Table loadTable(Configuration conf, String tableIdentifier, String tableLocation, String catalogName) {
     Optional<Catalog> catalog = loadCatalog(conf, catalogName);
 
     if (catalog.isPresent()) {
       Preconditions.checkArgument(tableIdentifier != null, "Table identifier not set");
-      return catalog.get().loadTable(TableIdentifier.parse(tableIdentifier));
+      Table table = catalog.get().loadTable(TableIdentifier.parse(tableIdentifier));

Review comment:
       Is this change needed?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636874975



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.Properties;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.ql.QueryState;
+import org.apache.hadoop.hive.ql.session.SessionState;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.mr.Catalogs;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class IcebergTableUtil {
+
+  private static final Logger LOG = LoggerFactory.getLogger(IcebergTableUtil.class);
+
+  private IcebergTableUtil() {
+
+  }
+
+  /**
+   * Load the iceberg table either from the {@link QueryState} or through the configured catalog. Look for the table
+   * object stored in the query state. If it's null, it means the table was not loaded yet within the same query
+   * therefore we claim it through the Catalogs API and then store it in query state.
+   * @param configuration a Hadoop configuration
+   * @param properties controlling properties
+   * @return
+   */
+  static Table getTable(Configuration configuration, Properties properties) {
+    QueryState queryState = SessionState.get()
+        .getQueryState(configuration.get(HiveConf.ConfVars.HIVEQUERYID.varname));
+    Table table = null;
+    String tableIdentifier = properties.getProperty(Catalogs.NAME);
+    if (queryState != null) {
+      table = (Table) queryState.getResource(tableIdentifier);
+    } else {
+      LOG.debug("QueryState is not available in SessionState. Loading {} from configured catalog.", tableIdentifier);
+    }
+    if (table == null) {
+      table = Catalogs.loadTable(configuration, properties);
+      if (queryState != null) {
+        queryState.addResource(tableIdentifier, table);
+      }
+    }
+    return table;

Review comment:
       nit: new line




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r639654144



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/Driver.java
##########
@@ -213,7 +216,14 @@ private void runInternal(String command, boolean alreadyCompiled) throws Command
         releaseResources();
       }
 
+      if (SessionState.get() != null) {
+        // Clean up every resource object stored in the query state
+        driverContext.getQueryState().removeResources();

Review comment:
       I've deleted the `removeResources`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r638513085



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -97,13 +97,13 @@ public static Table loadTable(Configuration conf, Properties props) {
             props.getProperty(InputFormatConfig.CATALOG_NAME));
   }
 
-  private static Table loadTable(Configuration conf, String tableIdentifier, String tableLocation,
-                                 String catalogName) {
+  private static Table loadTable(Configuration conf, String tableIdentifier, String tableLocation, String catalogName) {
     Optional<Catalog> catalog = loadCatalog(conf, catalogName);
 
     if (catalog.isPresent()) {
       Preconditions.checkArgument(tableIdentifier != null, "Table identifier not set");
-      return catalog.get().loadTable(TableIdentifier.parse(tableIdentifier));
+      Table table = catalog.get().loadTable(TableIdentifier.parse(tableIdentifier));

Review comment:
       Reverted the whole file.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r639093469



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/Driver.java
##########
@@ -213,7 +216,14 @@ private void runInternal(String command, boolean alreadyCompiled) throws Command
         releaseResources();
       }
 
+      if (SessionState.get() != null) {
+        // Clean up every resource object stored in the query state
+        driverContext.getQueryState().removeResources();

Review comment:
       I think we need it. We have an instance of `QueryState` stored on the `DriverContext`, and we have the same `QueryState` instance stored on `SessionState` as well. If I remove the `QueryState` just from the `SessionState`, we will end up with some dangling resources on the `QueryState` bound to the `DriverContext`. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636168550



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java
##########
@@ -349,6 +349,20 @@
 
   private final AtomicLong sparkSessionId = new AtomicLong();
 
+  private final Map<String, Object> queryStateMap = new HashMap<>();
+
+  public Object getQueryState(String queryId) {
+    return queryStateMap.get(queryId);
+  }
+
+  public void addQueryState(String queryId, Object queryState) {

Review comment:
       Right. Changed it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636173569



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java
##########
@@ -349,6 +350,20 @@
 
   private final AtomicLong sparkSessionId = new AtomicLong();
 
+  private final Map<String, QueryState> queryStateMap = new HashMap<>();
+
+  public Object getQueryState(String queryId) {

Review comment:
       Can we update the return type too so we don't need any casting?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636168810



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/QueryState.java
##########
@@ -59,6 +60,11 @@
 
   static public final String USERID_TAG = "userid";
 
+  /**
+   * map of tables involved in the query.
+   */
+  private final Map<String, Object> tableMap = new HashMap<>();

Review comment:
       Changed the naming of the map.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r638615584



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -87,8 +87,8 @@ public static Table loadTable(Configuration conf) {
    * The table identifier ({@link Catalogs#NAME}) and the catalog name ({@link InputFormatConfig#CATALOG_NAME}),
    * or table path ({@link Catalogs#LOCATION}) should be specified by the controlling properties.
    * <p>
-   * Used by HiveIcebergSerDe and HiveIcebergStorageHandler
-   * @param conf a Hadoop
+   * Used by HiveIcebergSerDe and HiveIcebergStorageHandler.
+   * @param conf a Hadoop configuration

Review comment:
       MAybe revert this one too?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] lcspinter commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
lcspinter commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r638511804



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.Properties;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.ql.QueryState;
+import org.apache.hadoop.hive.ql.session.SessionState;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.mr.Catalogs;
+
+public class IcebergTableUtil {
+
+  private IcebergTableUtil() {
+
+  }
+
+  /**
+   * Load the iceberg table either from the {@link QueryState} or through the configured catalog.
+   * @param configuration a Hadoop configuration
+   * @param properties controlling properties
+   * @return
+   */
+  static Table getTable(Configuration configuration, Properties properties) {
+    // look for the table object stored in the query state. If it's null, it means the table was not loaded yet
+    // within the same query therefore we claim it through the Catalogs API and then store it in query state.
+    QueryState queryState = SessionState.get()
+        .getQueryState(configuration.get(HiveConf.ConfVars.HIVEQUERYID.varname));
+    Table table = null;
+    if (queryState != null) {

Review comment:
       In this case, wrapping the QueryState in an Optional doesn't add much value.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r634405626



##########
File path: iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##########
@@ -284,4 +334,42 @@ private static String getCatalogType(Configuration conf, String catalogName) {
       }
     }
   }
+
+  public static class TableCache {
+    private static final Cache<String, Table> tableCache = Caffeine.newBuilder()
+        .expireAfterAccess(12, TimeUnit.HOURS).build();
+    private static final Logger LOG = LoggerFactory.getLogger(TableCache.class);
+
+    public static void removeTables(Configuration conf) {
+      String queryId = conf.get(HiveConf.ConfVars.HIVEQUERYID.varname);
+      if (queryId != null && !queryId.isEmpty()) {
+        Set<String> queryKeys = tableCache.asMap().keySet().stream()
+            .filter(k -> k.startsWith(queryId)).collect(Collectors.toSet());
+        tableCache.invalidateAll(queryKeys);
+      } else {
+        LOG.warn("Query id is not present in config, therefore no Iceberg table object is removed " +
+            "from the table cache.");
+      }
+    }
+
+    public static Table getTable(Configuration conf, String catalogName, String tableIdentifier) {
+      String queryId = conf.get(HiveConf.ConfVars.HIVEQUERYID.varname);
+      if (queryId != null  && !queryId.isEmpty()) {
+        return tableCache.getIfPresent(getKey(queryId, catalogName, tableIdentifier));
+      }
+      return null;
+    }
+
+    public static void addTable(Configuration conf, String catalogName, String tableIdentifier, Table table) {

Review comment:
       I was trying to suggest to add such a method to the `TableCache` class. I think it could make the code easier to read. What do you think?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org