Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/03/04 08:17:57 UTC

[GitHub] [iceberg] chenjunjiedada opened a new pull request #2295: Spark: expire catalog cache to avoid hive connection overflow

chenjunjiedada opened a new pull request #2295:
URL: https://github.com/apache/iceberg/pull/2295


   A long-running service may create many Spark sessions that access Iceberg tables. The catalog cache can then grow without bound, exceed the Hive connection limit, and affect subsequent requests. This change adds an expiration strategy to the catalog cache to avoid that issue.
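   A minimal sketch of the approach, assuming Caffeine's expireAfterAccess (the key and value types below are simplified stand-ins; the actual cache in CustomCatalogs is keyed by Pair<SparkSession, String> and holds Catalog instances):

       import java.util.concurrent.TimeUnit;

       import com.github.benmanes.caffeine.cache.Cache;
       import com.github.benmanes.caffeine.cache.Caffeine;

       // Sketch only: a catalog cache whose entries expire after a period of
       // inactivity, so idle catalogs (and any Hive connections they hold) can
       // eventually be released instead of accumulating for the life of the JVM.
       public final class ExpiringCatalogCacheSketch {

         private static final Cache<String, Object> CATALOG_CACHE = Caffeine.newBuilder()
             .expireAfterAccess(10, TimeUnit.MINUTES)   // drop entries not read for 10 minutes
             .removalListener((key, catalog, cause) -> {
               // Hypothetical cleanup hook: close the evicted catalog's clients here
               // if the catalog implementation holds external connections.
             })
             .build();

         private ExpiringCatalogCacheSketch() {
         }

         public static Object loadCatalog(String name) {
           // get(key, mappingFunction) builds the catalog once per key while it stays cached.
           return CATALOG_CACHE.get(name, key -> new Object() /* build the catalog for this key */);
         }
       }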



[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #2295: Spark: expire catalog cache to avoid hive connection overflow

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #2295:
URL: https://github.com/apache/iceberg/pull/2295#discussion_r587977029



##########
File path: spark2/src/main/java/org/apache/iceberg/spark/source/CustomCatalogs.java
##########
@@ -40,6 +43,12 @@
 
 public final class CustomCatalogs {
   private static final Cache<Pair<SparkSession, String>, Catalog> CATALOG_CACHE = Caffeine.newBuilder()
+      .expireAfterAccess(10, TimeUnit.MINUTES)

Review comment:
       I am aware of #1674. This is slightly different from #1674: here the catalog is bound to the Spark session. I'm wondering whether this could be made configurable so that users can choose an expiration strategy for their scenario, though it is neither a table property nor a Spark session property.
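       For illustration only, one hypothetical shape for such a knob, read here from the Spark conf for lack of a better home; the property name and default below are invented and not part of this PR:

           import java.util.concurrent.TimeUnit;

           import org.apache.spark.sql.SparkSession;

           import com.github.benmanes.caffeine.cache.Cache;
           import com.github.benmanes.caffeine.cache.Caffeine;

           // Sketch only: the expiration comes from a hypothetical session conf key
           // ("spark.sql.iceberg.catalog-cache.expiration-minutes" is made up here),
           // falling back to 10 minutes when it is not set.
           final class ConfigurableExpirationSketch {

             private ConfigurableExpirationSketch() {
             }

             static Cache<String, Object> buildCache(SparkSession spark) {
               long minutes = Long.parseLong(
                   spark.conf().get("spark.sql.iceberg.catalog-cache.expiration-minutes", "10"));
               return Caffeine.newBuilder()
                   .expireAfterAccess(minutes, TimeUnit.MINUTES)
                   .build();
             }
           }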






[GitHub] [iceberg] chenjunjiedada commented on pull request #2295: Spark: expire catalog cache to avoid hive connection overflow

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on pull request #2295:
URL: https://github.com/apache/iceberg/pull/2295#issuecomment-800205267


   #2325 should be a better solution.






[GitHub] [iceberg] rymurr commented on a change in pull request #2295: Spark: expire catalog cache to avoid hive connection overflow

Posted by GitBox <gi...@apache.org>.
rymurr commented on a change in pull request #2295:
URL: https://github.com/apache/iceberg/pull/2295#discussion_r595051241



##########
File path: spark2/src/main/java/org/apache/iceberg/spark/source/CustomCatalogs.java
##########
@@ -40,6 +43,12 @@
 
 public final class CustomCatalogs {
   private static final Cache<Pair<SparkSession, String>, Catalog> CATALOG_CACHE = Caffeine.newBuilder()
+      .expireAfterAccess(10, TimeUnit.MINUTES)

Review comment:
       I agree; I think timed expiration is potentially worse here. If we expire by time, we will be evicting a catalog that is definitely still being used by a `SparkSession`, whereas before it was _maybe_ being used.
   
   I think that #2325 may actually be the right solution. It moves to a cross-catalog pool and handles reclaiming client connections regardless of the underlying catalog. Any catalogs in this cache would then have 0 connections attached to them and could stay in the cache forever. @lcspinter do you agree?
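   A rough sketch of the pooling idea being described (not the actual #2325 implementation): a catalog borrows a client from a shared pool only for the duration of a call and returns it afterwards, so a cached catalog holds no open connections while idle:

       import java.util.concurrent.ArrayBlockingQueue;
       import java.util.concurrent.BlockingQueue;
       import java.util.function.Function;
       import java.util.function.Supplier;

       // Sketch only: a tiny bounded client pool. A catalog checks a client out
       // for a single metastore call and always checks it back in, so the number
       // of live connections is capped by the pool size regardless of how many
       // catalogs sit in the cache.
       final class ClientPoolSketch<C> {
         private final BlockingQueue<C> clients;

         ClientPoolSketch(int size, Supplier<C> clientFactory) {
           this.clients = new ArrayBlockingQueue<>(size);
           for (int i = 0; i < size; i++) {
             clients.add(clientFactory.get());
           }
         }

         <R> R run(Function<C, R> action) throws InterruptedException {
           C client = clients.take();       // block until a connection is free
           try {
             return action.apply(client);   // perform the call with the borrowed client
           } finally {
             clients.put(client);           // always return the connection to the pool
           }
         }
       }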






[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #2295: Spark: expire catalog cache to avoid hive connection overflow

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #2295:
URL: https://github.com/apache/iceberg/pull/2295#discussion_r595107288



##########
File path: spark2/src/main/java/org/apache/iceberg/spark/source/CustomCatalogs.java
##########
@@ -40,6 +43,12 @@
 
 public final class CustomCatalogs {
   private static final Cache<Pair<SparkSession, String>, Catalog> CATALOG_CACHE = Caffeine.newBuilder()
+      .expireAfterAccess(10, TimeUnit.MINUTES)

Review comment:
       Great! I checked the PR and it looks very promising! Thanks @RussellSpitzer @aokolnychyi @rymurr!






[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2295: Spark: expire catalog cache to avoid hive connection overflow

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2295:
URL: https://github.com/apache/iceberg/pull/2295#discussion_r594813809



##########
File path: spark2/src/main/java/org/apache/iceberg/spark/source/CustomCatalogs.java
##########
@@ -40,6 +43,12 @@
 
 public final class CustomCatalogs {
   private static final Cache<Pair<SparkSession, String>, Catalog> CATALOG_CACHE = Caffeine.newBuilder()
+      .expireAfterAccess(10, TimeUnit.MINUTES)

Review comment:
       While it is bound to `SparkSession`, it seems that introducing a cache timeout can still cause the issues Russell mentioned above.
   
   cc @rymurr @rdblue who developed this code 






[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2295: Spark: expire catalog cache to avoid hive connection overflow

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2295:
URL: https://github.com/apache/iceberg/pull/2295#discussion_r587602930



##########
File path: spark2/src/main/java/org/apache/iceberg/spark/source/CustomCatalogs.java
##########
@@ -40,6 +43,12 @@
 
 public final class CustomCatalogs {
   private static final Cache<Pair<SparkSession, String>, Catalog> CATALOG_CACHE = Caffeine.newBuilder()
+      .expireAfterAccess(10, TimeUnit.MINUTES)

Review comment:
       I'm worried this will cause the same issues we had in https://github.com/apache/iceberg/pull/1674 






[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #2295: Spark: expire catalog cache to avoid hive connection overflow

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #2295:
URL: https://github.com/apache/iceberg/pull/2295#discussion_r587252717



##########
File path: spark2/src/main/java/org/apache/iceberg/spark/source/CustomCatalogs.java
##########
@@ -40,6 +43,12 @@
 
 public final class CustomCatalogs {
   private static final Cache<Pair<SparkSession, String>, Catalog> CATALOG_CACHE = Caffeine.newBuilder()
+      .expireAfterAccess(10, TimeUnit.MINUTES)

Review comment:
       Should this be configurable?






[GitHub] [iceberg] chenjunjiedada closed pull request #2295: Spark: expire catalog cache to avoid hive connection overflow

Posted by GitBox <gi...@apache.org>.
chenjunjiedada closed pull request #2295:
URL: https://github.com/apache/iceberg/pull/2295


   

