Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2020/09/16 16:25:58 UTC

[GitHub] [hive] sam-an-cloudera opened a new pull request #1503: HIVE-24169: HiveServer2 UDF cache

sam-an-cloudera opened a new pull request #1503:
URL: https://github.com/apache/hive/pull/1503


   
   ### What changes were proposed in this pull request?
   This feature adds a HiveServer2-level UDF cache that can be shared across sessions.
   
   
   
   ### Why are the changes needed?
   Without this feature, on an S3-based system, the UDF resources are localized on every `DESCRIBE FUNCTION` and `SELECT udf()` call; downloading 300 MB takes about 2 minutes each time.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Manual testing.
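The idea described above can be sketched as a server-wide map from resource URI to localized file path, so the download happens at most once per HiveServer2 instance rather than once per session. This is an illustrative sketch only; the class and method names (`UdfResourceCache`, `getOrDownload`, `invalidate`) are hypothetical, not the names used in the patch.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical sketch of a server-wide UDF resource cache shared across sessions.
public class UdfResourceCache {
    // Maps a source resource URI (e.g. an S3 location) to the local path it was downloaded to.
    private final ConcurrentMap<String, String> localPathByUri = new ConcurrentHashMap<>();

    /** Returns the cached local path for the resource, invoking the downloader only on a miss. */
    public String getOrDownload(String sourceUri, Function<String, String> downloader) {
        return localPathByUri.computeIfAbsent(sourceUri, downloader);
    }

    /** Drops the cache entry, e.g. when a function is unregistered; returns the evicted path or null. */
    public String invalidate(String sourceUri) {
        return localPathByUri.remove(sourceUri);
    }
}
```

With this shape, a second session asking for the same UDF jar hits the map instead of re-downloading from S3.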
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] github-actions[bot] commented on pull request #1503: HIVE-24169: HiveServer2 UDF cache

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #1503:
URL: https://github.com/apache/hive/pull/1503#issuecomment-762545197


   This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.




[GitHub] [hive] saihemanth-cloudera commented on pull request #1503: HIVE-24169: HiveServer2 UDF cache

Posted by GitBox <gi...@apache.org>.
saihemanth-cloudera commented on pull request #1503:
URL: https://github.com/apache/hive/pull/1503#issuecomment-714803938


   > Can HDFS be used as a cache? In cases where multiple HS2 instances are running, this will only cache the UDF on one of them. If the user connects to another HS2, the UDF needs to be downloaded and cached again.
   
   We don't have HDFS in our use case. We have on-prem S3 storage and an EFS file system in HS2. Can you create a separate JIRA so that we can track the proposed optimization there?




[GitHub] [hive] github-actions[bot] closed pull request #1503: HIVE-24169: HiveServer2 UDF cache

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #1503:
URL: https://github.com/apache/hive/pull/1503


   




[GitHub] [hive] adesh-rao commented on pull request #1503: HIVE-24169: HiveServer2 UDF cache

Posted by GitBox <gi...@apache.org>.
adesh-rao commented on pull request #1503:
URL: https://github.com/apache/hive/pull/1503#issuecomment-700417533


   Can HDFS be used as a cache? In cases where multiple HS2 instances are running, this will only cache the UDF on one of them. If the user connects to another HS2, the UDF needs to be downloaded and cached again.




[GitHub] [hive] saihemanth-cloudera commented on a change in pull request #1503: HIVE-24169: HiveServer2 UDF cache

Posted by GitBox <gi...@apache.org>.
saihemanth-cloudera commented on a change in pull request #1503:
URL: https://github.com/apache/hive/pull/1503#discussion_r527035484



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/util/ResourceDownloader.java
##########
@@ -71,42 +87,61 @@ public static boolean isFileUri(String value) {
     }
   }
 
-  public List<URI> resolveAndDownload(String source, boolean convertToUnix)
+  public List<URI> resolveAndDownload(String source, boolean convertToUnix, boolean useCache)
       throws URISyntaxException, IOException {
-    return resolveAndDownloadInternal(createURI(source), null, convertToUnix, true);
+    return resolveAndDownloadInternal(createURI(source), null, convertToUnix, true, useCache);
   }
 
   public List<URI> downloadExternal(URI source, String subDir, boolean convertToUnix)
       throws URISyntaxException, IOException {
-    return resolveAndDownloadInternal(source, subDir, convertToUnix, false);
+    return resolveAndDownloadInternal(source, subDir, convertToUnix, false, false);
+  }
+  public List<URI> downloadExternal(URI source, String subDir, boolean convertToUnix, boolean useCache)
+      throws URISyntaxException, IOException {
+    return resolveAndDownloadInternal(source, subDir, convertToUnix, false, useCache);
   }
 
   private List<URI> resolveAndDownloadInternal(URI source, String subDir,
-      boolean convertToUnix, boolean isLocalAllowed) throws URISyntaxException, IOException {
+      boolean convertToUnix, boolean isLocalAllowed, boolean useCache) throws URISyntaxException, IOException {
     switch (getURLType(source)) {
     case FILE: return isLocalAllowed ? Collections.singletonList(source) : null;
     case IVY: return dependencyResolver.downloadDependencies(source);
     case HDFS:
     case OTHER:
-      return Collections.singletonList(createURI(downloadResource(source, subDir, convertToUnix)));
+      return Collections.singletonList(createURI(downloadResource(source, subDir, convertToUnix, useCache)));

Review comment:
       Caching for HDFS is a more general optimization than what we are working on here. It can be tracked in a separate JIRA.






[GitHub] [hive] vihangk1 commented on a change in pull request #1503: HIVE-24169: HiveServer2 UDF cache

Posted by GitBox <gi...@apache.org>.
vihangk1 commented on a change in pull request #1503:
URL: https://github.com/apache/hive/pull/1503#discussion_r516198595



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java
##########
@@ -448,8 +450,10 @@ public SessionState(HiveConf conf, String userName) {
         parentLoader, Collections.emptyList(), true);
     final ClassLoader currentLoader = AccessController.doPrivileged(addAction);
     this.sessionConf.setClassLoader(currentLoader);
+    Map<String, String> udfCacheMap = getUDFCacheMap();

Review comment:
       The map is already accessible to this class; there is no need to use a getter here.

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/Registry.java
##########
@@ -583,7 +583,13 @@ public void unregisterFunction(String functionName) throws HiveException {
         }
         mFunctions.remove(functionName);
         fi.discarded();
+        FunctionResource[] resources = fi.getResources();
         if (fi.isPersistent()) {
+          Map<String, String> udfCacheMap = SessionState.getUDFCacheMap();
+          for(FunctionResource fr : resources){
+            //remove from udf cache if it's saved.
+            udfCacheMap.remove(fr.getResourceURI());

Review comment:
       Do we need to remove the downloaded file here too?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/util/ResourceDownloader.java
##########
@@ -71,42 +87,61 @@ public static boolean isFileUri(String value) {
     }
   }
 
-  public List<URI> resolveAndDownload(String source, boolean convertToUnix)
+  public List<URI> resolveAndDownload(String source, boolean convertToUnix, boolean useCache)
       throws URISyntaxException, IOException {
-    return resolveAndDownloadInternal(createURI(source), null, convertToUnix, true);
+    return resolveAndDownloadInternal(createURI(source), null, convertToUnix, true, useCache);
   }
 
   public List<URI> downloadExternal(URI source, String subDir, boolean convertToUnix)
       throws URISyntaxException, IOException {
-    return resolveAndDownloadInternal(source, subDir, convertToUnix, false);
+    return resolveAndDownloadInternal(source, subDir, convertToUnix, false, false);
+  }
+  public List<URI> downloadExternal(URI source, String subDir, boolean convertToUnix, boolean useCache)
+      throws URISyntaxException, IOException {
+    return resolveAndDownloadInternal(source, subDir, convertToUnix, false, useCache);
   }
 
   private List<URI> resolveAndDownloadInternal(URI source, String subDir,
-      boolean convertToUnix, boolean isLocalAllowed) throws URISyntaxException, IOException {
+      boolean convertToUnix, boolean isLocalAllowed, boolean useCache) throws URISyntaxException, IOException {
     switch (getURLType(source)) {
     case FILE: return isLocalAllowed ? Collections.singletonList(source) : null;
     case IVY: return dependencyResolver.downloadDependencies(source);
     case HDFS:
     case OTHER:
-      return Collections.singletonList(createURI(downloadResource(source, subDir, convertToUnix)));
+      return Collections.singletonList(createURI(downloadResource(source, subDir, convertToUnix, useCache)));

Review comment:
       Looks like we are going to use the cache for HDFS source URLs too. Do we need to do that for HDFS as well? Maybe use the cache only when the source is a remote file system.
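A minimal sketch of the policy suggested here, assuming the decision can be made from the URI scheme alone. The `CachePolicy.shouldUseCache` helper and the exact scheme list are hypothetical, not part of the patch:

```java
import java.net.URI;

// Hypothetical helper for the reviewer's suggestion: only consult the cache when
// the source lives on a remote object store, not local files or HDFS.
public class CachePolicy {
    /** Returns true when caching the localized copy is worthwhile for this source. */
    public static boolean shouldUseCache(URI source, boolean cacheEnabled) {
        if (!cacheEnabled) {
            return false;
        }
        String scheme = source.getScheme();
        if (scheme == null || scheme.equals("file") || scheme.equals("hdfs")) {
            // Local files need no copy; HDFS localization is assumed comparatively cheap.
            return false;
        }
        // s3a, abfs, and other remote stores benefit from caching the download.
        return true;
    }
}
```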

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java
##########
@@ -491,6 +495,16 @@ public void resetThreadName() {
       Thread.currentThread().setName(names[names.length - 1].trim());
     }
   }
+  public static Map<String, String> getUDFCacheMap(){
+    return udfLocalResource;
+  }
+
+  public synchronized static File getUdfFileDir(){
+    if(udfFileDir == null){

Review comment:
       Formatting is off here: missing spaces before the braces and after the keywords (e.g. `if(udfFileDir == null){`).

##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -3552,6 +3552,8 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "The parent node in ZooKeeper used by HiveServer2 when supporting dynamic service discovery."),
     HIVE_SERVER2_ZOOKEEPER_PUBLISH_CONFIGS("hive.server2.zookeeper.publish.configs", true,
         "Whether we should publish HiveServer2's configs to ZooKeeper."),
+    HIVE_SERVER2_UDF_CACHE_ENABLED("hive.server2.udf.cache.enabled", false,
+        "Whether HiveServer2 UDF cache is enabled. Disabled by default."),

Review comment:
       I think it would be good to add more detailed information here. The point of having a description field is to provide more detail than the name itself :)
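One possible wording along the lines requested, shown purely as an illustration of a more detailed description (not the final text of the patch):

```java
HIVE_SERVER2_UDF_CACHE_ENABLED("hive.server2.udf.cache.enabled", false,
    "Whether HiveServer2 caches localized UDF resources (e.g. jars downloaded from "
    + "remote storage such as S3) and shares them across sessions, so each resource "
    + "is downloaded once per server rather than once per session. Disabled by default."),
```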






[GitHub] [hive] saihemanth-cloudera commented on a change in pull request #1503: HIVE-24169: HiveServer2 UDF cache

Posted by GitBox <gi...@apache.org>.
saihemanth-cloudera commented on a change in pull request #1503:
URL: https://github.com/apache/hive/pull/1503#discussion_r527088402



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java
##########
@@ -448,8 +450,10 @@ public SessionState(HiveConf conf, String userName) {
         parentLoader, Collections.emptyList(), true);
     final ClassLoader currentLoader = AccessController.doPrivileged(addAction);
     this.sessionConf.setClassLoader(currentLoader);
+    Map<String, String> udfCacheMap = getUDFCacheMap();

Review comment:
       Addressed in https://github.com/apache/hive/pull/1690/






[GitHub] [hive] saihemanth-cloudera commented on a change in pull request #1503: HIVE-24169: HiveServer2 UDF cache

Posted by GitBox <gi...@apache.org>.
saihemanth-cloudera commented on a change in pull request #1503:
URL: https://github.com/apache/hive/pull/1503#discussion_r527088254



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java
##########
@@ -491,6 +495,16 @@ public void resetThreadName() {
       Thread.currentThread().setName(names[names.length - 1].trim());
     }
   }
+  public static Map<String, String> getUDFCacheMap(){
+    return udfLocalResource;
+  }
+
+  public synchronized static File getUdfFileDir(){
+    if(udfFileDir == null){

Review comment:
       Addressed in https://github.com/apache/hive/pull/1690/






[GitHub] [hive] saihemanth-cloudera commented on a change in pull request #1503: HIVE-24169: HiveServer2 UDF cache

Posted by GitBox <gi...@apache.org>.
saihemanth-cloudera commented on a change in pull request #1503:
URL: https://github.com/apache/hive/pull/1503#discussion_r527035374



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/Registry.java
##########
@@ -583,7 +583,13 @@ public void unregisterFunction(String functionName) throws HiveException {
         }
         mFunctions.remove(functionName);
         fi.discarded();
+        FunctionResource[] resources = fi.getResources();
         if (fi.isPersistent()) {
+          Map<String, String> udfCacheMap = SessionState.getUDFCacheMap();
+          for(FunctionResource fr : resources){
+            //remove from udf cache if it's saved.
+            udfCacheMap.remove(fr.getResourceURI());

Review comment:
       Yeah, we need to clear the downloaded files when unregistering a function.
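A hedged sketch of what that cleanup could look like: evict the map entry and best-effort delete the localized file. The `UdfCacheCleanup` class and its method names are hypothetical, not the code in the PR:

```java
import java.io.File;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the cleanup agreed on in this thread: when a persistent
// function is unregistered, drop the cache entry and delete the downloaded copy.
public class UdfCacheCleanup {
    private final Map<String, String> udfCacheMap = new ConcurrentHashMap<>();

    public void put(String resourceUri, String localPath) {
        udfCacheMap.put(resourceUri, localPath);
    }

    /** Removes the cache entry and best-effort deletes the localized file on disk. */
    public boolean evictAndDelete(String resourceUri) {
        String localPath = udfCacheMap.remove(resourceUri);
        if (localPath == null) {
            return false; // nothing cached for this resource
        }
        File localFile = new File(localPath);
        // Treat an already-missing file as successfully cleaned up.
        return !localFile.exists() || localFile.delete();
    }
}
```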



