Posted to issues-all@impala.apache.org by "Miklos Szurap (JIRA)" <ji...@apache.org> on 2018/09/28 16:06:00 UTC

[jira] [Updated] (IMPALA-7642) Optimize UDF jar handling in Catalog

     [ https://issues.apache.org/jira/browse/IMPALA-7642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Miklos Szurap updated IMPALA-7642:
----------------------------------
    Description: 
1. Optimize UDF jar loading
During startup and global invalidate metadata calls, [CatalogServiceCatalog.loadJavaFunctions()|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java#L956] is called for each database. It calls [extractFunctions()|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/util/FunctionUtils.java#L68] for each function found in HMS, and for each function the related UDF jar file is downloaded from HDFS to the localLibraryPath (file:///tmp). UDFs are often not packaged separately but bundled into a single fat jar, which can be 10-50 MB in size. A database sometimes holds hundreds of functions (usually belonging to the same project) that all point to the same UDF jar. In that case the method downloads the same jar hundreds of times, each time "extracting the function" and deleting the local file.
The suggestion is to improve this by:
- creating a local "cache" in CatalogServiceCatalog.loadJavaFunctions() as a HashMap<String,String> (a map of jarUri -> localJarPath)
- passing this cache to FunctionUtils.extractFunctions, which checks whether the cache already contains the jarUri; if not, it downloads the jar and puts it into the cache (and does everything else needed)
- moving the FileSystemUtil.deleteIfExists(localJarPath) call from extractFunctions to loadJavaFunctions: in a finally block, iterate over the cache values, delete the local files, and clear the cache at the end.
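The three steps above could look roughly like the following. This is only a minimal, self-contained sketch, not the actual Impala code: JarCacheSketch, JarFetcher, localJarFor and the two-argument loadFunctions are hypothetical names, the fetcher stands in for the HDFS-to-local copy, and the cache maps to java.nio.file.Path rather than the String path proposed above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class JarCacheSketch {
  /** Hypothetical stand-in for the HDFS-to-local copy; not an Impala API. */
  interface JarFetcher {
    Path fetch(String jarUri) throws IOException;
  }

  /** Returns the local copy for jarUri, fetching it only on first use. */
  static Path localJarFor(String jarUri, Map<String, Path> cache, JarFetcher fetcher)
      throws IOException {
    Path local = cache.get(jarUri);
    if (local == null) {
      local = fetcher.fetch(jarUri);
      cache.put(jarUri, local);
    }
    return local;
  }

  /** Handles one jar URI per function, then deletes every cached local jar. */
  static void loadFunctions(List<String> jarUriPerFunction, JarFetcher fetcher)
      throws IOException {
    // jarUri -> local jar path, valid only for the duration of this call.
    Map<String, Path> cache = new HashMap<>();
    try {
      for (String jarUri : jarUriPerFunction) {
        Path localJar = localJarFor(jarUri, cache, fetcher);
        // The real code would extract the function from localJar here.
      }
    } finally {
      // Cleanup moved out of the per-function step: one delete per distinct jar.
      for (Path localJar : cache.values()) {
        Files.deleteIfExists(localJar);
      }
      cache.clear();
    }
  }
}
```

With this shape, a database whose functions all point to the same jar triggers a single fetch and a single delete, regardless of how many functions it has.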

2. Use {{Set<String>}} instead of {{List<String>}} for addedSignatures in [FunctionUtils.extractFunctions()|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/util/FunctionUtils.java#L73]:
It only tracks which function signatures have been added, and a Set is sufficient for that purpose.
{noformat}
if (!addedSignatures.contains(fn.signatureString())){noformat}
This would be faster with a HashSet, whose contains method is {{O( 1 )}} on average, compared to {{O( n )}} for an ArrayList.
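
As an illustration of the Set-based check, here is a small standalone sketch. SignatureDedupSketch and uniqueSignatures are hypothetical names, and plain strings stand in for the values returned by fn.signatureString(); this is not the FunctionUtils code itself.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SignatureDedupSketch {
  /** Keeps the first occurrence of each signature, preserving input order. */
  static List<String> uniqueSignatures(List<String> signatures) {
    Set<String> addedSignatures = new HashSet<>();
    List<String> result = new ArrayList<>();
    for (String signature : signatures) {
      // Set.add returns false if the element is already present,
      // so the contains check and the insertion become a single call.
      if (addedSignatures.add(signature)) {
        result.add(signature);
      }
    }
    return result;
  }
}
```

Besides the cheaper lookup, a Set also states the intent (membership tracking) more directly than a List.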


> Optimize UDF jar handling in Catalog
> ------------------------------------
>
>                 Key: IMPALA-7642
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7642
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>    Affects Versions: Impala 3.0
>            Reporter: Miklos Szurap
>            Priority: Major


