Posted to commits@hudi.apache.org by "Selvaraj periyasamy (Jira)" <ji...@apache.org> on 2020/11/01 09:13:00 UTC

[jira] [Commented] (HUDI-1365) Listing leaf files and directories is very Slow

    [ https://issues.apache.org/jira/browse/HUDI-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224237#comment-17224237 ] 

Selvaraj periyasamy commented on HUDI-1365:
-------------------------------------------

Below is the job detail.

!image-2020-11-01-01-11-11-561.png!

> Listing leaf files and directories is very Slow
> -----------------------------------------------
>
>                 Key: HUDI-1365
>                 URL: https://issues.apache.org/jira/browse/HUDI-1365
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Selvaraj periyasamy
>            Priority: Major
>         Attachments: Log.txt, image-2020-11-01-01-11-11-561.png
>
>
> I am using Hudi 0.5.0. I took 0.5.0 and applied the changes to HoodieROTablePathFilter from HUDI-1144. Even though it caches, I am seeing only 46 directories cached in 1 minute. Because of this, my job takes a long time to write, since I have 6 months' worth of hourly partitions.
>  
> // Reuse the cached meta client for this base path, creating it on first use.
> HoodieTableMetaClient metaClient = metaClientCache.get(baseDir.toString());
> if (null == metaClient) {
>   metaClient = new HoodieTableMetaClient(fs.getConf(), baseDir.toString(), true);
>   metaClientCache.put(baseDir.toString(), metaClient);
> }
> // Build a file-system view over this folder only, restricted to completed commits.
> HoodieTableFileSystemView fsView = new HoodieTableFileSystemView(metaClient,
>     metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants(), fs.listStatus(folder));
> List<HoodieDataFile> latestFiles = fsView.getLatestDataFiles().collect(Collectors.toList());
> // Populate the cache with the latest data files found under this folder.
> if (!hoodiePathCache.containsKey(folder.toString())) {
>   hoodiePathCache.put(folder.toString(), new HashSet<>());
> }
> LOG.info("Custom Code : Based on hoodie metadata from base path: " + baseDir.toString() + ", caching " + latestFiles.size()
>     + " files under " + folder);
> for (HoodieDataFile lfile : latestFiles) {
>   hoodiePathCache.get(folder.toString()).add(new Path(lfile.getPath()));
> }
>  
>  
>  
> Sample Logs here. I have attached the log file as well.
>  
> 20/11/01 08:16:00 INFO HoodieTableFileSystemView: Adding file-groups for partition :20200919/08, #FileGroups=2
> 20/11/01 08:16:00 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0
> 20/11/01 08:16:00 INFO HoodieROTablePathFilter: Custom Code : Based on hoodie metadata from base path: hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/08
> 20/11/01 08:16:01 WARN LoadBalancingKMSClientProvider: KMS provider at [http://sl73caehmpc1010.visa.com:9292/kms/v1/] threw an IOException!! java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
> 20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for partition :20200919/09, #FileGroups=2
> 20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0
> 20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on hoodie metadata from base path: hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/09
> 20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for partition :20200919/10, #FileGroups=3
> 20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=10, FileGroupsCreationTime=1, StoreTimeTaken=0
> 20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on hoodie metadata from base path: hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 3 files under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/10
> 20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at [http://sl73caehmpc1009.visa.com:9292/kms/v1/] threw an IOException!! java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
> 20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at [http://sl73caehmpc1010.visa.com:9292/kms/v1/] threw an IOException!! java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
> 20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for partition :20200919/11, #FileGroups=2
> 20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0
> 20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on hoodie metadata from base path: hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/11
> 20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at [http://sl73caehmpc1010.visa.com:9292/kms/v1/] threw an IOException!! java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
> 20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for partition :20200919/12, #FileGroups=3
> 20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=10, FileGroupsCreationTime=1, StoreTimeTaken=0
> 20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on hoodie metadata from base path: hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 3 files under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/12
> 20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for partition :20200919/13, #FileGroups=2
> 20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=7, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on hoodie metadata from base path: hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/13
> 20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at [http://sl73caehmpc1009.visa.com:9292/kms/v1/] threw an IOException!! java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
> 20/11/01 08:16:02 WARN LoadBalancingKMSClientProvider: KMS provider at [http://sl73caehmpc1010.visa.com:9292/kms/v1/] threw an IOException!! java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
> 20/11/01 08:16:02 INFO HoodieTableFileSystemView: Adding file-groups for partition :20200919/14, #FileGroups=2
> 20/11/01 08:16:02 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=7, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/11/01 08:16:02 INFO HoodieROTablePathFilter: Custom Code : Based on hoodie metadata from base path: hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 2 files under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/14
> 20/11/01 08:16:03 WARN LoadBalancingKMSClientProvider: KMS provider at [http://sl73caehmpc1010.visa.com:9292/kms/v1/] threw an IOException!! java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
> 20/11/01 08:16:03 INFO HoodieTableFileSystemView: Adding file-groups for partition :20200919/15, #FileGroups=3
> 20/11/01 08:16:03 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=7, FileGroupsCreationTime=1, StoreTimeTaken=0
> 20/11/01 08:16:03 INFO HoodieROTablePathFilter: Custom Code : Based on hoodie metadata from base path: hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr, caching 3 files under hdfs://nameservice1/projects/cdp/data/cdp_reporting/trr/20200919/15
> 20/11/01 08:16:03 INFO HoodieTableFileSystemView: Adding file-groups for partition :20200919/16, #FileGroups=2
> 20/11/01 08:16:03 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=7, FileGroupsCreationTime=0, StoreTimeTaken=0
>  
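
As a rough back-of-the-envelope check on the figures in the description, the sketch below estimates how long it would take to cache every partition at the reported rate of 46 directories per minute. The class and method names are hypothetical, and the 183-day figure is only an assumption standing in for "6 months"; the actual partition count and rate in the job may differ.

```java
// Hypothetical helper, not part of Hudi: estimates total listing time
// from the numbers quoted in the issue description.
class ListingEstimate {

    // Number of hourly partitions for a given number of days.
    static int totalPartitions(int days, int partitionsPerDay) {
        return days * partitionsPerDay;
    }

    // Minutes needed to cache all partitions at a constant rate.
    static double estimateMinutes(int partitions, double cachedPerMinute) {
        return partitions / cachedPerMinute;
    }

    public static void main(String[] args) {
        int days = 183;               // assumption: "6 months" taken as 183 days
        int partitionsPerDay = 24;    // hourly partitions
        double cachedPerMinute = 46;  // rate reported in the description

        int partitions = totalPartitions(days, partitionsPerDay);
        double minutes = estimateMinutes(partitions, cachedPerMinute);
        System.out.printf("partitions=%d, estimated minutes=%.1f%n", partitions, minutes);
    }
}
```

Under these assumptions the listing alone would take on the order of an hour and a half, which is consistent with the write job being reported as very slow.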


