Posted to issues-all@impala.apache.org by "Alex Rodoni (JIRA)" <ji...@apache.org> on 2018/08/30 18:27:00 UTC

[jira] [Updated] (IMPALA-4172) Switch from using getFileBlockLocations to BlockLocation methods (Potential 50% speedup in metadata loading)

     [ https://issues.apache.org/jira/browse/IMPALA-4172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Rodoni updated IMPALA-4172:
--------------------------------
    Docs Text:   (was: Improves the performance of block metadata fetching by the Catalog server from the Namenode by substantially reducing the number of RPCs.)

> Switch from using getFileBlockLocations to BlockLocation methods (Potential 50% speedup in metadata loading)
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-4172
>                 URL: https://issues.apache.org/jira/browse/IMPALA-4172
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>    Affects Versions: Impala 2.8.0
>            Reporter: Mostafa Mokhtar
>            Assignee: bharath v
>            Priority: Critical
>              Labels: performance, ramp-up
>             Fix For: Impala 2.8.0
>
>         Attachments: query_after_invalidate_store_sales_800Kfiles_test.jfr
>
>
> HDFS-8895 removes the ability to query volume IDs from datanodes. This information has instead been added to BlockLocation, which is accessible via various FileSystem APIs (namely, anything that returns LocatedFileStatus).
> This new API is more efficient and more accurate. It's also available from CDH5.5 onwards, so the change can be backported as well.
> getFileBlockLocations is a bottleneck during metadata loading for Impala.
> {code}
> Stack Trace	Sample Count	Percentage(%)
> java.lang.Thread.run()	17,837	73.758
>    java.util.concurrent.ThreadPoolExecutor$Worker.run()	17,837	73.758
>       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)	17,837	73.758
>          java.util.concurrent.FutureTask.run()	17,600	72.778
>             com.cloudera.impala.catalog.TableLoadingMgr$2.call()	17,513	72.419
>                com.cloudera.impala.catalog.TableLoadingMgr$2.call()	17,513	72.419
>                   com.cloudera.impala.catalog.TableLoader.load(Db, String)	17,513	72.419
>                      com.cloudera.impala.catalog.HdfsTable.load(boolean, IMetaStoreClient, Table)	17,513	72.419
>                         com.cloudera.impala.catalog.HdfsTable.load(boolean, IMetaStoreClient, Table, boolean, boolean, Set)	17,513	72.419
>                            com.cloudera.impala.catalog.HdfsTable.loadAllPartitions(List, Table)	15,721	65.008
>                               com.cloudera.impala.catalog.HdfsTable.createPartition(StorageDescriptor, Partition, Map)	13,611	56.283
>                                  com.cloudera.impala.catalog.HdfsTable.updatePartitionFds(Path, boolean, HdfsFileFormat, Map)	7,942	32.841
>                                     com.cloudera.impala.catalog.HdfsTable.loadBlockMetadata(FileSystem, FileStatus, HdfsPartition$FileDescriptor, HdfsFileFormat, Map)	4,319	17.86
>                                        org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(FileStatus, long, long)	3,678	15.209
>                                        com.cloudera.impala.catalog.HdfsPartition$BlockReplica.parseLocation(String)	203	0.839
> {code}
> Pointer to the Javadoc for the new API:
> [https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/fs/FileSystem.html#listFiles(org.apache.hadoop.fs.Path, boolean)]
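The difference between the two access patterns can be illustrated with a small model. This is only a sketch, not the Impala patch: FakeNamenode, its methods, and the page size are hypothetical stand-ins that count round trips, and no Hadoop classes are used. The per-file path mimics listing a directory and then calling getFileBlockLocations once per file; the located path mimics FileSystem.listFiles(Path, boolean), whose LocatedFileStatus entries already carry their BlockLocation data. The sketch assumes the located listing is served in pages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BlockMetadataRpcSketch {

    /** Hypothetical in-memory stand-in for a Namenode; it only counts calls. */
    static final class FakeNamenode {
        private final List<String> paths = new ArrayList<>();
        private final Map<String, List<String>> hostsByPath = new HashMap<>();
        int rpcCount = 0;

        void addFile(String path, List<String> hosts) {
            paths.add(path);
            hostsByPath.put(path, hosts);
        }

        /** Models a plain directory listing: names only, one call. */
        List<String> listPaths() {
            rpcCount++;
            return new ArrayList<>(paths);
        }

        /** Models a per-file block location lookup: one call per file. */
        List<String> blockHosts(String path) {
            rpcCount++;
            return hostsByPath.get(path);
        }

        /** Models one page of a located listing: names plus block hosts. */
        Map<String, List<String>> listLocatedPage(int start, int pageSize) {
            rpcCount++;
            Map<String, List<String>> page = new LinkedHashMap<>();
            int end = Math.min(start + pageSize, paths.size());
            for (int i = start; i < end; i++) {
                page.put(paths.get(i), hostsByPath.get(paths.get(i)));
            }
            return page;
        }
    }

    /** Old pattern: list the directory, then ask for locations file by file. */
    static Map<String, List<String>> loadPerFile(FakeNamenode nn) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String path : nn.listPaths()) {
            result.put(path, nn.blockHosts(path));
        }
        return result;
    }

    /** New pattern: locations arrive with the listing, one call per page. */
    static Map<String, List<String>> loadLocated(FakeNamenode nn, int pageSize) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        int start = 0;
        while (true) {
            Map<String, List<String>> page = nn.listLocatedPage(start, pageSize);
            result.putAll(page);
            start += page.size();
            if (page.size() < pageSize) {
                break; // a short page means the listing is exhausted
            }
        }
        return result;
    }

    public static void main(String[] args) {
        FakeNamenode nn = new FakeNamenode();
        for (int i = 0; i < 2500; i++) {
            nn.addFile("/tbl/part/file" + i, Arrays.asList("host" + (i % 3)));
        }

        loadPerFile(nn);
        System.out.println("per-file RPCs: " + nn.rpcCount);

        nn.rpcCount = 0;
        loadLocated(nn, 1000); // page size of 1000 is an assumption of this model
        System.out.println("located RPCs: " + nn.rpcCount);
    }
}
```

In this model both loaders return the same path-to-hosts mapping, but the per-file loader's call count grows with the number of files while the located loader's grows with the number of listing pages. The absolute numbers depend entirely on the assumed page size and say nothing about wall-clock time on a real cluster.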


