Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/11/06 00:24:00 UTC

[jira] [Commented] (IMPALA-10147) Avoid getting a file handle for data cache hits

    [ https://issues.apache.org/jira/browse/IMPALA-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629399#comment-17629399 ] 

ASF subversion and git services commented on IMPALA-10147:
----------------------------------------------------------

Commit d1d4f183da069b967f7120acfc040e3f6a3598a1 in impala's branch refs/heads/master from Michael Smith
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=d1d4f183d ]

IMPALA-11704: Delay hdfsOpenFile with data cache

Delays hdfsOpenFile until after data cache lookup if using a data cache.
IMPALA-10147 implemented this, but only when using the file handle
cache. This patch adds an additional check in case file handle caching
is disabled.

In networked environments, hdfsOpenFile can take significant time, as
observed in a TPC-DS run of q90 where TotalRawHdfsOpenFileTime
represented a majority of time spent for HDFS_SCAN_NODE. This patch
brings that time to 0 with a primed data cache.

Change-Id: I9429a41fb16de27ccb57730203f95559df0dbfb6
Reviewed-on: http://gerrit.cloudera.org:8080/19204
Reviewed-by: Joe McDonnell <jo...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Avoid getting a file handle for data cache hits
> -----------------------------------------------
>
>                 Key: IMPALA-10147
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10147
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 4.0.0
>            Reporter: Joe McDonnell
>            Assignee: Riza Suminto
>            Priority: Critical
>             Fix For: Impala 4.0.0
>
>
> When reading from the data cache, the DiskIo thread first gets a file handle, then checks the data cache for a hit. On a cache hit, the file handle is never used; it is needed only on a cache miss. Holding a file handle for cache hits serves no purpose and adds overhead to every hit.
> For platforms that do not have the file handle cache, this overhead can be significant.
> We should only open the file handle after we have checked the data cache and know that we need to read from regular storage.
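
The ordering proposed above can be sketched as follows. This is a minimal illustration in Python, not Impala's actual C++ backend code: read_range, CountingOpener, and the dict used as a data cache are hypothetical names invented for this sketch, and io.BytesIO stands in for a remote file whose open call (hdfsOpenFile in the real system) is assumed to be expensive.

```python
import io
from typing import BinaryIO, Callable, Dict, Tuple

# Hypothetical cache key: (path, offset, length) of the byte range.
CacheKey = Tuple[str, int, int]


def read_range(path: str, offset: int, length: int,
               data_cache: Dict[CacheKey, bytes],
               open_file: Callable[[str], BinaryIO]) -> bytes:
    """Check the data cache first; open the file only on a miss."""
    key = (path, offset, length)
    cached = data_cache.get(key)
    if cached is not None:
        # Cache hit: no file handle is ever requested.
        return cached
    # Cache miss: only now pay the cost of opening the file.
    with open_file(path) as handle:
        handle.seek(offset)
        data = handle.read(length)
    data_cache[key] = data
    return data


class CountingOpener:
    """Stand-in for an expensive open call; counts how often it runs."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self.files = files
        self.open_calls = 0

    def __call__(self, path: str) -> BinaryIO:
        self.open_calls += 1
        return io.BytesIO(self.files[path])


if __name__ == "__main__":
    opener = CountingOpener({"/warehouse/t/f0": b"0123456789"})
    cache: Dict[CacheKey, bytes] = {}
    read_range("/warehouse/t/f0", 2, 4, cache, opener)  # miss: opens the file
    read_range("/warehouse/t/f0", 2, 4, cache, opener)  # hit: no open
    print("open calls:", opener.open_calls)
```

In this sketch the second read of the same range is served from the cache, so the opener's call count does not increase. That mirrors the behavior the commit message describes for a primed data cache.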


