Posted to issues-all@impala.apache.org by "Sahil Takiar (Jira)" <ji...@apache.org> on 2020/04/06 03:42:00 UTC

[jira] [Created] (IMPALA-9606) ABFS reads should use hdfsPreadFully

Sahil Takiar created IMPALA-9606:
------------------------------------

Summary: ABFS reads should use hdfsPreadFully
Key: IMPALA-9606
URL: https://issues.apache.org/jira/browse/IMPALA-9606
Project: IMPALA
Issue Type: Bug
Components: Backend
Reporter: Sahil Takiar
Assignee: Sahil Takiar

In IMPALA-8525, HDFS preads were enabled by default when reading data from S3. IMPALA-8525 deferred enabling preads for ABFS because they didn't significantly improve performance there. After further investigation into the ABFS input streams, I think it is safe to use {{hdfsPreadFully}} for ABFS reads.

The ABFS client uses a different model for fetching data compared to S3A. Details are beyond the scope of this JIRA, but it is related to a feature in ABFS called "read-aheads". ABFS has logic to pre-fetch data it *thinks* will be required by the client. By default, it pre-fetches (number of cores) * 4 MB of data. If the requested data is already in the client cache, it is served from the cache.

However, there is no real drawback to using {{hdfsPreadFully}} for ABFS reads. It is also safer: the current ABFS implementation always returns the requested amount of data, but only the {{hdfsPreadFully}} API guarantees it.
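
To illustrate the difference in guarantees, here is a minimal sketch of "read fully" semantics layered on a positional read that may return fewer bytes than requested. It is not the libhdfs or Impala implementation: {{PreadFn}}, {{PreadFully}} and {{MakeShortReader}} are hypothetical names, and the in-memory reader only simulates short reads.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical positional-read callback (an assumption, not a libhdfs type):
// reads up to `len` bytes at `offset` into `buf`. Returns the number of bytes
// read, 0 at end of file, or a negative value on error.
using PreadFn =
    std::function<std::int64_t(std::int64_t offset, char* buf, std::size_t len)>;

// Keeps calling `pread` until exactly `len` bytes have been read.
// Returns false on error or if end of file is reached first.
bool PreadFully(const PreadFn& pread, std::int64_t offset, char* buf,
                std::size_t len) {
  std::size_t done = 0;
  while (done < len) {
    std::int64_t n =
        pread(offset + static_cast<std::int64_t>(done), buf + done, len - done);
    if (n <= 0) return false;  // error or premature end of file
    done += static_cast<std::size_t>(n);
  }
  return true;
}

// Toy in-memory source that returns at most `max_chunk` bytes per call, to
// simulate a stream that may return short reads. `data` must outlive the
// returned callback because it is captured by reference.
PreadFn MakeShortReader(const std::vector<char>& data, std::size_t max_chunk) {
  return [&data, max_chunk](std::int64_t offset, char* buf,
                            std::size_t len) -> std::int64_t {
    if (offset < 0) return -1;
    std::size_t off = static_cast<std::size_t>(offset);
    if (off >= data.size()) return 0;
    std::size_t n = std::min({len, max_chunk, data.size() - off});
    std::memcpy(buf, data.data() + off, n);
    return static_cast<std::int64_t>(n);
  };
}
```

A caller of a plain pread has to write the retry loop itself, or trust that the underlying stream never returns short; with a "fully" variant the loop lives in one place and the caller only checks success or failure.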

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
