Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/10/02 00:13:00 UTC
[jira] [Commented] (IMPALA-9606) ABFS reads should use hdfsPreadFully

    [ https://issues.apache.org/jira/browse/IMPALA-9606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205896#comment-17205896 ] 

ASF subversion and git services commented on IMPALA-9606:
---------------------------------------------------------

Commit 8e9cf51f6b328f500acf7c577289c5b888fd15d2 in impala's branch refs/heads/master from Sahil Takiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=8e9cf51 ]

IMPALA-9606: ABFS reads should use hdfsPreadFully

Similar to IMPALA-8525, but for ABFS, instead of S3A.
I don't expect this to make a major improvement in performance,
like it did for S3A, although I am still seeing a marginal
improvement during some ad-hoc testing (about 5% scan perf
improvement). The reason is that the ABFS and S3A client
implementations are very different: ABFS already reads all of the
requested data in a single hdfsRead call.

I ran the query 'select * from abfs_test_store_sales order by
ss_net_profit limit 10;' several times to validate that perf
does not regress. In fact, it does improve slightly for this query.
The table 'abfs_test_store_sales' is just a copy of the mini-cluster's
tpcds_parquet.store_sales, although it is not partitioned.

Testing:
* Tested against an ABFS storage account I have access to
* Ran several queries to validate there are no functional
  or perf regressions.

Change-Id: I994ea30cf31abc66f5d82d9b3c8e185d2bd06147
Reviewed-on: http://gerrit.cloudera.org:8080/16531
Reviewed-by: Joe McDonnell <jo...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> ABFS reads should use hdfsPreadFully
> ------------------------------------
>
>                 Key: IMPALA-9606
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9606
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>
> In IMPALA-8525, hdfs preads were enabled by default when reading data from S3. IMPALA-8525 deferred enabling preads for ABFS because they didn't significantly improve performance. After some more investigation into the ABFS input streams, I think it is safe to use {{hdfsPreadFully}} for ABFS reads.
> The ABFS client uses a different model for fetching data compared to S3A. Details are beyond the scope of this JIRA, but it is related to a feature in ABFS called "read-aheads". ABFS has logic to pre-fetch data it *thinks* will be required by the client. By default, it pre-fetches # cores * 4 MB of data. If the requested data exists in the client cache, it is read from the cache.
> However, there is no real drawback to using {{hdfsPreadFully}} for ABFS reads. It's definitely safer, because while the current implementation of ABFS always returns the amount of requested data, only the {{hdfsPreadFully}} API makes that guarantee.
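
To illustrate the guarantee discussed in the description, here is a minimal sketch of the difference between a single positional read, which may return fewer bytes than requested, and a "read fully" wrapper that loops until the whole range has been read. This is not the libhdfs or ABFS implementation. The names pread_fully and make_short_reader are hypothetical, the short reads are simulated in memory, and raising on a premature end of file is an assumption made for this sketch.

```python
def pread_fully(pread, position, length):
    """Call `pread` repeatedly until `length` bytes have been collected.

    `pread(position, length)` is a hypothetical callable that may return
    fewer bytes than requested (a short read) and b"" at end of file.
    """
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = pread(position, remaining)
        if not chunk:
            # Assumption for this sketch: hitting EOF before the request
            # is satisfied is reported as an error.
            raise EOFError(f"wanted {length} bytes, {remaining} still missing")
        chunks.append(chunk)
        position += len(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def make_short_reader(data, max_chunk):
    """Simulated positional reader that returns at most `max_chunk` bytes."""
    def pread(position, length):
        return data[position:position + min(length, max_chunk)]
    return pread


data = bytes(range(256)) * 4                 # 1024 bytes of sample data
short_pread = make_short_reader(data, max_chunk=100)

single_call = short_pread(10, 500)           # one call may come back short
full = pread_fully(short_pread, 10, 500)     # the loop fills the whole range

print(len(single_call), len(full))
```

A caller that relies on a single read returning everything is correct only as long as the underlying client happens to behave that way. The looping form does not depend on that behavior.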



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
