Posted to commits@impala.apache.org by st...@apache.org on 2020/10/02 00:12:52 UTC
[impala] 02/02: IMPALA-9606: ABFS reads should use hdfsPreadFully
This is an automated email from the ASF dual-hosted git repository.
stakiar pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git
commit 8e9cf51f6b328f500acf7c577289c5b888fd15d2
Author: Sahil Takiar <ta...@gmail.com>
AuthorDate: Thu Oct 1 10:31:22 2020 -0700
IMPALA-9606: ABFS reads should use hdfsPreadFully
Similar to IMPALA-8525, but for ABFS, instead of S3A.
I don't expect this to improve performance as much as it did for
S3A, although I am still seeing a marginal improvement in ad-hoc
testing (about 5% better scan performance). The reason is that the
ABFS and S3A client implementations are very different: ABFS already
reads all of the requested data in a single hdfsRead call.
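For readers unfamiliar with the distinction, the following is a minimal
sketch of "read fully" semantics: a positional read may return fewer
bytes than requested, so the caller loops until the buffer is full.
FakeRemoteFile, Pread and PreadFully are hypothetical names invented for
this illustration; this is not the libhdfs implementation of
hdfsPreadFully, only the general pattern, with the -1 error convention
taken from the call site in the diff.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical in-memory stand-in for a remote file whose client returns
// at most max_chunk bytes per positional read (a "short read").
struct FakeRemoteFile {
  std::string data;
  int64_t max_chunk;
  int calls;

  FakeRemoteFile(const std::string& d, int64_t chunk)
      : data(d), max_chunk(chunk), calls(0) {}

  // Positional read: copies up to len bytes starting at offset.
  // Returns the number of bytes copied, or 0 at end of file.
  int64_t Pread(int64_t offset, char* buf, int64_t len) {
    ++calls;
    const int64_t size = static_cast<int64_t>(data.size());
    if (offset >= size) return 0;
    const int64_t n = std::min(len, std::min(max_chunk, size - offset));
    std::memcpy(buf, data.data() + offset, static_cast<size_t>(n));
    return n;
  }
};

// Keeps issuing positional reads until exactly len bytes are in buf.
// Returns 0 on success, -1 if the file ends before len bytes were read.
int PreadFully(FakeRemoteFile& file, int64_t offset, char* buf, int64_t len) {
  int64_t done = 0;
  while (done < len) {
    const int64_t n = file.Pread(offset + done, buf + done, len - done);
    if (n <= 0) return -1;  // EOF (or error) before the request was satisfied
    done += n;
  }
  return 0;
}
```

With a client that already returns everything in one call, as described
above for ABFS, the loop body runs once, which is consistent with only a
small gain being expected.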
I ran the query 'select * from abfs_test_store_sales order by
ss_net_profit limit 10;' several times to validate that perf
does not regress. In fact, it does improve slightly for this query.
The table 'abfs_test_store_sales' is just a copy of the mini-cluster's
tpcds_parquet.store_sales, although it is not partitioned.
Testing:
* Tested against an ABFS storage account I have access to
* Ran several queries to validate there are no functional
or perf regressions.
Change-Id: I994ea30cf31abc66f5d82d9b3c8e185d2bd06147
Reviewed-on: http://gerrit.cloudera.org:8080/16531
Reviewed-by: Joe McDonnell <jo...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
be/src/runtime/io/hdfs-file-reader.cc | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/be/src/runtime/io/hdfs-file-reader.cc b/be/src/runtime/io/hdfs-file-reader.cc
index 2f985cf..34c5d61 100644
--- a/be/src/runtime/io/hdfs-file-reader.cc
+++ b/be/src/runtime/io/hdfs-file-reader.cc
@@ -36,7 +36,7 @@
DEFINE_bool(use_hdfs_pread, false, "Enables using hdfsPread() instead of hdfsRead() "
"when performing HDFS read operations. This is necessary to use HDFS hedged reads "
"(assuming the HDFS client is configured to do so). Preads are always enabled for "
- "S3A reads.");
+ "S3A and ABFS reads.");
DEFINE_int64(fs_slow_read_log_threshold_ms, 10L * 1000L,
"Log diagnostics about I/Os issued via the HDFS client that take longer than this "
@@ -221,7 +221,8 @@ Status HdfsFileReader::ReadFromPosInternal(hdfsFile hdfs_file, DiskQueue* queue,
ScopedHistogramTimer read_timer(queue->read_latency());
// For file handles from the cache, any of the below file operations may fail
// due to a bad file handle.
- if (FLAGS_use_hdfs_pread || IsS3APath(scan_range_->file_string()->c_str())) {
+ if (FLAGS_use_hdfs_pread || IsS3APath(scan_range_->file_string()->c_str())
+ || IsABFSPath(scan_range_->file_string()->c_str())) {
if (hdfsPreadFully(
hdfs_fs_, hdfs_file, position_in_file, buffer, bytes_to_read) == -1) {
return Status(TErrorCode::DISK_IO_ERROR, GetBackendString(),