Posted to commits@impala.apache.org by st...@apache.org on 2020/10/02 00:12:52 UTC

[impala] 02/02: IMPALA-9606: ABFS reads should use hdfsPreadFully

This is an automated email from the ASF dual-hosted git repository.

stakiar pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 8e9cf51f6b328f500acf7c577289c5b888fd15d2
Author: Sahil Takiar <ta...@gmail.com>
AuthorDate: Thu Oct 1 10:31:22 2020 -0700

    IMPALA-9606: ABFS reads should use hdfsPreadFully
    
    Similar to IMPALA-8525, but for ABFS instead of S3A.
    I don't expect this to improve performance as much as it did
    for S3A, although ad-hoc testing still shows a marginal gain
    (about 5% better scan performance). The reason is that the ABFS
    and S3A clients are implemented very differently: ABFS already
    reads all of the requested data in a single hdfsRead call.
    
    I ran the query 'select * from abfs_test_store_sales order by
    ss_net_profit limit 10;' several times to validate that perf
    does not regress. In fact, it improves slightly for this query.
    The table 'abfs_test_store_sales' is a copy of the mini-cluster's
    tpcds_parquet.store_sales, except that it is not partitioned.
    
    Testing:
    * Tested against an ABFS storage account I have access to
    * Ran several queries to validate there are no functional
      or perf regressions.
    
    Change-Id: I994ea30cf31abc66f5d82d9b3c8e185d2bd06147
    Reviewed-on: http://gerrit.cloudera.org:8080/16531
    Reviewed-by: Joe McDonnell <jo...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 be/src/runtime/io/hdfs-file-reader.cc | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
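
Note on "read fully" semantics: the commit message contrasts a plain read, which may return fewer bytes than requested, with hdfsPreadFully, which the patched code treats as failed when it returns -1. The sketch below is only an illustration of that general idea (loop over a positional read until the whole range is filled), not libhdfs's or Impala's actual implementation. The names PreadFn, PreadFullySketch and MakeChunkedReader are hypothetical and do not appear in the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <functional>
#include <string>

// Hypothetical positional-read callback (an assumption for illustration):
// reads up to `len` bytes starting at `offset` into `buf` and returns the
// number of bytes read, 0 at end of file, or -1 on error.
using PreadFn = std::function<int64_t(int64_t offset, char* buf, int64_t len)>;

// Keeps issuing positional reads until exactly `len` bytes are in `buf`.
// Returns 0 on success and -1 on error or premature end of file.
int PreadFullySketch(const PreadFn& pread, int64_t offset, char* buf, int64_t len) {
  assert(len >= 0);
  int64_t done = 0;
  while (done < len) {
    int64_t n = pread(offset + done, buf + done, len - done);
    if (n <= 0) return -1;  // Error, or EOF before the range was filled.
    done += n;
  }
  return 0;
}

// Fake in-memory reader that returns at most `max_chunk` bytes per call,
// mimicking a client that hands back short reads. `calls` counts invocations.
PreadFn MakeChunkedReader(std::string data, int64_t max_chunk, int* calls) {
  return [data, max_chunk, calls](int64_t offset, char* buf, int64_t len) -> int64_t {
    ++*calls;
    if (offset < 0) return -1;
    const int64_t size = static_cast<int64_t>(data.size());
    if (offset >= size) return 0;  // End of file.
    const int64_t n = std::min({len, max_chunk, size - offset});
    std::memcpy(buf, data.data() + offset, static_cast<size_t>(n));
    return n;
  };
}
```

With a reader that returns short reads, the loop issues several calls for one logical read; with a client that already returns everything requested in one call (as the commit message describes for ABFS), the loop body runs once, which is consistent with the small gain reported above.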

diff --git a/be/src/runtime/io/hdfs-file-reader.cc b/be/src/runtime/io/hdfs-file-reader.cc
index 2f985cf..34c5d61 100644
--- a/be/src/runtime/io/hdfs-file-reader.cc
+++ b/be/src/runtime/io/hdfs-file-reader.cc
@@ -36,7 +36,7 @@
 DEFINE_bool(use_hdfs_pread, false, "Enables using hdfsPread() instead of hdfsRead() "
     "when performing HDFS read operations. This is necessary to use HDFS hedged reads "
     "(assuming the HDFS client is configured to do so). Preads are always enabled for "
-    "S3A reads.");
+    "S3A and ABFS reads.");
 
 DEFINE_int64(fs_slow_read_log_threshold_ms, 10L * 1000L,
     "Log diagnostics about I/Os issued via the HDFS client that take longer than this "
@@ -221,7 +221,8 @@ Status HdfsFileReader::ReadFromPosInternal(hdfsFile hdfs_file, DiskQueue* queue,
   ScopedHistogramTimer read_timer(queue->read_latency());
   // For file handles from the cache, any of the below file operations may fail
   // due to a bad file handle.
-  if (FLAGS_use_hdfs_pread || IsS3APath(scan_range_->file_string()->c_str())) {
+  if (FLAGS_use_hdfs_pread || IsS3APath(scan_range_->file_string()->c_str())
+      || IsABFSPath(scan_range_->file_string()->c_str())) {
     if (hdfsPreadFully(
           hdfs_fs_, hdfs_file, position_in_file, buffer, bytes_to_read) == -1) {
       return Status(TErrorCode::DISK_IO_ERROR, GetBackendString(),