Posted to reviews@impala.apache.org by "Pooja Nilangekar (Code Review)" <ge...@cloudera.org> on 2019/01/29 18:42:55 UTC

[Impala-ASF-CR] IMPALA-6932: Speed up scans for sequence datasets with many files

Pooja Nilangekar has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/11517 )

Change subject: IMPALA-6932: Speed up scans for sequence datasets with many files
......................................................................

IMPALA-6932: Speed up scans for sequence datasets with many files

This change speeds up scans of sequence datasets with many files
by enqueueing the data scan ranges at the head of the disk I/O
queue instead of the tail. This ensures that the data ranges of a
file get priority over the headers of other files. As a result,
limit queries produce results earlier.
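
The effect of head versus tail enqueueing can be illustrated with a
minimal, self-contained sketch. It is not the code in this patch:
ScanRange, ScanRangeQueue, EnqueueBack, EnqueueFront and SimulateScan
are hypothetical names, and a std::deque stands in for the real disk
I/O queue. It only shows how the service order changes once a file's
data range is placed ahead of the headers that are still waiting.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a scan range; only a label is tracked.
struct ScanRange {
  std::string label;
};

// Hypothetical queue, not Impala's InternalQueue or RequestContext.
class ScanRangeQueue {
 public:
  // Tail enqueue: new work waits behind everything already queued.
  void EnqueueBack(ScanRange r) { ranges_.push_back(std::move(r)); }

  // Head enqueue: new work is served before everything already queued.
  void EnqueueFront(ScanRange r) { ranges_.push_front(std::move(r)); }

  bool Empty() const { return ranges_.empty(); }

  // Removes and returns the range at the head of the queue.
  ScanRange Dequeue() {
    assert(!ranges_.empty());
    ScanRange r = std::move(ranges_.front());
    ranges_.pop_front();
    return r;
  }

 private:
  std::deque<ScanRange> ranges_;
};

// Queues the headers of three files, reads the first header, then
// enqueues that file's data range at the head or the tail. Returns
// the labels in the order they would be served.
std::vector<std::string> SimulateScan(bool data_to_front) {
  ScanRangeQueue q;
  q.EnqueueBack({"header:file1"});
  q.EnqueueBack({"header:file2"});
  q.EnqueueBack({"header:file3"});

  std::vector<std::string> order;
  order.push_back(q.Dequeue().label);  // The header of file1 is read first.

  ScanRange data{"data:file1"};
  if (data_to_front) {
    q.EnqueueFront(std::move(data));
  } else {
    q.EnqueueBack(std::move(data));
  }

  while (!q.Empty()) order.push_back(q.Dequeue().label);
  return order;
}
```

In this sketch, SimulateScan(false) serves data:file1 only after the
headers of file2 and file3, while SimulateScan(true) serves it
immediately after header:file1. That is the behavior a LIMIT query
benefits from when a dataset has many files.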

Testing:
Added a unit test to verify that the expected elements are
dequeued from the front.

Tested the performance of this patch on S3 to emulate remote reads.
The following query was executed several times:
"SELECT * FROM TPCH_AVRO.LINEITEM LIMIT 1;"
The average timeline difference was 8.66s vs 5.87s. The scanner I/O
wait time went down from 9.85s to 2.37s.

Tested the patch with backend and end-to-end tests.
Single node performance test results:
+----------+--------------------+---------+------------+------------+----------------+
| Workload | File Format        | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+--------------------+---------+------------+------------+----------------+
| TPCH(50) | avro / none / none | 65.62   | -0.38%     | 43.51      | -0.79%         |
+----------+--------------------+---------+------------+------------+----------------+

Change-Id: I211e2511ea3bb5edea29f1bd63e6b1fa4c4b1965
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/runtime/io/disk-io-mgr-stress.cc
M be/src/runtime/io/disk-io-mgr-test.cc
M be/src/runtime/io/request-context.cc
M be/src/runtime/io/request-context.h
M be/src/util/internal-queue-test.cc
M be/src/util/internal-queue.h
13 files changed, 159 insertions(+), 115 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/17/11517/4
-- 
To view, visit http://gerrit.cloudera.org:8080/11517
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I211e2511ea3bb5edea29f1bd63e6b1fa4c4b1965
Gerrit-Change-Number: 11517
Gerrit-PatchSet: 4
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>