Posted to issues-all@impala.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2019/01/31 19:08:00 UTC

[jira] [Commented] (IMPALA-6932) Simple LIMIT 1 query can be really slow on many-filed sequence datasets

    [ https://issues.apache.org/jira/browse/IMPALA-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757637#comment-16757637 ] 

ASF subversion and git services commented on IMPALA-6932:
---------------------------------------------------------

Commit 653ff1585daf1ae0f1c914a3d03581e6ca80c47f in impala's branch refs/heads/master from poojanilangekar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=653ff15 ]

IMPALA-6932: Speed up scans for sequence datasets with many files

This change addresses the slow scans of sequence datasets with
many files by enqueueing the scan ranges to the head of the disk
IO queue instead of the tail. This ensures that the data ranges
get priority over headers of other files. Hence it produces
results earlier for limit queries.
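
To illustrate why the enqueue position matters, here is a minimal sketch
of a toy single-queue model. It is not Impala's disk I/O manager code:
the Range struct and the ReadsUntilFirstData function are hypothetical
names made up for this example, and the model assumes that reading a file
header yields exactly one follow-up data range for the same file.

```cpp
#include <deque>

// Hypothetical stand-in for a scan range: a file header or a data range.
struct Range {
  int file_id;
  bool is_header;
};

// Returns how many ranges are dequeued up to and including the first data
// range. The queue starts with one header range per file. Reading a header
// produces one data range for that file, which is enqueued at the head of
// the queue when 'to_front' is true and at the tail otherwise.
int ReadsUntilFirstData(int num_files, bool to_front) {
  std::deque<Range> queue;
  for (int i = 0; i < num_files; ++i) queue.push_back({i, true});

  int reads = 0;
  while (!queue.empty()) {
    Range r = queue.front();
    queue.pop_front();
    ++reads;
    if (!r.is_header) return reads;  // First data range reached.

    Range data{r.file_id, false};
    if (to_front) {
      queue.push_front(data);  // Data range jumps ahead of pending headers.
    } else {
      queue.push_back(data);   // Data range waits behind every other header.
    }
  }
  return reads;  // Only reached when there are no files at all.
}
```

In this model, with 45,000 files, ReadsUntilFirstData(45000, false)
returns 45001, while ReadsUntilFirstData(45000, true) returns 2: with
head insertion a LIMIT 1 query could be satisfied after opening a single
file instead of all of them.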

Testing:
Added a unit test to verify that the expected elements are
dequeued from the front.

Tested the performance of this patch on S3 to emulate remote reads.
The following query was executed several times:
"SELECT * FROM TPCH_AVRO.LINEITEM LIMIT 1;"
The average query timeline was 8.66s before the patch vs 5.87s after. The scanner I/O
wait time went down from 9.85s to 2.37s.

Tested the patch with backend and end-to-end tests.
Single node performance test results:
+----------+--------------------+---------+------------+------------+----------------+
| Workload | File Format        | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+--------------------+---------+------------+------------+----------------+
| TPCH(50) | avro / none / none | 65.62   | -0.38%     | 43.51      | -0.79%         |
+----------+--------------------+---------+------------+------------+----------------+

Change-Id: I211e2511ea3bb5edea29f1bd63e6b1fa4c4b1965
Reviewed-on: http://gerrit.cloudera.org:8080/11517
Reviewed-by: Philip Zeyliger <ph...@cloudera.com>
Tested-by: Philip Zeyliger <ph...@cloudera.com>


> Simple LIMIT 1 query can be really slow on many-filed sequence datasets
> -----------------------------------------------------------------------
>
>                 Key: IMPALA-6932
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6932
>             Project: IMPALA
>          Issue Type: Task
>          Components: Backend
>            Reporter: Philip Zeyliger
>            Assignee: Pooja Nilangekar
>            Priority: Critical
>
> I recently ran across really slow behavior with the trivial {{SELECT * FROM table LIMIT 1}} query. The table used Avro as a file format and had about 45,000 files across about 250 partitions. An optimization kicked in to set NUM_NODES to 1.
> The query ran for about an hour, and the profile indicated that it was opening files:
>           - TotalRawHdfsOpenFileTime(*): 1.0h (3622833666032)
> I took a single minidump while this query was running, and I suspect the query was here:
> {code:java}
> 1 impalad!impala::ScannerContext::Stream::GetNextBuffer(long) [scanner-context.cc : 115 + 0x13]
> 2 impalad!impala::ScannerContext::Stream::GetBytesInternal(long, unsigned char**, bool, long*) [scanner-context.cc : 241 + 0x5]
> 3 impalad!impala::HdfsAvroScanner::ReadFileHeader() [scanner-context.inline.h : 54 + 0x1f]
> 4 impalad!impala::BaseSequenceScanner::GetNextInternal(impala::RowBatch*) [base-sequence-scanner.cc : 157 + 0x13]
> 5 impalad!impala::HdfsScanner::ProcessSplit() [hdfs-scanner.cc : 129 + 0xc]
> 6 impalad!impala::HdfsScanNode::ProcessSplit(std::vector<impala::FilterContext, std::allocator<impala::FilterContext> > const&, impala::MemPool*, impala::io::ScanRange*) [hdfs-scan-node.cc : 527 + 0x17]
> 7 impalad!impala::HdfsScanNode::ScannerThread() [hdfs-scan-node.cc : 437 + 0x1c]
> 8 impalad!impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long>*) [function_template.hpp : 767 + 0x7]{code}
>  


