Posted to issues@impala.apache.org by "Tim Armstrong (JIRA)" <ji...@apache.org> on 2018/02/22 21:23:01 UTC
[jira] [Resolved] (IMPALA-6564) Queries randomly fail with "CANCELLED" due to a race with IssueInitialRanges()

     [ https://issues.apache.org/jira/browse/IMPALA-6564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong resolved IMPALA-6564.
-----------------------------------
    Resolution: Not A Bug

It turns out that the issue was introduced in one of my patches:
https://gerrit.cloudera.org/#/c/8707/32/be/src/runtime/io/disk-io-mgr.cc

Before that patch, this early return for an empty range list in DiskIoMgr::AddScanRanges() was saving us:
{noformat}
Status DiskIoMgr::AddScanRanges(RequestContext* reader,
    const vector<ScanRange*>& ranges, bool schedule_immediately) {
  if (ranges.empty()) return Status::OK();
{noformat}
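
To illustrate why that early return matters, here is a minimal sketch. It uses hypothetical stand-in types (Code, RequestContext, ScanRange and the two function names are invented for illustration and are not Impala's real classes or signatures), and it assumes the cancellation check sits after the empty check. With the early return, adding an empty batch to an already-cancelled context is a harmless no-op. Without it, the same call reports CANCELLED.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; not Impala's real classes.
enum class Code { OK, CANCELLED };

struct ScanRange {
  std::string file;
};

struct RequestContext {
  bool cancelled = false;
};

// Variant with the early return: an empty batch is a no-op, even if the
// context has already been cancelled.
Code AddScanRangesWithEmptyCheck(const RequestContext& reader,
    const std::vector<ScanRange*>& ranges) {
  if (ranges.empty()) return Code::OK;
  if (reader.cancelled) return Code::CANCELLED;
  // A real implementation would enqueue the ranges here.
  return Code::OK;
}

// Variant without the early return: a cancelled context is reported even
// though there was nothing to add.
Code AddScanRangesWithoutEmptyCheck(const RequestContext& reader,
    const std::vector<ScanRange*>& ranges) {
  if (reader.cancelled) return Code::CANCELLED;
  (void)ranges;  // Nothing to enqueue in this sketch.
  return Code::OK;
}
```

Calling both variants with a cancelled context and an empty vector returns Code::OK from the first and Code::CANCELLED from the second, which matches the failure described in this issue.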

> Queries randomly fail with "CANCELLED" due to a race with IssueInitialRanges()
> ------------------------------------------------------------------------------
>
>                 Key: IMPALA-6564
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6564
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.12.0
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Blocker
>              Labels: flaky
>
> I've been chasing a flaky test that I saw in test_basic_runtime_filters when running against https://gerrit.cloudera.org/#/c/8966/ (the scanner buffer pool changes).
> I think it is a latent bug that has started reproducing more frequently. What I've found is:
> * Different queries fail with CANCELLED. Running impala-py.test tests/query_test/test_runtime_filters.py -n8 --verbose --maxfail 1 -k basic reproduces it on my branch about 3 times out of 4. It happens with a variety of queries and file formats.
> * It seems to happen when all files are pruned out by runtime filters.
> * Logging reveals IssueInitialRanges() fails with a CANCELLED status, which propagates up to the query status:
> {code}
>   if (!initial_ranges_issued_) {
>     // We do this in GetNext() to maximise the amount of work we can do while waiting for
>     // runtime filters to show up. The scanner threads have already started (in Open()),
>     // so we need to tell them there is work to do.
>     // TODO: This is probably not worth splitting the organisational cost of splitting
>     // initialisation across two places. Move to before the scanner threads start.
>     Status status = IssueInitialScanRanges(state);
>     if (!status.ok()) {
>       LOG(INFO) << runtime_state_->fragment_instance_id()
>                 << " IssueInitialRanges() failed with status: " << status.GetDetail()
>                 << " " << (void*) this;
>     }
> {code}
> * It appears that the CANCELLED status comes from DiskIoMgr::AddScanRanges().
> * That function returned CANCELLED because a scanner thread had already noticed that the scan was complete here and cancelled the RequestContext:
> {code}
>     // Done with range and it completed successfully
>     if (progress_.done()) {
>       // All ranges are finished.  Indicate we are done.
>       LOG(INFO) << runtime_state_->fragment_instance_id() << " All ranges done " << (void*) this;
>       SetDone();
>       break;
>     }
> {code}
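
The interleaving described above can be replayed deterministically with a small sketch. Everything here is a hypothetical stand-in (Progress, RequestContext, ScanNode and their members are invented for illustration). It also assumes two things the report does not state outright: that the progress counter reports done when there are zero ranges to process, and that SetDone() cancels the reader context.

```cpp
#include <cstdint>

// Hypothetical stand-ins for illustration only; not Impala's real classes.
struct Progress {
  int64_t total = 0;      // Ranges expected; 0 when every file is pruned.
  int64_t completed = 0;  // Ranges finished so far.
  // Assumption: with nothing to do, the scan counts as done immediately.
  bool done() const { return completed >= total; }
};

struct RequestContext {
  bool cancelled = false;
};

struct ScanNode {
  Progress progress;
  RequestContext reader;

  // Assumption: marking the scan done cancels the I/O request context.
  void SetDone() { reader.cancelled = true; }

  // Models a scanner thread noticing that all ranges are finished.
  void ScannerThreadStep() {
    if (progress.done()) SetDone();
  }

  // Models the GetNext() side issuing (possibly empty) initial ranges with
  // no early return for the empty case. Returns false for CANCELLED.
  bool IssueInitialRanges() const { return !reader.cancelled; }
};
```

If ScannerThreadStep() runs before IssueInitialRanges(), the latter fails even though no work was ever pending. In the opposite order it succeeds, which would explain why the failure is intermittent.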


