Posted to dev@impala.apache.org by Todd Lipcon <to...@cloudera.com> on 2018/08/30 19:25:05 UTC

Issues with query/fragment lifecycle on trunk?

Hey folks,

I ran into some issues with a local core test run last night. This is on my
own branch with some uncommitted work, but I haven't touched the backend
and these issues seem to be backend-focused.

Initially, I noticed the problem because a test run that I started last
night was still "stuck" this morning. I had three queries which had been
running for 10 hours and not making progress. The 'fragments' page for the
query showed that one of the backends had not reported to the coordinator
for many hours. In attempting to debug this, I managed to get the
StateStore to declare that node dead, and those queries eventually were
cancelled due to that.

I resumed the node that I had been debugging with gdb, and it was declared
live again. I didn't restart the process, though, which might have led to
further issues:

Next, I also noticed that /queries on my coordinator node showed "Number of
running queries with fragments on this host" at 100+ on all three nodes,
even though no queries were running anymore. These numbers are stable. I
can continue to issue new queries and they complete normally. While queries
are running, the count of fragments goes up appropriately, and when a query
finishes, it goes back down. But the "base" numbers are stuck at
{108, 391, 402} with nothing running.

I also found that the node that had had the original problems has three
stuck fragments, all waiting on HdfsScanNode::GetNext:
https://gist.github.com/57a99bf4c00f1575810af924013c259d
Looking in the logs, I see continuous spew that it's trying to cancel the
fragment instances for this query, but apparently the cancellation is not
working. I talked to Michael Ho about this and it sounds like this is a
known issue with cancellation.

So, there are still two mysteries:
- why did it get stuck in the first place?
- why are my "number of running queries" counters stuck at non-zero values?

Does anything above ring a bell for anyone?

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Issues with query/fragment lifecycle on trunk?

Posted by Bikramjeet Vig <bi...@cloudera.com>.
On Thu, Aug 30, 2018 at 12:25 PM Todd Lipcon <to...@cloudera.com> wrote:

> Hey folks,
>
> I ran into some issues with a local core test run last night. This is on my
> own branch with some uncommitted work, but I haven't touched the backend
> and these issues seem to be backend-focused.
>
> Initially, I noticed the problem because a test run that I started last
> night was still "stuck" this morning. I had three queries which had been
> running for 10 hours and not making progress. The 'fragments' page for the
> query showed that one of the backends had not reported to the coordinator
> for many hours. In attempting to debug this, I managed to get the
> StateStore to declare that node dead, and those queries eventually were
> cancelled due to that.
>
> I resumed the node that I had been debugging with gdb, and it was declared
> live again. I didn't restart the process, though, which might have led to
> further issues:
>
> Next, I also noticed that /queries on my coordinator node showed "Number of
> running queries with fragments on this host" at 100+ on all three nodes,
> even though no queries were running anymore. These numbers are stable. I
> can continue to issue new queries and they complete normally. While a
> queries are running, the count of fragments goes up appropriate, and when
> the query finishes, it goes back down. But, the "base" numbers are stuck at
> {108, 391, 402} with nothing running.
>
> I also found that the node that had had the original problems has three
> stuck fragments, all waiting on HdfsScanNode::GetNext:
> https://gist.github.com/57a99bf4c00f1575810af924013c259d
> Looking in the logs, I see continuous spew that it's trying to cancel the
> fragment instances for this query, but apparently the cancellation is not
> working. I talked to Michael Ho about this and it sounds like this is a
> known issue with cancellation.
>
> So, there are still two mysteries:
> - why did it get stuck in the first place?
> * - why are my "number of running queries" counters stuck at non-zero
> values?*

This is definitely a bug; I can reliably recreate it by running a query
that is rejected by the admission controller. I have filed a JIRA for this
(IMPALA-7516 <https://issues.apache.org/jira/browse/IMPALA-7516>) and will
look into it further.
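
To make the suspicion concrete, here is a purely hypothetical sketch of the
kind of accounting imbalance that would explain it; the names and structure
are mine for illustration, not the actual coordinator or admission-control
code:

#include <cstdint>

// Hypothetical illustration only -- not the real Impala code paths.
struct HostFragmentStats {
  int64_t num_running_queries_with_fragments = 0;
};

// Counter goes up when a query registers fragments on the host...
void OnQueryRegistered(HostFragmentStats* stats) {
  ++stats->num_running_queries_with_fragments;
}

// ...and only comes back down on the normal release/teardown path.
void OnQueryReleased(HostFragmentStats* stats) {
  --stats->num_running_queries_with_fragments;
}

// If an admission-control rejection bails out after OnQueryRegistered()
// without ever reaching OnQueryReleased(), the per-host count stays
// inflated forever -- consistent with the stuck "base" numbers above.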

>
>
> Does anything above ring a bell for anyone?
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Issues with query/fragment lifecycle on trunk?

Posted by Todd Lipcon <to...@cloudera.com>.
On Thu, Aug 30, 2018 at 2:48 PM, Todd Lipcon <to...@cloudera.com> wrote:

> On Thu, Aug 30, 2018 at 2:44 PM, Pooja Nilangekar <
> pooja.nilangekar@cloudera.com> wrote:
>
>> Hi Todd,
>>
>> I believe you are right. There are a couple of other race conditions in
>> the
>> HdfsScanNode which have resulted in broken-builds in the last couple of
>> weeks. (IMPALA-7418 and IMPALA-7335). While reviewing the change for
>> IMPALA-7335, Tim and Dan suggested that we need to cleanup the state
>> invariants of the HdfsScanNode to find and fix other such races. I am
>> going
>> to create a JIRA to track it, I could add this issue to the list of races
>> to be fixed.  If this issue occurs more frequently, we could add a
>> temporary fix for now. Let me know what you'd suggest.
>>
>
> Makes sense. I'm currently trying to put together a simple test and fix
> that targets this race in particular, but definitely agree that the
> interactions between the scanner threads and scan node are hard to reason
> about and could be improved more generally.
>
> For now I'm trying to use a DebugAction to inject probabilistic failures
> into the soft memory limit checks to see if I can reproduce it more easily.
>

Put this up here with a fix: https://gerrit.cloudera.org/c/11369/

-Todd


>
> -Todd
>
>
>> On Thu, Aug 30, 2018 at 1:27 PM Todd Lipcon <to...@cloudera.com> wrote:
>>
>> > I spent another 45 minutes looking at the core and think I figured this
>> > out. I found that for the stuck scanner, both of its scanner threads and
>> > run and exited, but for some reason it was never marked as "done".
>> >
>> > I think this might be a regression due to IMPALA-7096
>> > (7ccf7369085aa49a8fc0daf6f91d97b8a3135682). The scanner thread has the
>> > following code:
>> >
>> >     // Stop extra threads if we're over a soft limit in order to free up
>> > memory.
>> >     if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT))
>> {
>> >       break;
>> >     }
>> >
>> >     // Done with range and it completed successfully
>> >     if (progress_.done()) {
>> >       // All ranges are finished.  Indicate we are done.
>> >       SetDone();
>> >       break;
>> >     }
>> >
>> >     if (scan_range == nullptr && num_unqueued_files == 0) {
>> >       unique_lock<mutex> l(lock_);
>> >       // All ranges have been queued and DiskIoMgr has no more new
>> ranges
>> > for this scan
>> >       // node to process. This means that every range is either done or
>> > being processed by
>> >       // another thread.
>> >       all_ranges_started_ = true;
>> >       break;
>> >     }
>> >   }
>> >
>> > What if we have the following scenario:
>> >
>> > T1) grab scan range 1 and start processing
>> >
>> > T2) grab scan range 2 and start processing
>> >
>> > T1) finish scan range 1 and see that 'progress_' is not done()
>> > T1) loop around, get no scan range (there are no more), so set
>> > all_ranges_satrted_ and break
>> > T1) thread exits
>> >
>> > T2) finish scan range 2
>> > T2) happen to hit a soft memory limit error due to pressure from other
>> exec
>> > nodes, etc. Since we aren't the first thread, we break. (even though the
>> > first thread is no longer running)
>> > T2) thread exits
>> >
>> > Note that no one got to the point of calling SetDone() because we break
>> due
>> > to the memory limit error _before_ checking progress_.Done().
>> >
>> > Thus, the query will hang forever.
>> >
>> > It seems to me that the order of these conditions needs to be
>> rejiggered:
>> > the breaking due to memory limit should be the last thing in the loop so
>> > that, if we were the thread to finish the last scan range, we properly
>> > SetDone before exiting.
>> >
>> > Does that makes sense? If so, I'll file a JIRA. Suggestions also
>> welcome on
>> > how to write a test for this (do we have an easy facility to inject
>> > probabilistic mem-limit failures in that code path?)
>> >
>> > -Todd
>> >
>> >
>> > On Thu, Aug 30, 2018 at 12:35 PM, Michael Ho <kw...@cloudera.com> wrote:
>> >
>> > > FWIW, please see the discussion at
>> > > https://issues.apache.org/jira/browse/IMPALA-6194 about how "stuck
>> IO"
>> > may
>> > > render cancellation in HDFS scan node ineffective. We have seen users
>> > > reported similar backtraces in the past due to hung RPCs to HDFS name
>> > node.
>> > > Not saying this is the necessarily the case here.
>> > >
>> > > A couple of things which may be interesting to look at:
>> > > - is there any scan node in the profile which doesn't finish any
>> assigned
>> > > scan ranges ?
>> > > - if you happen to have a core, it may help to inspect the stack
>> traces
>> > of
>> > > the scanner threads and the disk io mgr threads to understand their
>> > states.
>> > >
>> > > On Thu, Aug 30, 2018 at 12:25 PM Todd Lipcon <to...@cloudera.com>
>> wrote:
>> > >
>> > > > Hey folks,
>> > > >
>> > > > I ran into some issues with a local core test run last night. This
>> is
>> > on
>> > > my
>> > > > own branch with some uncommitted work, but I haven't touched the
>> > backend
>> > > > and these issues seem to be backend-focused.
>> > > >
>> > > > Initially, I noticed the problem because a test run that I started
>> last
>> > > > night was still "stuck" this morning. I had three queries which had
>> > been
>> > > > running for 10 hours and not making progress. The 'fragments' page
>> for
>> > > the
>> > > > query showed that one of the backends had not reported to the
>> > coordinator
>> > > > for many hours. In attempting to debug this, I managed to get the
>> > > > StateStore to declare that node dead, and those queries eventually
>> were
>> > > > cancelled due to that.
>> > > >
>> > > > I resumed the node that I had been debugging with gdb, and it was
>> > > declared
>> > > > live again. I didn't restart the process, though, which might have
>> led
>> > to
>> > > > further issues:
>> > > >
>> > > > Next, I also noticed that /queries on my coordinator node showed
>> > "Number
>> > > of
>> > > > running queries with fragments on this host" at 100+ on all three
>> > nodes,
>> > > > even though no queries were running anymore. These numbers are
>> stable.
>> > I
>> > > > can continue to issue new queries and they complete normally. While
>> a
>> > > > queries are running, the count of fragments goes up appropriate, and
>> > when
>> > > > the query finishes, it goes back down. But, the "base" numbers are
>> > stuck
>> > > at
>> > > > {108, 391, 402} with nothing running.
>> > > >
>> > > > I also found that the node that had had the original problems has
>> three
>> > > > stuck fragments, all waiting on HdfsScanNode::GetNext:
>> > > > https://gist.github.com/57a99bf4c00f1575810af924013c259d
>> > > > Looking in the logs, I see continuous spew that it's trying to
>> cancel
>> > the
>> > > > fragment instances for this query, but apparently the cancellation
>> is
>> > not
>> > > > working. I talked to Michael Ho about this and it sounds like this
>> is a
>> > > > known issue with cancellation.
>> > > >
>> > > > So, there are still two mysteries:
>> > > > - why did it get stuck in the first place?
>> > > > - why are my "number of running queries" counters stuck at non-zero
>> > > values?
>> > > >
>> > > > Does anything above ring a bell for anyone?
>> > > >
>> > > > -Todd
>> > > > --
>> > > > Todd Lipcon
>> > > > Software Engineer, Cloudera
>> > > >
>> > >
>> > >
>> > > --
>> > > Thanks,
>> > > Michael
>> > >
>> >
>> >
>> >
>> > --
>> > Todd Lipcon
>> > Software Engineer, Cloudera
>> >
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Issues with query/fragment lifecycle on trunk?

Posted by Todd Lipcon <to...@cloudera.com>.
On Thu, Aug 30, 2018 at 2:44 PM, Pooja Nilangekar <
pooja.nilangekar@cloudera.com> wrote:

> Hi Todd,
>
> I believe you are right. There are a couple of other race conditions in the
> HdfsScanNode which have resulted in broken-builds in the last couple of
> weeks. (IMPALA-7418 and IMPALA-7335). While reviewing the change for
> IMPALA-7335, Tim and Dan suggested that we need to cleanup the state
> invariants of the HdfsScanNode to find and fix other such races. I am going
> to create a JIRA to track it, I could add this issue to the list of races
> to be fixed.  If this issue occurs more frequently, we could add a
> temporary fix for now. Let me know what you'd suggest.
>

Makes sense. I'm currently trying to put together a simple test and fix
that targets this race in particular, but I definitely agree that the
interactions between the scanner threads and the scan node are hard to
reason about and could be improved more generally.

For now I'm trying to use a DebugAction to inject probabilistic failures
into the soft memory limit checks to see if I can reproduce it more easily.
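
In case it helps anyone else reproduce it, the injection I'm experimenting
with looks roughly like the sketch below; the helper name and the hard-coded
probability are just for illustration, not the real debug-action plumbing:

#include <random>

// Illustration only: fire a fake soft-limit failure with probability 'p' so
// the early-exit branch in the scanner thread loop can be exercised without
// real memory pressure. The real patch would drive this from a debug action
// or query option rather than a hard-coded probability.
static bool InjectSoftLimitFailure(double p) {
  static thread_local std::mt19937 gen{std::random_device{}()};
  return std::bernoulli_distribution(p)(gen);
}

// The scanner-thread check would then become something like:
//
//   if (!first_thread &&
//       (mem_tracker_->AnyLimitExceeded(MemLimit::SOFT) ||
//        InjectSoftLimitFailure(0.5))) {
//     break;
//   }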

-Todd


> On Thu, Aug 30, 2018 at 1:27 PM Todd Lipcon <to...@cloudera.com> wrote:
>
> > I spent another 45 minutes looking at the core and think I figured this
> > out. I found that for the stuck scanner, both of its scanner threads and
> > run and exited, but for some reason it was never marked as "done".
> >
> > I think this might be a regression due to IMPALA-7096
> > (7ccf7369085aa49a8fc0daf6f91d97b8a3135682). The scanner thread has the
> > following code:
> >
> >     // Stop extra threads if we're over a soft limit in order to free up
> > memory.
> >     if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT))
> {
> >       break;
> >     }
> >
> >     // Done with range and it completed successfully
> >     if (progress_.done()) {
> >       // All ranges are finished.  Indicate we are done.
> >       SetDone();
> >       break;
> >     }
> >
> >     if (scan_range == nullptr && num_unqueued_files == 0) {
> >       unique_lock<mutex> l(lock_);
> >       // All ranges have been queued and DiskIoMgr has no more new ranges
> > for this scan
> >       // node to process. This means that every range is either done or
> > being processed by
> >       // another thread.
> >       all_ranges_started_ = true;
> >       break;
> >     }
> >   }
> >
> > What if we have the following scenario:
> >
> > T1) grab scan range 1 and start processing
> >
> > T2) grab scan range 2 and start processing
> >
> > T1) finish scan range 1 and see that 'progress_' is not done()
> > T1) loop around, get no scan range (there are no more), so set
> > all_ranges_satrted_ and break
> > T1) thread exits
> >
> > T2) finish scan range 2
> > T2) happen to hit a soft memory limit error due to pressure from other
> exec
> > nodes, etc. Since we aren't the first thread, we break. (even though the
> > first thread is no longer running)
> > T2) thread exits
> >
> > Note that no one got to the point of calling SetDone() because we break
> due
> > to the memory limit error _before_ checking progress_.Done().
> >
> > Thus, the query will hang forever.
> >
> > It seems to me that the order of these conditions needs to be rejiggered:
> > the breaking due to memory limit should be the last thing in the loop so
> > that, if we were the thread to finish the last scan range, we properly
> > SetDone before exiting.
> >
> > Does that makes sense? If so, I'll file a JIRA. Suggestions also welcome
> on
> > how to write a test for this (do we have an easy facility to inject
> > probabilistic mem-limit failures in that code path?)
> >
> > -Todd
> >
> >
> > On Thu, Aug 30, 2018 at 12:35 PM, Michael Ho <kw...@cloudera.com> wrote:
> >
> > > FWIW, please see the discussion at
> > > https://issues.apache.org/jira/browse/IMPALA-6194 about how "stuck IO"
> > may
> > > render cancellation in HDFS scan node ineffective. We have seen users
> > > reported similar backtraces in the past due to hung RPCs to HDFS name
> > node.
> > > Not saying this is the necessarily the case here.
> > >
> > > A couple of things which may be interesting to look at:
> > > - is there any scan node in the profile which doesn't finish any
> assigned
> > > scan ranges ?
> > > - if you happen to have a core, it may help to inspect the stack traces
> > of
> > > the scanner threads and the disk io mgr threads to understand their
> > states.
> > >
> > > On Thu, Aug 30, 2018 at 12:25 PM Todd Lipcon <to...@cloudera.com>
> wrote:
> > >
> > > > Hey folks,
> > > >
> > > > I ran into some issues with a local core test run last night. This is
> > on
> > > my
> > > > own branch with some uncommitted work, but I haven't touched the
> > backend
> > > > and these issues seem to be backend-focused.
> > > >
> > > > Initially, I noticed the problem because a test run that I started
> last
> > > > night was still "stuck" this morning. I had three queries which had
> > been
> > > > running for 10 hours and not making progress. The 'fragments' page
> for
> > > the
> > > > query showed that one of the backends had not reported to the
> > coordinator
> > > > for many hours. In attempting to debug this, I managed to get the
> > > > StateStore to declare that node dead, and those queries eventually
> were
> > > > cancelled due to that.
> > > >
> > > > I resumed the node that I had been debugging with gdb, and it was
> > > declared
> > > > live again. I didn't restart the process, though, which might have
> led
> > to
> > > > further issues:
> > > >
> > > > Next, I also noticed that /queries on my coordinator node showed
> > "Number
> > > of
> > > > running queries with fragments on this host" at 100+ on all three
> > nodes,
> > > > even though no queries were running anymore. These numbers are
> stable.
> > I
> > > > can continue to issue new queries and they complete normally. While a
> > > > queries are running, the count of fragments goes up appropriate, and
> > when
> > > > the query finishes, it goes back down. But, the "base" numbers are
> > stuck
> > > at
> > > > {108, 391, 402} with nothing running.
> > > >
> > > > I also found that the node that had had the original problems has
> three
> > > > stuck fragments, all waiting on HdfsScanNode::GetNext:
> > > > https://gist.github.com/57a99bf4c00f1575810af924013c259d
> > > > Looking in the logs, I see continuous spew that it's trying to cancel
> > the
> > > > fragment instances for this query, but apparently the cancellation is
> > not
> > > > working. I talked to Michael Ho about this and it sounds like this
> is a
> > > > known issue with cancellation.
> > > >
> > > > So, there are still two mysteries:
> > > > - why did it get stuck in the first place?
> > > > - why are my "number of running queries" counters stuck at non-zero
> > > values?
> > > >
> > > > Does anything above ring a bell for anyone?
> > > >
> > > > -Todd
> > > > --
> > > > Todd Lipcon
> > > > Software Engineer, Cloudera
> > > >
> > >
> > >
> > > --
> > > Thanks,
> > > Michael
> > >
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Issues with query/fragment lifecycle on trunk?

Posted by Pooja Nilangekar <po...@cloudera.com>.
Hi Todd,

I believe you are right. There are a couple of other race conditions in the
HdfsScanNode which have resulted in broken builds in the last couple of
weeks (IMPALA-7418 and IMPALA-7335). While reviewing the change for
IMPALA-7335, Tim and Dan suggested that we need to clean up the state
invariants of the HdfsScanNode to find and fix other such races. I am going
to create a JIRA to track it; I could add this issue to the list of races
to be fixed. If this issue occurs more frequently, we could add a
temporary fix for now. Let me know what you'd suggest.

Thanks,
Pooja


On Thu, Aug 30, 2018 at 1:27 PM Todd Lipcon <to...@cloudera.com> wrote:

> I spent another 45 minutes looking at the core and think I figured this
> out. I found that for the stuck scanner, both of its scanner threads and
> run and exited, but for some reason it was never marked as "done".
>
> I think this might be a regression due to IMPALA-7096
> (7ccf7369085aa49a8fc0daf6f91d97b8a3135682). The scanner thread has the
> following code:
>
>     // Stop extra threads if we're over a soft limit in order to free up
> memory.
>     if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) {
>       break;
>     }
>
>     // Done with range and it completed successfully
>     if (progress_.done()) {
>       // All ranges are finished.  Indicate we are done.
>       SetDone();
>       break;
>     }
>
>     if (scan_range == nullptr && num_unqueued_files == 0) {
>       unique_lock<mutex> l(lock_);
>       // All ranges have been queued and DiskIoMgr has no more new ranges
> for this scan
>       // node to process. This means that every range is either done or
> being processed by
>       // another thread.
>       all_ranges_started_ = true;
>       break;
>     }
>   }
>
> What if we have the following scenario:
>
> T1) grab scan range 1 and start processing
>
> T2) grab scan range 2 and start processing
>
> T1) finish scan range 1 and see that 'progress_' is not done()
> T1) loop around, get no scan range (there are no more), so set
> all_ranges_satrted_ and break
> T1) thread exits
>
> T2) finish scan range 2
> T2) happen to hit a soft memory limit error due to pressure from other exec
> nodes, etc. Since we aren't the first thread, we break. (even though the
> first thread is no longer running)
> T2) thread exits
>
> Note that no one got to the point of calling SetDone() because we break due
> to the memory limit error _before_ checking progress_.Done().
>
> Thus, the query will hang forever.
>
> It seems to me that the order of these conditions needs to be rejiggered:
> the breaking due to memory limit should be the last thing in the loop so
> that, if we were the thread to finish the last scan range, we properly
> SetDone before exiting.
>
> Does that makes sense? If so, I'll file a JIRA. Suggestions also welcome on
> how to write a test for this (do we have an easy facility to inject
> probabilistic mem-limit failures in that code path?)
>
> -Todd
>
>
> On Thu, Aug 30, 2018 at 12:35 PM, Michael Ho <kw...@cloudera.com> wrote:
>
> > FWIW, please see the discussion at
> > https://issues.apache.org/jira/browse/IMPALA-6194 about how "stuck IO"
> may
> > render cancellation in HDFS scan node ineffective. We have seen users
> > reported similar backtraces in the past due to hung RPCs to HDFS name
> node.
> > Not saying this is the necessarily the case here.
> >
> > A couple of things which may be interesting to look at:
> > - is there any scan node in the profile which doesn't finish any assigned
> > scan ranges ?
> > - if you happen to have a core, it may help to inspect the stack traces
> of
> > the scanner threads and the disk io mgr threads to understand their
> states.
> >
> > On Thu, Aug 30, 2018 at 12:25 PM Todd Lipcon <to...@cloudera.com> wrote:
> >
> > > Hey folks,
> > >
> > > I ran into some issues with a local core test run last night. This is
> on
> > my
> > > own branch with some uncommitted work, but I haven't touched the
> backend
> > > and these issues seem to be backend-focused.
> > >
> > > Initially, I noticed the problem because a test run that I started last
> > > night was still "stuck" this morning. I had three queries which had
> been
> > > running for 10 hours and not making progress. The 'fragments' page for
> > the
> > > query showed that one of the backends had not reported to the
> coordinator
> > > for many hours. In attempting to debug this, I managed to get the
> > > StateStore to declare that node dead, and those queries eventually were
> > > cancelled due to that.
> > >
> > > I resumed the node that I had been debugging with gdb, and it was
> > declared
> > > live again. I didn't restart the process, though, which might have led
> to
> > > further issues:
> > >
> > > Next, I also noticed that /queries on my coordinator node showed
> "Number
> > of
> > > running queries with fragments on this host" at 100+ on all three
> nodes,
> > > even though no queries were running anymore. These numbers are stable.
> I
> > > can continue to issue new queries and they complete normally. While a
> > > queries are running, the count of fragments goes up appropriate, and
> when
> > > the query finishes, it goes back down. But, the "base" numbers are
> stuck
> > at
> > > {108, 391, 402} with nothing running.
> > >
> > > I also found that the node that had had the original problems has three
> > > stuck fragments, all waiting on HdfsScanNode::GetNext:
> > > https://gist.github.com/57a99bf4c00f1575810af924013c259d
> > > Looking in the logs, I see continuous spew that it's trying to cancel
> the
> > > fragment instances for this query, but apparently the cancellation is
> not
> > > working. I talked to Michael Ho about this and it sounds like this is a
> > > known issue with cancellation.
> > >
> > > So, there are still two mysteries:
> > > - why did it get stuck in the first place?
> > > - why are my "number of running queries" counters stuck at non-zero
> > values?
> > >
> > > Does anything above ring a bell for anyone?
> > >
> > > -Todd
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
> > >
> >
> >
> > --
> > Thanks,
> > Michael
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Issues with query/fragment lifecycle on trunk?

Posted by Todd Lipcon <to...@cloudera.com>.
I spent another 45 minutes looking at the core and think I figured this
out. I found that for the stuck scanner, both of its scanner threads had
run and exited, but for some reason it was never marked as "done".

I think this might be a regression due to IMPALA-7096
(7ccf7369085aa49a8fc0daf6f91d97b8a3135682). The scanner thread has the
following code:

    // Stop extra threads if we're over a soft limit in order to free up
    // memory.
    if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) {
      break;
    }

    // Done with range and it completed successfully
    if (progress_.done()) {
      // All ranges are finished.  Indicate we are done.
      SetDone();
      break;
    }

    if (scan_range == nullptr && num_unqueued_files == 0) {
      unique_lock<mutex> l(lock_);
      // All ranges have been queued and DiskIoMgr has no more new ranges
      // for this scan node to process. This means that every range is
      // either done or being processed by another thread.
      all_ranges_started_ = true;
      break;
    }
  }

What if we have the following scenario:

T1) grab scan range 1 and start processing

T2) grab scan range 2 and start processing

T1) finish scan range 1 and see that 'progress_' is not done()
T1) loop around, get no scan range (there are no more), so set
all_ranges_started_ and break
T1) thread exits

T2) finish scan range 2
T2) happen to hit a soft memory limit error due to pressure from other exec
nodes, etc. Since we aren't the first thread, we break (even though the
first thread is no longer running)
T2) thread exits

Note that no one got to the point of calling SetDone() because we break due
to the memory limit error _before_ checking progress_.done().

Thus, the query will hang forever.

It seems to me that the order of these conditions needs to be rejiggered:
the breaking due to memory limit should be the last thing in the loop so
that, if we were the thread to finish the last scan range, we properly
SetDone before exiting.
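
Concretely, the tail of the loop would become something like this (same
variables as the snippet above; a sketch, not a tested patch):

    // Done with range and it completed successfully
    if (progress_.done()) {
      // All ranges are finished. Whichever thread finishes the last range
      // must call SetDone(), regardless of memory pressure.
      SetDone();
      break;
    }

    if (scan_range == nullptr && num_unqueued_files == 0) {
      unique_lock<mutex> l(lock_);
      // All ranges have been queued and DiskIoMgr has no more new ranges
      // for this scan node to process.
      all_ranges_started_ = true;
      break;
    }

    // Only once the completion checks above have run: stop extra threads
    // if we're over a soft limit, in order to free up memory.
    if (!first_thread && mem_tracker_->AnyLimitExceeded(MemLimit::SOFT)) {
      break;
    }
  }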

Does that make sense? If so, I'll file a JIRA. Suggestions also welcome on
how to write a test for this (do we have an easy facility to inject
probabilistic mem-limit failures in that code path?)

-Todd


On Thu, Aug 30, 2018 at 12:35 PM, Michael Ho <kw...@cloudera.com> wrote:

> FWIW, please see the discussion at
> https://issues.apache.org/jira/browse/IMPALA-6194 about how "stuck IO" may
> render cancellation in HDFS scan node ineffective. We have seen users
> reported similar backtraces in the past due to hung RPCs to HDFS name node.
> Not saying this is the necessarily the case here.
>
> A couple of things which may be interesting to look at:
> - is there any scan node in the profile which doesn't finish any assigned
> scan ranges ?
> - if you happen to have a core, it may help to inspect the stack traces of
> the scanner threads and the disk io mgr threads to understand their states.
>
> On Thu, Aug 30, 2018 at 12:25 PM Todd Lipcon <to...@cloudera.com> wrote:
>
> > Hey folks,
> >
> > I ran into some issues with a local core test run last night. This is on
> my
> > own branch with some uncommitted work, but I haven't touched the backend
> > and these issues seem to be backend-focused.
> >
> > Initially, I noticed the problem because a test run that I started last
> > night was still "stuck" this morning. I had three queries which had been
> > running for 10 hours and not making progress. The 'fragments' page for
> the
> > query showed that one of the backends had not reported to the coordinator
> > for many hours. In attempting to debug this, I managed to get the
> > StateStore to declare that node dead, and those queries eventually were
> > cancelled due to that.
> >
> > I resumed the node that I had been debugging with gdb, and it was
> declared
> > live again. I didn't restart the process, though, which might have led to
> > further issues:
> >
> > Next, I also noticed that /queries on my coordinator node showed "Number
> of
> > running queries with fragments on this host" at 100+ on all three nodes,
> > even though no queries were running anymore. These numbers are stable. I
> > can continue to issue new queries and they complete normally. While a
> > queries are running, the count of fragments goes up appropriate, and when
> > the query finishes, it goes back down. But, the "base" numbers are stuck
> at
> > {108, 391, 402} with nothing running.
> >
> > I also found that the node that had had the original problems has three
> > stuck fragments, all waiting on HdfsScanNode::GetNext:
> > https://gist.github.com/57a99bf4c00f1575810af924013c259d
> > Looking in the logs, I see continuous spew that it's trying to cancel the
> > fragment instances for this query, but apparently the cancellation is not
> > working. I talked to Michael Ho about this and it sounds like this is a
> > known issue with cancellation.
> >
> > So, there are still two mysteries:
> > - why did it get stuck in the first place?
> > - why are my "number of running queries" counters stuck at non-zero
> values?
> >
> > Does anything above ring a bell for anyone?
> >
> > -Todd
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>
>
> --
> Thanks,
> Michael
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Issues with query/fragment lifecycle on trunk?

Posted by Michael Ho <kw...@cloudera.com>.
FWIW, please see the discussion at
https://issues.apache.org/jira/browse/IMPALA-6194 about how "stuck IO" may
render cancellation in the HDFS scan node ineffective. We have seen users
report similar backtraces in the past due to hung RPCs to the HDFS NameNode.
Not saying this is necessarily the case here.

A couple of things which may be interesting to look at:
- is there any scan node in the profile which doesn't finish any assigned
scan ranges?
- if you happen to have a core, it may help to inspect the stack traces of
the scanner threads and the disk io mgr threads to understand their states.
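
For example, dumping every thread's backtrace from the core in one shot
(the paths are placeholders for your own build and core file):

  # Write all thread stacks to a file for inspection.
  gdb -batch -ex "thread apply all bt" \
      /path/to/impalad /path/to/core > all_stacks.txt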

On Thu, Aug 30, 2018 at 12:25 PM Todd Lipcon <to...@cloudera.com> wrote:

> Hey folks,
>
> I ran into some issues with a local core test run last night. This is on my
> own branch with some uncommitted work, but I haven't touched the backend
> and these issues seem to be backend-focused.
>
> Initially, I noticed the problem because a test run that I started last
> night was still "stuck" this morning. I had three queries which had been
> running for 10 hours and not making progress. The 'fragments' page for the
> query showed that one of the backends had not reported to the coordinator
> for many hours. In attempting to debug this, I managed to get the
> StateStore to declare that node dead, and those queries eventually were
> cancelled due to that.
>
> I resumed the node that I had been debugging with gdb, and it was declared
> live again. I didn't restart the process, though, which might have led to
> further issues:
>
> Next, I also noticed that /queries on my coordinator node showed "Number of
> running queries with fragments on this host" at 100+ on all three nodes,
> even though no queries were running anymore. These numbers are stable. I
> can continue to issue new queries and they complete normally. While a
> queries are running, the count of fragments goes up appropriate, and when
> the query finishes, it goes back down. But, the "base" numbers are stuck at
> {108, 391, 402} with nothing running.
>
> I also found that the node that had had the original problems has three
> stuck fragments, all waiting on HdfsScanNode::GetNext:
> https://gist.github.com/57a99bf4c00f1575810af924013c259d
> Looking in the logs, I see continuous spew that it's trying to cancel the
> fragment instances for this query, but apparently the cancellation is not
> working. I talked to Michael Ho about this and it sounds like this is a
> known issue with cancellation.
>
> So, there are still two mysteries:
> - why did it get stuck in the first place?
> - why are my "number of running queries" counters stuck at non-zero values?
>
> Does anything above ring a bell for anyone?
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
>


-- 
Thanks,
Michael