Posted to issues-all@impala.apache.org by "Sahil Takiar (Jira)" <ji...@apache.org> on 2019/08/23 23:39:00 UTC

[jira] [Comment Edited] (IMPALA-8890) DCHECK(!page->attached_to_output_batch) in SpillableRowBatchQueue::AddBatch

    [ https://issues.apache.org/jira/browse/IMPALA-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914715#comment-16914715 ] 

Sahil Takiar edited comment on IMPALA-8890 at 8/23/19 11:38 PM:
----------------------------------------------------------------

Can reproduce this pretty consistently now (same setup as above, but with some additional logging and the minicluster changed to 1 dedicated coordinator and 3 executors). Here is what I have found so far:
 * It doesn't *look* like a race condition
 * It seems to only happen when:
 ** Rows are being added to the {{BufferedTupleStream}} successfully, until: {{SpillableRowBatchQueue::GetBatch}} >>> {{BufferedTupleStream::GetNext}} >>> {{read_page_->AttachBufferToBatch}}
 ** Then, without any additional calls to {{BufferedTupleStream::GetNext}} (I think this part may be relevant because everything works if there are additional calls to {{GetNext}}), {{SpillableRowBatchQueue::AddBatch}} >>> {{BufferedTupleStream::AddRow}} is called repeatedly (for multiple {{RowBatch}}-es)
 ** This continues until {{BufferedTupleStream::AddRow}} eventually returns false (presumably because the reservation limits have been hit); {{BufferedTupleStream::UnpinStream}} is then called and eventually hits the DCHECK above
 * The DCHECK is hit because:
 ** Looking at the state of the {{Page}}-s in the {{BufferedTupleStream}}, it looks like the last call to {{BufferedTupleStream::GetNext}} calls {{BufferedTupleStream::AttachBufferToBatch}} on the {{read_page_}}, which sets {{attached_to_output_batch}} to true for that {{Page}}
 ** Then {{UnpinStream}} is called; it iterates through all the {{pages_}}, sees that the {{read_page_}} has {{attached_to_output_batch}} set to true, and fails (I confirmed through logging that it fails specifically on the {{read_page_}} that had {{attached_to_output_batch}} set to true above)
 ** *If* there had been an additional call to {{GetNext}}, then {{NextReadPage()}} would have been called, which calls {{pages_.pop_front()}} and removes the {{read_page_}} with {{attached_to_output_batch}} set to true from the list of pages (see the sketch below)
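
To make the ordering concrete, here is a minimal, self-contained C++ sketch of what I believe is happening. It is *not* the real Impala code: the class is a stand-in and the method bodies are simplified assumptions, though the names ({{pages_}}, {{read_page_}}, {{attached_to_output_batch}}, {{GetNext}}, {{NextReadPage}}, {{UnpinStream}}) mirror the ones discussed above. It just shows how attaching the read page's buffer in one {{GetNext}} call and then unpinning, without the follow-up {{GetNext}} that would pop the page, trips the same kind of check as the DCHECK:
{code:cpp}
// Stand-alone model of the suspected ordering bug; *not* Impala source.
// Names mirror the members discussed above, but the logic is simplified.
#include <cassert>
#include <list>

struct Page {
  bool attached_to_output_batch = false;
};

struct StreamModel {
  std::list<Page> pages_;
  std::list<Page>::iterator read_page_;

  StreamModel() : pages_(2), read_page_(pages_.begin()) {}

  // Simplified GetNext with attach_on_read: hand the read page's buffer to
  // the output batch now, but only pop the page on the *next* call.
  void GetNext() {
    if (read_page_ != pages_.end() && read_page_->attached_to_output_batch) {
      // Stand-in for NextReadPage(): drop the page whose buffer was handed off.
      read_page_ = pages_.erase(read_page_);
    }
    if (read_page_ == pages_.end()) return;
    // Stand-in for read_page_->AttachBufferToBatch().
    read_page_->attached_to_output_batch = true;
  }

  // Simplified UnpinStream: walks every page, analogous to the
  // DCHECK(!page->attached_to_output_batch) that fails in the report.
  void UnpinStream() {
    for (const Page& page : pages_) {
      assert(!page.attached_to_output_batch);
    }
  }
};

int main() {
  StreamModel stream;
  stream.GetNext();      // attaches read_page_'s buffer; page stays in pages_
  // No second GetNext(), so the attached page is never popped from pages_.
  stream.UnpinStream();  // assertion fires, analogous to the reported DCHECK
  return 0;
}
{code}
Building this with assertions enabled and running it aborts in {{UnpinStream()}}; adding a second {{stream.GetNext()}} before the unpin (which pops the attached page) makes it pass, matching the behavior described above.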

So *I think* this is a bug in {{BufferedTupleStream}}, unless there is something off with the way {{SpillableRowBatchQueue}} is using {{BufferedTupleStream}}. Wondering what [~tarmstrong@cloudera.com] thinks?

Here is a snippet of the modified logs that may show things more clearly:
{code:java}
I0823 16:08:32.386766 33770 buffered-plan-root-sink.cc:169] Getting Batch
I0823 16:08:32.388576 33770 buffered-plan-root-sink.cc:169] Getting Batch
...
I0823 16:08:32.394234 33770 buffered-tuple-stream.cc:804] Calling AttachBufferToBatch
I0823 16:08:32.394240 33770 buffered-tuple-stream.cc:204] Setting attached_to_output_batch to true for page 0x18cc8880
I0823 16:08:32.394279 33770 buffered-plan-root-sink.cc:209] Returning rows = 32768
I0823 16:08:32.394289 33770 impala-hs2-server.cc:842] FetchResults(): #results=0 has_more=true
I0823 16:08:32.394300 33781 buffered-plan-root-sink.cc:77] f348799ab855e68e:697ddff100000000] Adding Batch
I0823 16:08:32.395067 33781 buffered-plan-root-sink.cc:77] f348799ab855e68e:697ddff100000000] Adding Batch
...
I0823 16:08:32.431181 33781 spillable-row-batch-queue.cc:79] f348799ab855e68e:697ddff100000000] SpillableRowBatchQueue about to start spilling BufferedTupleStream num_rows=1152120 rows_returned=557901 pinned=1 attach_on_read=1 closed=0
 bytes_pinned=102760448 has_write_iterator=1 write_page=0x18dfd510 has_read_iterator=1 read_page=0x18cc8880
 read_page_reservation=0 write_page_reservation=0
 # pages=50 pages=[
{<BufferPool::PageHandle> 0x18cc8880 CLOSED num_rows=12123 retrived_buffer=true attached_to_output_batch=true},
{<BufferPool::PageHandle> 0x1ba6e100 client: 0x13616b88/0x13b921c0 page: {<BufferPool::Page> 0x15927f40 len: 2097152 pin_count: 1 buf: <BufferPool::BufferHandle> 0x15927fb8 client: 0x13616b88/0x13b921c0 data: 0x27400000 len: 2097152} num_rows=12107 retrived_buffer=true attached_to_output_batch=false},
...
0823 16:08:32.431262 33781 buffered-tuple-stream.cc:292] f348799ab855e68e:697ddff100000000] Check failed: !page->attached_to_output_batch check failed for page = <BufferPool::PageHandle> 0x18cc8880 CLOSED num_rows=12123 retrived_buffer=true attached_to_output_batch=true{code}


> DCHECK(!page->attached_to_output_batch) in SpillableRowBatchQueue::AddBatch 
> ----------------------------------------------------------------------------
>
>                 Key: IMPALA-8890
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8890
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>    Affects Versions: Impala 3.4.0
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Blocker
>         Attachments: impalad.INFO, resolved.txt
>
>
> Full stack:
> {code}
> F0823 13:19:42.262142 60340 buffered-tuple-stream.cc:291] 6a4941285b46788d:68021ec600000000] Check failed: !page->attached_to_output_batch
> *** Check failure stack trace: ***
>     @          0x4c987cc  google::LogMessage::Fail()
>     @          0x4c9a071  google::LogMessage::SendToLog()
>     @          0x4c981a6  google::LogMessage::Flush()
>     @          0x4c9b76d  google::LogMessageFatal::~LogMessageFatal()
>     @          0x2917f78  impala::BufferedTupleStream::ExpectedPinCount()
>     @          0x29181ec  impala::BufferedTupleStream::UnpinPageIfNeeded()
>     @          0x291b27b  impala::BufferedTupleStream::UnpinStream()
>     @          0x297d429  impala::SpillableRowBatchQueue::AddBatch()
>     @          0x25d5537  impala::BufferedPlanRootSink::Send()
>     @          0x207e94c  impala::FragmentInstanceState::ExecInternal()
>     @          0x207afac  impala::FragmentInstanceState::Exec()
>     @          0x208e854  impala::QueryState::ExecFInstance()
>     @          0x208cb21  _ZZN6impala10QueryState15StartFInstancesEvENKUlvE_clEv
>     @          0x2090536  _ZN5boost6detail8function26void_function_obj_invoker0IZN6impala10QueryState15StartFInstancesEvEUlvE_vE6invokeERNS1_15function_bufferE
>     @          0x1e9830b  boost::function0<>::operator()()
>     @          0x23e2d38  impala::Thread::SuperviseThread()
>     @          0x23eb0bc  boost::_bi::list5<>::operator()<>()
>     @          0x23eafe0  boost::_bi::bind_t<>::operator()()
>     @          0x23eafa3  boost::detail::thread_data<>::run()
>     @          0x3bc1629  thread_proxy
>     @     0x7f920a3786b9  start_thread
>     @     0x7f9206b5741c  clone
> {code}
> Happened once while I was running a full table scan of {{tpch_parquet.orders}} via JDBC (client was running on another EC2 machine). This was running on top of IMPALA-8819 with a fetch size of 32768.
> Attached full logs and mini-dump stack.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org