Posted to issues-all@impala.apache.org by "Sahil Takiar (Jira)" <ji...@apache.org> on 2019/08/23 23:39:00 UTC
[jira] [Comment Edited] (IMPALA-8890)
DCHECK(!page->attached_to_output_batch) in SpillableRowBatchQueue::AddBatch
[ https://issues.apache.org/jira/browse/IMPALA-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914715#comment-16914715 ]
Sahil Takiar edited comment on IMPALA-8890 at 8/23/19 11:38 PM:
----------------------------------------------------------------
Can reproduce this pretty consistently now (same setup as above, but with some additional logging and the mini cluster changed to 1 dedicated coordinator and 3 executors). Here is what I have found so far:
* It doesn't *look* like a race condition
* It seems to only happen when:
** Rows are being added to the {{BufferedTupleStream}} successfully, until: {{SpillableRowBatchQueue::GetBatch}} >>> {{BufferedTupleStream::GetNext}} >>> {{read_page_->AttachBufferToBatch}}
** Then, without any additional calls to {{BufferedTupleStream::GetNext}} (I think this part may be relevant because everything works if there are additional calls to {{GetNext}}), {{SpillableRowBatchQueue::AddBatch}} >>> {{BufferedTupleStream::AddRow}} is called repeatedly (for multiple {{RowBatch}}-es)
** This continues until eventually {{BufferedTupleStream::AddRow}} returns false (presumably because the reservation limits have been hit), and then {{BufferedTupleStream::UnpinStream}} is called, which eventually hits the DCHECK above
* The DCHECK is hit because:
** Looking at the state of the {{Page}}-s in the {{BufferedTupleStream}}, it looks like the last call to {{BufferedTupleStream::GetNext}} calls {{BufferedTupleStream::AttachBufferToBatch}} on the {{read_page_}}, which sets {{attached_to_output_batch}} to true for that {{Page}}
** Then {{UnpinStream}} is called, iterates through all the {{pages_}}, sees that the {{read_page_}} has {{attached_to_output_batch}} set to true, and fails (I confirmed through logging that it fails specifically on the {{read_page_}} that had {{attached_to_output_batch}} set to true above)
** *If* there had been an additional call to {{GetNext}}, then {{NextReadPage()}} would have been called, which calls {{pages_.pop_front()}} and removes the {{read_page_}} with {{attached_to_output_batch}} set to true from the list of pages (see the sketch below)
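To make the above concrete, here is a heavily condensed C++ sketch of the control flow as I understand it. This is *not* the actual Impala source: the method and member names match the real code, but the bodies are stripped down to the bookkeeping that matters here, and the DCHECK is modeled with a plain {{assert}}:
{code:cpp}
#include <cassert>
#include <list>

// Minimal stand-in for BufferedTupleStream's page bookkeeping.
struct Page {
  bool attached_to_output_batch = false;  // set when the buffer is handed off
};

struct StreamSketch {
  std::list<Page> pages_;
  Page* read_page_ = nullptr;  // the page GetNext() is currently reading

  // GetNext() with attach_on_read: hands the read page's buffer to the output
  // RowBatch and marks the page. Crucially, the page stays in pages_ until
  // the *next* GetNext() call advances past it.
  void AttachBufferToBatch() { read_page_->attached_to_output_batch = true; }

  // Called at the start of the following GetNext(): drops the attached page,
  // so UnpinStream() never sees it.
  void NextReadPage() {
    pages_.pop_front();
    read_page_ = pages_.empty() ? nullptr : &pages_.front();
  }

  // Called from SpillableRowBatchQueue::AddBatch() once AddRow() fails;
  // models the DCHECK hit via ExpectedPinCount()/UnpinPageIfNeeded().
  void UnpinStream() {
    for (Page& page : pages_) {
      assert(!page.attached_to_output_batch);  // the failing DCHECK
      // ... unpin the page ...
    }
  }
};
{code}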
So *I think* this is a bug in {{BufferedTupleStream}}, unless there is something off with the way {{SpillableRowBatchQueue}} is using {{BufferedTupleStream}}; wondering what [~tarmstrong@cloudera.com] thinks?
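And the failing interleaving as a tiny driver against the sketch above (again illustrative only; in the real setup the reader is the HS2 fetch thread, 33770 in the logs below, and the writer is the fragment instance thread, 33781):
{code:cpp}
int main() {
  StreamSketch stream;
  stream.pages_.resize(2);
  stream.read_page_ = &stream.pages_.front();

  // Reader: SpillableRowBatchQueue::GetBatch() -> GetNext() attaches the
  // read page's buffer to the output batch.
  stream.AttachBufferToBatch();

  // Flip to true to model the working case, where one more GetNext() runs
  // before the writer exhausts its reservation.
  bool extra_get_next = false;
  if (extra_get_next) stream.NextReadPage();

  // Writer: AddBatch() -> AddRow() succeeds repeatedly, then returns false,
  // so the queue unpins the stream:
  stream.UnpinStream();  // asserts iff extra_get_next == false
  return 0;
}
{code}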
Here is a snippet of the modified logs that may show things more clearly:
{code:java}
I0823 16:08:32.386766 33770 buffered-plan-root-sink.cc:169] Getting Batch
I0823 16:08:32.388576 33770 buffered-plan-root-sink.cc:169] Getting Batch
...
I0823 16:08:32.394234 33770 buffered-tuple-stream.cc:804] Calling AttachBufferToBatch
I0823 16:08:32.394240 33770 buffered-tuple-stream.cc:204] Setting attached_to_output_batch to true for page 0x18cc8880
I0823 16:08:32.394279 33770 buffered-plan-root-sink.cc:209] Returning rows = 32768
I0823 16:08:32.394289 33770 impala-hs2-server.cc:842] FetchResults(): #results=0 has_more=true
I0823 16:08:32.394300 33781 buffered-plan-root-sink.cc:77] f348799ab855e68e:697ddff100000000] Adding Batch
I0823 16:08:32.395067 33781 buffered-plan-root-sink.cc:77] f348799ab855e68e:697ddff100000000] Adding Batch
...
I0823 16:08:32.431181 33781 spillable-row-batch-queue.cc:79] f348799ab855e68e:697ddff100000000] SpillableRowBatchQueue about to start spilling BufferedTupleStream num_rows=1152120 rows_returned=557901 pinned=1 attach_on_read=1 closed=0
bytes_pinned=102760448 has_write_iterator=1 write_page=0x18dfd510 has_read_iterator=1 read_page=0x18cc8880
read_page_reservation=0 write_page_reservation=0
# pages=50 pages=[
{<BufferPool::PageHandle> 0x18cc8880 CLOSED num_rows=12123 retrived_buffer=true attached_to_output_batch=true},
{<BufferPool::PageHandle> 0x1ba6e100 client: 0x13616b88/0x13b921c0 page: {<BufferPool::Page> 0x15927f40 len: 2097152 pin_count: 1 buf: <BufferPool::BufferHandle> 0x15927fb8 client: 0x13616b88/0x13b921c0 data: 0x27400000 len: 2097152} num_rows=12107 retrived_buffer=true attached_to_output_batch=false},
...
I0823 16:08:32.431262 33781 buffered-tuple-stream.cc:292] f348799ab855e68e:697ddff100000000] Check failed: !page->attached_to_output_batch check failed for page = <BufferPool::PageHandle> 0x18cc8880 CLOSED num_rows=12123 retrived_buffer=true attached_to_output_batch=true{code}
> DCHECK(!page->attached_to_output_batch) in SpillableRowBatchQueue::AddBatch
> ----------------------------------------------------------------------------
>
> Key: IMPALA-8890
> URL: https://issues.apache.org/jira/browse/IMPALA-8890
> Project: IMPALA
> Issue Type: Sub-task
> Components: Backend
> Affects Versions: Impala 3.4.0
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Priority: Blocker
> Attachments: impalad.INFO, resolved.txt
>
>
> Full stack:
> {code}
> F0823 13:19:42.262142 60340 buffered-tuple-stream.cc:291] 6a4941285b46788d:68021ec600000000] Check failed: !page->attached_to_output_batch
> *** Check failure stack trace: ***
> @ 0x4c987cc google::LogMessage::Fail()
> @ 0x4c9a071 google::LogMessage::SendToLog()
> @ 0x4c981a6 google::LogMessage::Flush()
> @ 0x4c9b76d google::LogMessageFatal::~LogMessageFatal()
> @ 0x2917f78 impala::BufferedTupleStream::ExpectedPinCount()
> @ 0x29181ec impala::BufferedTupleStream::UnpinPageIfNeeded()
> @ 0x291b27b impala::BufferedTupleStream::UnpinStream()
> @ 0x297d429 impala::SpillableRowBatchQueue::AddBatch()
> @ 0x25d5537 impala::BufferedPlanRootSink::Send()
> @ 0x207e94c impala::FragmentInstanceState::ExecInternal()
> @ 0x207afac impala::FragmentInstanceState::Exec()
> @ 0x208e854 impala::QueryState::ExecFInstance()
> @ 0x208cb21 _ZZN6impala10QueryState15StartFInstancesEvENKUlvE_clEv
> @ 0x2090536 _ZN5boost6detail8function26void_function_obj_invoker0IZN6impala10QueryState15StartFInstancesEvEUlvE_vE6invokeERNS1_15function_bufferE
> @ 0x1e9830b boost::function0<>::operator()()
> @ 0x23e2d38 impala::Thread::SuperviseThread()
> @ 0x23eb0bc boost::_bi::list5<>::operator()<>()
> @ 0x23eafe0 boost::_bi::bind_t<>::operator()()
> @ 0x23eafa3 boost::detail::thread_data<>::run()
> @ 0x3bc1629 thread_proxy
> @ 0x7f920a3786b9 start_thread
> @ 0x7f9206b5741c clone
> {code}
> Happened once while I was running a full table scan of {{tpch_parquet.orders}} via JDBC (client was running on another EC2 machine). This was running on top of IMPALA-8819 with a fetch size of 32768.
> Attached full logs and mini-dump stack.