You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@kudu.apache.org by "Andrew Wong (Jira)" <ji...@apache.org> on 2021/03/23 20:16:00 UTC
[jira] [Updated] (KUDU-3268) Crash in TabletServerDiskErrorTest.TestRandomOpSequence

     [ https://issues.apache.org/jira/browse/KUDU-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wong updated KUDU-3268:
------------------------------
    Description: 
A pre-commit failed with the following crash when attempting to launch an op after stopping a replica:
{code:java}
I0323 18:15:01.078991 23854 maintenance_manager.cc:373] P c8a93089db0041f5930b9fb1832714ed: Scheduling CompactRowSetsOp(ffffffffffffffffffffffffffffffff): perf score=1.012452
I0323 18:15:01.079111 21067 tablet_server-test.cc:852] Tablet server responded with: timestamp: 6621279441214984192
I0323 18:15:01.079317 23789 maintenance_manager.cc:594] P c8a93089db0041f5930b9fb1832714ed: UndoDeltaBlockGCOp(ffffffffffffffffffffffffffffffff) complete. Timing: real 0.000s	user 0.000s	sys 0.000s Metrics: {"cfile_init":1,"lbm_read_time_us":73,"lbm_reads_lt_1ms":4}
E0323 18:15:01.080865 23788 cfile_reader.cc:591] Encountered corrupted CFile in filesystem block: 4124746176525068430
I0323 18:15:01.080960 23788 ts_tablet_manager.cc:1774] T ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed: failing tablet
I0323 18:15:01.080950 21067 tablet_server-test.cc:852] Tablet server responded with: timestamp: 6621279441223315456
I0323 18:15:01.081243 24138 tablet_replica.cc:324] T ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed: stopping tablet replica
I0323 18:15:01.081670 21067 tablet_server-test.cc:852] Tablet server responded with: error {
  code: TABLET_NOT_RUNNING
  status {
    code: ILLEGAL_STATE
    message: "Tablet not RUNNING: STOPPING"
  }
}
I0323 18:15:01.081777 21067 tablet_server-test.cc:890] Failure was caught by an op!
W0323 18:15:01.082907 23788 tablet_mm_ops.cc:176] T ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed: Compaction failed on ffffffffffffffffffffffffffffffff: Corruption: Flush to disk failed: checksum error on CFile block 4124746176525068430 at offset=1006 size=24: Checksum does not match: 3582029077 vs expected 3582029077
I0323 18:15:01.082957 23788 maintenance_manager.cc:594] P c8a93089db0041f5930b9fb1832714ed: CompactRowSetsOp(ffffffffffffffffffffffffffffffff) complete. Timing: real 0.004s	user 0.003s	sys 0.000s Metrics: {"cfile_cache_miss":3,"cfile_cache_miss_bytes":92,"delta_iterators_relevant":2,"dirs.queue_time_us":630,"dirs.run_cpu_time_us":368,"dirs.run_wall_time_us":2220,"lbm_read_time_us":54,"lbm_reads_lt_1ms":3,"lbm_write_time_us":168,"lbm_writes_lt_1ms":6,"num_input_rowsets":2,"spinlock_wait_cycles":1792,"tablet-open.queue_time_us":135,"thread_start_us":382,"threads_started":5}
I0323 18:15:01.083369 23854 maintenance_manager.cc:373] P c8a93089db0041f5930b9fb1832714ed: Scheduling CompactRowSetsOp(ffffffffffffffffffffffffffffffff): perf score=1.012452
*** Aborted at 1616523301 (unix time) try "date -d @1616523301" if you are using GNU date ***
I0323 18:15:01.083519 24138 raft_consensus.cc:2226] T ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed [term 1 LEADER]: Raft consensus shutting down.
I0323 18:15:01.083653 24138 raft_consensus.cc:2255] T ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed [term 1 FOLLOWER]: Raft consensus is shut down!
I0323 18:15:01.085090 21067 tablet_server-test.cc:894] Tablet was successfully failed
I0323 18:15:01.085439 21067 tablet_server.cc:166] TabletServer@127.0.0.1:0 shutting down...
PC: @     0x7ff97596de0d kudu::MaintenanceManager::LaunchOp()
*** SIGSEGV (@0x30) received by PID 21067 (TID 0x7ff96343b700) from PID 48; stack trace: ***
    @     0x7ff976846980 (unknown) at ??:0
    @     0x7ff97596de0d kudu::MaintenanceManager::LaunchOp() at ??:0
    @     0x7ff97596b538 _ZZN4kudu18MaintenanceManager18RunSchedulerThreadEvENKUlvE_clEv at ??:0
    @     0x7ff97596f124 _ZNSt17_Function_handlerIFvvEZN4kudu18MaintenanceManager18RunSchedulerThreadEvEUlvE_E9_M_invokeERKSt9_Any_data at ??:0
    @     0x7ff977e2bcf4 std::function<>::operator()() at ??:0
    @     0x7ff975a05e6e kudu::ThreadPool::DispatchThread() at ??:0
    @     0x7ff975a06757 _ZZN4kudu10ThreadPool12CreateThreadEvENKUlvE_clEv at ??:0
    @     0x7ff975a07e7b _ZNSt17_Function_handlerIFvvEZN4kudu10ThreadPool12CreateThreadEvEUlvE_E9_M_invokeERKSt9_Any_data at ??:0
    @     0x7ff977e2bcf4 std::function<>::operator()() at ??:0
    @     0x7ff9759f8913 kudu::Thread::SuperviseThread() at ??:0
    @     0x7ff97683b6db start_thread at ??:0
    @     0x7ff97388e71f clone at ??:0
{code}

Attached the full logs -- seems there's something unsafe about how we unregister ops (maybe from the fix for KUDU-3149?) when racing with the scheduler thread.


  was:
A pre-commit failed with the following crash when attempting to launch an op after stopping a replica:
{code:java}
I0323 18:15:01.079317 23789 maintenance_manager.cc:594] P c8a93089db0041f5930b9fb1832714ed: UndoDeltaBlockGCOp(ffffffffffffffffffffffffffffffff) complete. Timing: real 0.000s user 0.000s sys 0.000s Metrics: {"cfile_init":1,"lbm_read_time_us":73,"lbm_reads_lt_1ms":4}I0323 18:15:01.079317 23789 maintenance_manager.cc:594] P c8a93089db0041f5930b9fb1832714ed: UndoDeltaBlockGCOp(ffffffffffffffffffffffffffffffff) complete. Timing: real 0.000s user 0.000s sys 0.000s Metrics: {"cfile_init":1,"lbm_read_time_us":73,"lbm_reads_lt_1ms":4}E0323 18:15:01.080865 23788 cfile_reader.cc:591] Encountered corrupted CFile in filesystem block: 4124746176525068430I0323 18:15:01.080960 23788 ts_tablet_manager.cc:1774] T ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed: failing tabletI0323 18:15:01.080950 21067 tablet_server-test.cc:852] Tablet server responded with: timestamp: 6621279441223315456I0323 18:15:01.081243 24138 tablet_replica.cc:324] T ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed: stopping tablet replicaI0323 18:15:01.081670 21067 tablet_server-test.cc:852] Tablet server responded with: error {  code: TABLET_NOT_RUNNING  status {    code: ILLEGAL_STATE    message: "Tablet not RUNNING: STOPPING"  }}I0323 18:15:01.081777 21067 tablet_server-test.cc:890] Failure was caught by an op!W0323 18:15:01.082907 23788 tablet_mm_ops.cc:176] T ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed: Compaction failed on ffffffffffffffffffffffffffffffff: Corruption: Flush to disk failed: checksum error on CFile block 4124746176525068430 at offset=1006 size=24: Checksum does not match: 3582029077 vs expected 3582029077I0323 18:15:01.082957 23788 maintenance_manager.cc:594] P c8a93089db0041f5930b9fb1832714ed: CompactRowSetsOp(ffffffffffffffffffffffffffffffff) complete. Timing: real 0.004s user 0.003s sys 0.000s Metrics: {"cfile_cache_miss":3,"cfile_cache_miss_bytes":92,"delta_iterators_relevant":2,"dirs.queue_time_us":630,"dirs.run_cpu_time_us":368,"dirs.run_wall_time_us":2220,"lbm_read_time_us":54,"lbm_reads_lt_1ms":3,"lbm_write_time_us":168,"lbm_writes_lt_1ms":6,"num_input_rowsets":2,"spinlock_wait_cycles":1792,"tablet-open.queue_time_us":135,"thread_start_us":382,"threads_started":5}I0323 18:15:01.083369 23854 maintenance_manager.cc:373] P c8a93089db0041f5930b9fb1832714ed: Scheduling CompactRowSetsOp(ffffffffffffffffffffffffffffffff): perf score=1.012452*** Aborted at 1616523301 (unix time) try "date -d @1616523301" if you are using GNU date ***I0323 18:15:01.083519 24138 raft_consensus.cc:2226] T ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed [term 1 LEADER]: Raft consensus shutting down.I0323 18:15:01.083653 24138 raft_consensus.cc:2255] T ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed [term 1 FOLLOWER]: Raft consensus is shut down!I0323 18:15:01.085090 21067 tablet_server-test.cc:894] Tablet was successfully failedI0323 18:15:01.085439 21067 tablet_server.cc:166] TabletServer@127.0.0.1:0 shutting down...PC: @     0x7ff97596de0d kudu::MaintenanceManager::LaunchOp()*** SIGSEGV (@0x30) received by PID 21067 (TID 0x7ff96343b700) from PID 48; stack trace: ***    @     0x7ff976846980 (unknown) at ??:0    @     0x7ff97596de0d kudu::MaintenanceManager::LaunchOp() at ??:0    @     0x7ff97596b538 _ZZN4kudu18MaintenanceManager18RunSchedulerThreadEvENKUlvE_clEv at ??:0    @     0x7ff97596f124 _ZNSt17_Function_handlerIFvvEZN4kudu18MaintenanceManager18RunSchedulerThreadEvEUlvE_E9_M_invokeERKSt9_Any_data at ??:0    @     0x7ff977e2bcf4 std::function<>::operator()() at ??:0    @     0x7ff975a05e6e kudu::ThreadPool::DispatchThread() at ??:0    @     0x7ff975a06757 _ZZN4kudu10ThreadPool12CreateThreadEvENKUlvE_clEv at ??:0    @     0x7ff975a07e7b _ZNSt17_Function_handlerIFvvEZN4kudu10ThreadPool12CreateThreadEvEUlvE_E9_M_invokeERKSt9_Any_data at ??:0    @     0x7ff977e2bcf4 std::function<>::operator()() at ??:0    @     0x7ff9759f8913 kudu::Thread::SuperviseThread() at ??:0    @     0x7ff97683b6db start_thread at ??:0    @     0x7ff97388e71f clone at ??:0 {code}

Attached the full logs -- seems there's something unsafe about how we unregister ops (maybe from the fix for KUDU-3149?) when racing with the scheduler thread.



> Crash in TabletServerDiskErrorTest.TestRandomOpSequence
> -------------------------------------------------------
>
>                 Key: KUDU-3268
>                 URL: https://issues.apache.org/jira/browse/KUDU-3268
>             Project: Kudu
>          Issue Type: Bug
>          Components: test, tserver
>            Reporter: Andrew Wong
>            Priority: Major
>         Attachments: tablet_server-test.3.txt.gz
>
>
> A pre-commit failed with the following crash when attempting to launch an op after stopping a replica:
> {code:java}
> I0323 18:15:01.078991 23854 maintenance_manager.cc:373] P c8a93089db0041f5930b9fb1832714ed: Scheduling CompactRowSetsOp(ffffffffffffffffffffffffffffffff): perf score=1.012452
> I0323 18:15:01.079111 21067 tablet_server-test.cc:852] Tablet server responded with: timestamp: 6621279441214984192
> I0323 18:15:01.079317 23789 maintenance_manager.cc:594] P c8a93089db0041f5930b9fb1832714ed: UndoDeltaBlockGCOp(ffffffffffffffffffffffffffffffff) complete. Timing: real 0.000s	user 0.000s	sys 0.000s Metrics: {"cfile_init":1,"lbm_read_time_us":73,"lbm_reads_lt_1ms":4}
> E0323 18:15:01.080865 23788 cfile_reader.cc:591] Encountered corrupted CFile in filesystem block: 4124746176525068430
> I0323 18:15:01.080960 23788 ts_tablet_manager.cc:1774] T ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed: failing tablet
> I0323 18:15:01.080950 21067 tablet_server-test.cc:852] Tablet server responded with: timestamp: 6621279441223315456
> I0323 18:15:01.081243 24138 tablet_replica.cc:324] T ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed: stopping tablet replica
> I0323 18:15:01.081670 21067 tablet_server-test.cc:852] Tablet server responded with: error {
>   code: TABLET_NOT_RUNNING
>   status {
>     code: ILLEGAL_STATE
>     message: "Tablet not RUNNING: STOPPING"
>   }
> }
> I0323 18:15:01.081777 21067 tablet_server-test.cc:890] Failure was caught by an op!
> W0323 18:15:01.082907 23788 tablet_mm_ops.cc:176] T ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed: Compaction failed on ffffffffffffffffffffffffffffffff: Corruption: Flush to disk failed: checksum error on CFile block 4124746176525068430 at offset=1006 size=24: Checksum does not match: 3582029077 vs expected 3582029077
> I0323 18:15:01.082957 23788 maintenance_manager.cc:594] P c8a93089db0041f5930b9fb1832714ed: CompactRowSetsOp(ffffffffffffffffffffffffffffffff) complete. Timing: real 0.004s	user 0.003s	sys 0.000s Metrics: {"cfile_cache_miss":3,"cfile_cache_miss_bytes":92,"delta_iterators_relevant":2,"dirs.queue_time_us":630,"dirs.run_cpu_time_us":368,"dirs.run_wall_time_us":2220,"lbm_read_time_us":54,"lbm_reads_lt_1ms":3,"lbm_write_time_us":168,"lbm_writes_lt_1ms":6,"num_input_rowsets":2,"spinlock_wait_cycles":1792,"tablet-open.queue_time_us":135,"thread_start_us":382,"threads_started":5}
> I0323 18:15:01.083369 23854 maintenance_manager.cc:373] P c8a93089db0041f5930b9fb1832714ed: Scheduling CompactRowSetsOp(ffffffffffffffffffffffffffffffff): perf score=1.012452
> *** Aborted at 1616523301 (unix time) try "date -d @1616523301" if you are using GNU date ***
> I0323 18:15:01.083519 24138 raft_consensus.cc:2226] T ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed [term 1 LEADER]: Raft consensus shutting down.
> I0323 18:15:01.083653 24138 raft_consensus.cc:2255] T ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed [term 1 FOLLOWER]: Raft consensus is shut down!
> I0323 18:15:01.085090 21067 tablet_server-test.cc:894] Tablet was successfully failed
> I0323 18:15:01.085439 21067 tablet_server.cc:166] TabletServer@127.0.0.1:0 shutting down...
> PC: @     0x7ff97596de0d kudu::MaintenanceManager::LaunchOp()
> *** SIGSEGV (@0x30) received by PID 21067 (TID 0x7ff96343b700) from PID 48; stack trace: ***
>     @     0x7ff976846980 (unknown) at ??:0
>     @     0x7ff97596de0d kudu::MaintenanceManager::LaunchOp() at ??:0
>     @     0x7ff97596b538 _ZZN4kudu18MaintenanceManager18RunSchedulerThreadEvENKUlvE_clEv at ??:0
>     @     0x7ff97596f124 _ZNSt17_Function_handlerIFvvEZN4kudu18MaintenanceManager18RunSchedulerThreadEvEUlvE_E9_M_invokeERKSt9_Any_data at ??:0
>     @     0x7ff977e2bcf4 std::function<>::operator()() at ??:0
>     @     0x7ff975a05e6e kudu::ThreadPool::DispatchThread() at ??:0
>     @     0x7ff975a06757 _ZZN4kudu10ThreadPool12CreateThreadEvENKUlvE_clEv at ??:0
>     @     0x7ff975a07e7b _ZNSt17_Function_handlerIFvvEZN4kudu10ThreadPool12CreateThreadEvEUlvE_E9_M_invokeERKSt9_Any_data at ??:0
>     @     0x7ff977e2bcf4 std::function<>::operator()() at ??:0
>     @     0x7ff9759f8913 kudu::Thread::SuperviseThread() at ??:0
>     @     0x7ff97683b6db start_thread at ??:0
>     @     0x7ff97388e71f clone at ??:0
> {code}
> Attached the full logs -- seems there's something unsafe about how we unregister ops (maybe from the fix for KUDU-3149?) when racing with the scheduler thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)