Posted to issues@kudu.apache.org by "Andrew Wong (Jira)" <ji...@apache.org> on 2020/06/18 19:28:00 UTC

[jira] [Created] (KUDU-3151) segfault when starting repairing log block container with a missing LBM container data file

Andrew Wong created KUDU-3151:
---------------------------------

             Summary: segfault when starting repairing log block container with a missing LBM container data file
                 Key: KUDU-3151
                 URL: https://issues.apache.org/jira/browse/KUDU-3151
             Project: Kudu
          Issue Type: Bug
          Components: fs
            Reporter: Andrew Wong
         Attachments: metadump.txt

We upgraded a cluster from 1.7 to 1.12 and saw the following segfault on one node:
{code:java}
*** SIGSEGV (@0x20f2008) received by PID 35899 (TID 0x7ff7e40cc700) from PID 34545672; stack trace: ***
    @     0x7ff7f2a395d0 (unknown)
    @           0x9fe02e std::_Sp_counted_base<>::_M_release()
    @          0x2049f77 kudu::fs::LogBlockManager::Repair()
    @          0x204ae45 kudu::fs::LogBlockManager::RepairTask()
    @          0x228e67e kudu::ThreadPool::DispatchThread()
    @          0x228778f kudu::Thread::SuperviseThread()
    @     0x7ff7f2a31dd5 start_thread
    @     0x7ff7f0d0902d __clone
{code}
When running {{kudu fs check}} we saw the following logs:
{code:java}
I0617 09:17:37.681373 147811 fs_manager.cc:433] Time spent opening block manager: real 10.871s	user 0.215s	sys 0.162s
Not found: Could not open container 74e7b95f8ccb4c7b98e52dc48049e967: /data/5/kudu/tablet/data/data/74e7b95f8ccb4c7b98e52dc48049e967.data: No such file or directory (error 2)
{code}
Upon inspecting the files, we found that 74e7b95f8ccb4c7b98e52dc48049e967.data was indeed missing, while the metadata file 74e7b95f8ccb4c7b98e52dc48049e967.metadata was present and non-empty (it records more creates than deletes; see the attached metadump.txt).
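
The manual inspection described here could be scripted roughly as follows. This is only a sketch: it assumes container files in a data directory are named <id>.data and <id>.metadata, as in the log above, and the function name is hypothetical, not part of Kudu.

```python
import os


def find_containers_missing_data(data_dir):
    """Return IDs of containers that have a .metadata file but no .data file.

    Hypothetical helper; assumes the <id>.data / <id>.metadata naming
    seen in the log output above.
    """
    names = set(os.listdir(data_dir))
    missing = []
    for name in sorted(names):
        stem, ext = os.path.splitext(name)
        # A metadata file without its sibling data file is the
        # condition observed in this report.
        if ext == ".metadata" and stem + ".data" not in names:
            missing.append(stem)
    return missing
```

For example, find_containers_missing_data("/data/5/kudu/tablet/data/data") would list the affected container IDs for the directory named in the error. The sketch only reads directory listings and does not modify or delete anything.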

We were able to delete the metadata file, and I don't think we saw any failed tablets upon doing so (failures could surface if a tablet were unable to find blocks it needs at startup, e.g. PK blocks when reading min/max keys).

It's possible the metadata was left over from an LBM compaction, but the exact cause isn't clear so far. It's also unclear whether the data file went missing before or after the upgrade, as we didn't run {{kudu fs check}} before upgrading.
