Posted to dev@doris.apache.org by 刘波 <27...@qq.com.INVALID> on 2022/03/04 04:24:56 UTC

BE exits abnormally

Dear developers,

Several of our Doris BE nodes crash frequently. Resource usage looks normal (see the attachment for details), the logs show no sign of an OOM, and we have not been able to locate the fault. We would appreciate your help. Details follow.

Time: 2022-03-04 10:54:43
Doris error: detailMessage = tablet 45664199 has few replicas: 1, alive backends: [10004]
Environment: Huawei Cloud, CentOS 8 64-bit, 16 cores, 64 GB

System log: less /var/log/messages (not an OOM)
Mar  4 10:54:38 narwal-doris-be-0004 systemd[1]: Started Process Core Dump (PID 1881819/UID 0).
Mar  4 10:54:41 narwal-doris-be-0004 systemd-coredump[1881820]: Core file was truncated to 2147483648 bytes.
Mar  4 10:55:00 narwal-doris-be-0004 systemd-coredump[1881820]: Process 2800630 (palo_be) of user 0 dumped core.#012#012Stack trace of thread 1877601:#012#0  0x00000000039480a2 memcpy (/mnt/be/lib/palo_be)#012#012Stack trace of thread 2800630:#012#0  0x00007fccb330efc8 n/a (n/a)
Mar  4 10:55:01 narwal-doris-be-0004 systemd[1]: systemd-coredump@1-1881819-0.service: Succeeded.
Mar  4 10:55:03 narwal-doris-be-0004 systemd[1]: session-12.scope: Succeeded.

Dump info: coredumpctl info 2800630
           PID: 2800630 (palo_be)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Fri 2022-03-04 10:54:38 CST (1h 10min ago)
  Command Line: /mnt/be/lib/palo_be
    Executable: /mnt/be/lib/palo_be
 Control Group: /
         Slice: -.slice
       Boot ID: ca3ef395d7c547d2aecb1c251097066f
    Machine ID: 501f93b5c19d4ca38db845c29176e3c5
      Hostname: narwal-doris-be-0004
       Storage: /var/lib/systemd/coredump/core.palo_be.0.ca3ef395d7c547d2aecb1c251097066f.2800630.1646362478000000.lz4 (truncated)
       Message: Process 2800630 (palo_be) of user 0 dumped core.

                Stack trace of thread 1877601:
                #0  0x00000000039480a2 memcpy (/mnt/be/lib/palo_be)

                Stack trace of thread 2800630:
                #0  0x00007fccb330efc8 n/a (n/a)
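
The `#012` sequences in the systemd-coredump log line above are rsyslog's octal escape for a newline, which is why the two stack traces appear fused onto a single line. A minimal Python sketch for expanding them, assuming rsyslog's default escaping of control characters as `#` plus three octal digits (the function name `unescape_syslog` is illustrative and not part of any tool):

```python
import re

# rsyslog writes control characters (codes below 32) as '#' plus three
# octal digits, so a newline inside a message shows up as "#012".
# Frame markers such as "#0  0x..." are left alone because they are not
# followed by three octal digits.
_ESCAPE = re.compile(r"#(0[0-3][0-7])")


def unescape_syslog(line):
    """Turn rsyslog octal escapes such as #012 back into real characters."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), line)


raw = (
    "Process 2800630 (palo_be) of user 0 dumped core."
    "#012#012Stack trace of thread 1877601:"
    "#012#0  0x00000000039480a2 memcpy (/mnt/be/lib/palo_be)"
    "#012#012Stack trace of thread 2800630:"
    "#012#0  0x00007fccb330efc8 n/a (n/a)"
)

decoded = unescape_syslog(raw)
print(decoded)
```

Decoded this way, the log entry carries the same two single-frame traces that `coredumpctl info` prints, so it adds no information beyond the dump summary.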

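Re: BE exits abnormally

Posted by 41108453 <41...@qq.com.INVALID>.
We will probably need you to provide the core file so that it can be debugged with gdb. What is your WeChat ID? I will add you on WeChat.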

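Re: BE exits abnormally

Posted by 刘波 <27...@qq.com.INVALID>.
Dear developers,
The gdb output is shown in the image below. The key information is signal SIGSEGV, but I do not know how to analyze the root cause any further. Could you please take a look? Thank you!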
------------------&nbsp;原始邮件&nbsp;------------------
发件人:                                                                                                                        "dev"                                                                                    <wangbo13131@gmail.com&gt;;
发送时间:&nbsp;2022年3月4日(星期五) 中午1:31
收件人:&nbsp;"dev"<dev@doris.apache.org&gt;;

主题:&nbsp;Re: be异常退出



目前的堆栈看起来不足以支撑做出判断;
如果线上有开core dump的话,可以用gdb palo_be core_dump文件看看堆栈
后者看下be.out是否进程怪盗时的堆栈

刘波 <270309321@qq.com.invalid&gt; 于2022年3月4日周五 12:25写道:

&gt; 尊敬的开发者,您好:
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 我们的doris
&gt; be节点经常会挂掉其中几个,经分析资源情况正常,具体信息可参见附件,日志层面未发现是OOM,未能定位出异常,请求协助,具体信息如下:
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; *时间: 2022-03-04 10:54:43*
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; doris报错*:detailMessage = tablet 45664199 has few replicas: 1,
&gt; alive backends: [10004]*
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 环境信息:华为云cetnos 8 64位,16核64G
&gt;
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 系统日志: less /var/log/message(*非OOM*)
&gt;
&gt; Mar&nbsp; 4 10:54:38 narwal-doris-be-0004 systemd[1]: Started Process Core Dump
&gt; (PID 1881819/UID 0).
&gt;
&gt; Mar&nbsp; 4 10:54:41 narwal-doris-be-0004 systemd-coredump[1881820]: Core file
&gt; was truncated to 2147483648 bytes.
&gt;
&gt; Mar&nbsp; 4 10:55:00 narwal-doris-be-0004 systemd-coredump[1881820]: Process
&gt; 2800630 (palo_be) of user 0 dumped core.#012#012Stack trace of thread
&gt; 1877601:#012#0&nbsp; 0x00000000039480a2 memcpy
&gt;
&gt;&nbsp; (/mnt/be/lib/palo_be)#012#012Stack trace of thread 2800630:#012#0
&gt; 0x00007fccb330efc8 n/a (n/a)
&gt;
&gt; Mar&nbsp; 4 10:55:01 narwal-doris-be-0004 systemd[1]:
&gt; systemd-coredump@1-1881819-0.service: Succeeded.
&gt;
&gt; Mar&nbsp; 4 10:55:03 narwal-doris-be-0004 systemd[1]: session-12.scope:
&gt; Succeeded.
&gt;
&gt;
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; dump信息:*coredumpctl info 2800630*
&gt;
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; PID: 2800630 (palo_be)
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; UID: 0 (root)
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; GID: 0 (root)
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Signal: 11 (SEGV)
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Timestamp: Fri 2022-03-04 10:54:38 CST (1h 10min ago)
&gt;&nbsp;&nbsp; Command Line: /mnt/be/lib/palo_be
&gt;&nbsp;&nbsp;&nbsp;&nbsp; Executable: /mnt/be/lib/palo_be
&gt;&nbsp; Control Group: /
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Slice: -.slice
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Boot ID: ca3ef395d7c547d2aecb1c251097066f
&gt;&nbsp;&nbsp;&nbsp;&nbsp; Machine ID: 501f93b5c19d4ca38db845c29176e3c5
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Hostname: narwal-doris-be-0004
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Storage:
&gt; /var/lib/systemd/coredump/core.palo_be.0.ca3ef395d7c547d2aecb1c251097066f.2800630.1646362478000000.lz4
&gt; (truncated)
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Message: Process 2800630 (palo_be) of user 0 dumped core.
&gt;
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Stack trace of thread 1877601:
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #0&nbsp; 0x00000000039480a2 memcpy (/mnt/be/lib/palo_be)
&gt;
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Stack trace of thread 2800630:
&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #0&nbsp; 0x00007fccb330efc8 n/a (n/a)
&gt;
&gt;
&gt;
&gt;
&gt;
&gt; ---------------------------------------------------------------------
&gt; To unsubscribe, e-mail: dev-unsubscribe@doris.apache.org
&gt; For additional commands, e-mail: dev-help@doris.apache.org



-- 
王博&nbsp; Wang Bo

Re: BE exits abnormally

Posted by 王博 <wa...@gmail.com>.
When a crash happens, you can find a stack trace like the following in be.out. This is the kind of information that is useful:

PC: @          0x25fd94b tcmalloc::CentralFreeList::FetchFromOneSpans()
*** SIGSEGV (@0x0) received by PID 71566 (TID 0x7f46785f8700) from PID 0; stack trace: ***
    @     0x7f46e6dea5d0 (unknown)
    @          0x25fd94b tcmalloc::CentralFreeList::FetchFromOneSpans()
    @          0x25fdc1c tcmalloc::CentralFreeList::FetchFromOneSpansSafe()
    @          0x25fdd17 tcmalloc::CentralFreeList::RemoveRange()
    @          0x260a1e3 tcmalloc::ThreadCache::FetchFromCentralCache()
    @          0x23c4818 google::protobuf::internal::RepeatedPtrFieldBase::InternalExtend()
    @           0xdaac98 doris::ColumnWriter::finalize()
    @           0xdb5038 doris::DoubleColumnWriterBase<>::finalize()
    @           0xd91adb doris::SegmentWriter::_make_file_header()
    @           0xd9254b doris::SegmentWriter::finalize()
    @           0xd67e76 doris::ColumnDataWriter::_finalize_segment()
    @           0xd694de doris::ColumnDataWriter::finalize()
    @           0xd2db3c doris::SchemaChangeDirectly::process()
    @           0xd3043c doris::SchemaChangeHandler::_alter_table()
    @           0xd340f9 doris::SchemaChangeHandler::_do_alter_table()
    @           0xd35283 doris::SchemaChangeHandler::process_alter_table()
    @           0xcaa28b doris::OLAPEngine::schema_change()
    @          0x11bf8be doris::TaskWorkerPool::_alter_table()
    @          0x11c9a55 doris::TaskWorkerPool::_alter_table_worker_thread_callback()
    @     0x7f46e6de2dd5 start_thread
    @     0x7f46e61e802d __clone

Alternatively, you can use gdb to get a more detailed, code-level stack trace.

I do not see any useful information in the stack traces you have provided so far.
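
A be.out trace like the example above can be reduced to its frames programmatically, for instance to find the innermost `doris::` frame when comparing several crashes. The sketch below is illustrative only: `parse_frames` and `first_doris_frame` are hypothetical helper names, and the regular expression assumes the `@ <address> <symbol>` layout shown in this message, including symbols that a mail client wrapped onto the next line.

```python
import re

# Matches frames such as
#   "    @           0xdaac98 doris::ColumnWriter::finalize()"
# The symbol part may be empty when it was wrapped onto the next line.
_FRAME = re.compile(r"^\s*@\s+(0x[0-9a-f]+)\s*(.*)$")


def parse_frames(text):
    """Return (address, symbol) pairs, rejoining symbols wrapped onto the next line."""
    frames = []
    for line in text.splitlines():
        match = _FRAME.match(line)
        if match:
            frames.append((match.group(1), match.group(2).strip()))
        elif line.strip() and frames and not frames[-1][1]:
            # Continuation line: the previous frame had an address but no symbol.
            frames[-1] = (frames[-1][0], line.strip())
    return frames


def first_doris_frame(frames):
    """Return the innermost frame whose symbol is in the doris:: namespace, or None."""
    return next((f for f in frames if f[1].startswith("doris::")), None)


sample = """\
    @     0x7f46e6dea5d0 (unknown)
    @          0x25fd94b tcmalloc::CentralFreeList::FetchFromOneSpans()
    @          0x23c4818
google::protobuf::internal::RepeatedPtrFieldBase::InternalExtend()
    @           0xdaac98 doris::ColumnWriter::finalize()
    @           0xdb5038 doris::DoubleColumnWriterBase<>::finalize()
"""

frames = parse_frames(sample)
print(first_doris_frame(frames))
```

In the example trace the innermost Doris frame is `doris::ColumnWriter::finalize()`, reached from the schema change worker; the frames above it belong to protobuf and tcmalloc.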

-- 
王博  Wang Bo

Re: BE exits abnormally

Posted by 刘波 <27...@qq.com.INVALID>.
Dear developers,
The BE crashed again today. The be.out output is as follows:

The gdb result is as follows:

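Re: BE exits abnormally

Posted by 王博 <wa...@gmail.com>.
The current stack trace does not look sufficient to make a judgment.
If core dumps are enabled in production, you can inspect the stack with gdb, passing it palo_be and the core dump file.
Alternatively, check whether be.out contains the stack trace from when the process crashed.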
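
The gdb suggestion above can also be run non-interactively. The following Python sketch only assembles the command line: `-batch` and `-ex` are standard gdb options, while the helper name `gdb_backtrace_command` and the core file path `/tmp/palo_be.core` are hypothetical placeholders. The core reported in this thread is stored lz4-compressed and marked as truncated, so it would need extracting first and a backtrace from it may be incomplete.

```python
import shlex


def gdb_backtrace_command(executable, core_file):
    """Build a non-interactive gdb call that prints every thread's backtrace."""
    return [
        "gdb",
        "-batch",                      # exit once the -ex commands have run
        "-ex", "thread apply all bt",  # backtrace for all threads
        executable,
        core_file,
    ]


# The executable path comes from the coredumpctl output in this thread;
# the core file path is a hypothetical location for the extracted core.
cmd = gdb_backtrace_command("/mnt/be/lib/palo_be", "/tmp/palo_be.core")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` on the affected host, and the output attached to the thread as text rather than as a screenshot.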

-- 
王博  Wang Bo