Posted to common-dev@hadoop.apache.org by "Arpit Agarwal (JIRA)" <ji...@apache.org> on 2014/08/12 07:12:12 UTC

[jira] [Resolved] (HADOOP-10960) hadoop cause system crash with “soft lock” and “hard lock”

     [ https://issues.apache.org/jira/browse/HADOOP-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arpit Agarwal resolved HADOOP-10960.
------------------------------------

    Resolution: Invalid

Hadoop core has no kernel-mode components, so it cannot cause a kernel panic. You have likely hit a buggy device driver or a kernel bug.

Resolving as Invalid.

> hadoop cause system crash with “soft lock” and “hard lock”
> ----------------------------------------------------------
>
>                 Key: HADOOP-10960
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10960
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>         Environment: Red Hat RHEL 6.3, 6.4, 6.5
> jdk1.7.0_45
> hadoop2.2
>            Reporter: linbao111
>            Priority: Critical
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I am running Hadoop 2.2 on Red Hat 6.3-6.5, and all of my machines crash after a while. /var/log/messages repeatedly shows:
> Aug 11 06:30:42 jn4_73_128 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [jsvc:11508]
> Aug 11 06:30:42 jn4_73_128 kernel: Modules linked in: bridge stp llc iptable_filter ip_tables mptctl mptbase xfs exportfs power_meter microcode dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core sg bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif wmi mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
> Aug 11 06:30:42 jn4_73_128 kernel: CPU 1 
> Aug 11 06:30:42 jn4_73_128 kernel: Modules linked in: bridge stp llc iptable_filter ip_tables mptctl mptbase xfs exportfs power_meter microcode dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core sg bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif wmi mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
> Aug 11 06:30:42 jn4_73_128 kernel: 
> Aug 11 06:30:42 jn4_73_128 kernel: Pid: 11508, comm: jsvc Tainted: G        W  ---------------    2.6.32-279.el6.x86_64 #1 Dell Inc. PowerEdge R510/084YMW
> Aug 11 06:30:42 jn4_73_128 kernel: RIP: 0010:[<ffffffff8104d088>]  [<ffffffff8104d088>] wait_for_rqlock+0x28/0x40
> Aug 11 06:30:42 jn4_73_128 kernel: RSP: 0018:ffff8807786c3ee8  EFLAGS: 00000202
> Aug 11 06:30:42 jn4_73_128 kernel: RAX: 00000000f6e9f6e1 RBX: ffff8807786c3ee8 RCX: ffff880028216680
> Aug 11 06:30:42 jn4_73_128 kernel: RDX: 00000000fffff6e9 RSI: ffff88061cd29370 RDI: 0000000000000286
> Aug 11 06:30:42 jn4_73_128 kernel: RBP: ffffffff8100bc0e R08: 0000000000000001 R09: 0000000000000001
> Aug 11 06:30:42 jn4_73_128 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000286
> Aug 11 06:30:42 jn4_73_128 kernel: R13: ffff8807786c3eb8 R14: ffffffff810e0f6e R15: ffff8807786c3e48
> Aug 11 06:30:42 jn4_73_128 kernel: FS:  0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
> Aug 11 06:30:42 jn4_73_128 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Aug 11 06:30:42 jn4_73_128 kernel: CR2: 0000000000e5bd70 CR3: 0000000001a85000 CR4: 00000000000006e0
> Aug 11 06:30:42 jn4_73_128 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Aug 11 06:30:42 jn4_73_128 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Aug 11 06:30:42 jn4_73_128 kernel: Process jsvc (pid: 11508, threadinfo ffff8807786c2000, task ffff880c1def3500)
> Aug 11 06:30:42 jn4_73_128 kernel: Stack:
> Aug 11 06:30:42 jn4_73_128 kernel: ffff8807786c3f68 ffffffff8107091b 0000000000000000 ffff8807786c3f28
> Aug 11 06:30:42 jn4_73_128 kernel: <d> ffff880701735260 ffff880c1def39c8 ffff880c1def39c8 0000000000000000
> Aug 11 06:30:42 jn4_73_128 kernel: <d> ffff8807786c3f28 ffff8807786c3f28 ffff8807786c3f78 00007f092d0ad700
> Aug 11 06:30:42 jn4_73_128 kernel: Call Trace:
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8107091b>] ? do_exit+0x5ab/0x870
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff81070ce7>] ? sys_exit+0x17/0x20
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
> Aug 11 06:30:42 jn4_73_128 kernel: Code: ff ff 90 55 48 89 e5 0f 1f 44 00 00 48 c7 c0 80 66 01 00 65 48 8b 0c 25 b0 e0 00 00 0f ae f0 48 01 c1 eb 09 0f 1f 80 00 00 00 00 <f3> 90 8b 01 89 c2 c1 fa 10 66 39 c2 75 f2 c9 c3 0f 1f 84 00 00 
> Aug 11 06:30:42 jn4_73_128 kernel: Call Trace:
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8107091b>] ? do_exit+0x5ab/0x870
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff81070ce7>] ? sys_exit+0x17/0x20
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
> Eventually the machine crashed. Output of the crash utility on the resulting vmcore:
> crash /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux  /opt/crash/127.0.0.1-2014-08-10-09\:47\:38/vmcore
> crash 6.1.0-5.el6
> Copyright (C) 2002-2012  Red Hat, Inc.
> Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
> Copyright (C) 1999-2006  Hewlett-Packard Co
> Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
> Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
> Copyright (C) 2005, 2011  NEC Corporation
> Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
> Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
> This program is free software, covered by the GNU General Public License,
> and you are welcome to change it and/or distribute copies of it under
> certain conditions.  Enter "help copying" to see the conditions.
> This program has absolutely no warranty.  Enter "help warranty" for details.
> GNU gdb (GDB) 7.3.1
> Copyright (C) 2011 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-unknown-linux-gnu"...
> please wait... (determining panic task)         
> WARNING: active task ffff881071850040 on cpu 12 not found in PID hash
>       KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux
>     DUMPFILE: /opt/crash/127.0.0.1-2014-08-10-09:47:38/vmcore  [PARTIAL DUMP]
>         CPUS: 24
>         DATE: Sun Aug 10 09:47:32 2014
>       UPTIME: 7 days, 16:00:19
> LOAD AVERAGE: 11.01, 3.11, 1.08
>        TASKS: 724
>     NODENAME: master1.otocyon.com
>      RELEASE: 2.6.32-431.5.1.el6.x86_64
>      VERSION: #1 SMP Fri Jan 10 14:46:43 EST 2014
>      MACHINE: x86_64  (1895 Mhz)
>       MEMORY: 64 GB
>        PANIC: "Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0"
>          PID: 23976
>      COMMAND: "sh"
>         TASK: ffff881071850aa0  [THREAD_INFO: ffff880a05c80000]
>          CPU: 0
>        STATE: TASK_INTERRUPTIBLE (PANIC)
> crash> bt
> PID: 23976  TASK: ffff881071850aa0  CPU: 0   COMMAND: "sh"
>  #0 [ffff880028207b50] machine_kexec at ffffffff81038f3b
>  #1 [ffff880028207bb0] crash_kexec at ffffffff810c5d82
>  #2 [ffff880028207c80] panic at ffffffff8152751a
>  #3 [ffff880028207d00] watchdog_overflow_callback at ffffffff810e696d
>  #4 [ffff880028207d20] __perf_event_overflow at ffffffff8111c847
>  #5 [ffff880028207da0] perf_event_overflow at ffffffff8111ce14
>  #6 [ffff880028207db0] intel_pmu_handle_irq at ffffffff81022d87
>  #7 [ffff880028207e90] perf_event_nmi_handler at ffffffff8152bd69
>  #8 [ffff880028207ea0] notifier_call_chain at ffffffff8152d825
>  #9 [ffff880028207ee0] atomic_notifier_call_chain at ffffffff8152d88a
> #10 [ffff880028207ef0] notify_die at ffffffff810a153e
> #11 [ffff880028207f20] do_nmi at ffffffff8152b4eb
> This happened on machines from different vendors, and I have tried updating to the latest kernel from Red Hat. Can anyone with the same experience help?
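
When triaging a report like the one quoted above, it can help to summarize which processes and CPUs the kernel watchdog flagged before looking at drivers or kernel versions. The following is a minimal sketch in Python, not part of the original report: the function name find_soft_lockups and the regular expression are illustrative assumptions, written only against the "BUG: soft lockup" line format shown in the quoted /var/log/messages excerpt.

```python
import re

# Assumed format, taken from the quoted log excerpt:
#   "... kernel: BUG: soft lockup - CPU#1 stuck for 67s! [jsvc:11508]"
# The greedy ".+" lets the process name itself contain colons; the PID is
# whatever digits follow the last colon before the closing bracket.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft-lockup report found in the given log lines."""
    events = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            events.append({
                "cpu": int(match.group("cpu")),
                "seconds": int(match.group("secs")),
                "comm": match.group("comm"),
                "pid": int(match.group("pid")),
            })
    return events


if __name__ == "__main__":
    # Sample lines copied from the report; a real run would iterate over
    # an open log file instead of this list.
    sample = [
        "Aug 11 06:30:42 jn4_73_128 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [jsvc:11508]",
        "Aug 11 06:30:42 jn4_73_128 kernel: CPU 1 ",
    ]
    for event in find_soft_lockups(sample):
        print(event)
```

Such a summary only shows which user process was on the CPU when the watchdog fired. It does not by itself identify the faulty kernel code, which is consistent with the resolution comment that the cause is likely a driver or kernel bug rather than Hadoop.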



--
This message was sent by Atlassian JIRA
(v6.2#6252)