Posted to common-dev@hadoop.apache.org by "Arpit Agarwal (JIRA)" <ji...@apache.org> on 2014/08/12 07:12:12 UTC
[jira] [Resolved] (HADOOP-10960) hadoop cause system crash with “soft lock” and “hard lock”
[ https://issues.apache.org/jira/browse/HADOOP-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arpit Agarwal resolved HADOOP-10960.
------------------------------------
Resolution: Invalid
Hadoop core has no kernel-mode components, so it cannot cause a kernel panic. You most likely have a buggy device driver or have hit a kernel bug.
Resolving as Invalid.
> hadoop cause system crash with “soft lock” and “hard lock”
> ----------------------------------------------------------
>
> Key: HADOOP-10960
> URL: https://issues.apache.org/jira/browse/HADOOP-10960
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 2.2.0
> Environment: Red Hat RHEL 6.3, 6.4, 6.5
> jdk1.7.0_45
> hadoop2.2
> Reporter: linbao111
> Priority: Critical
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> I am running Hadoop 2.2 on Red Hat 6.3-6.5, and all of my machines crashed after a while. /var/log/messages repeatedly shows:
> Aug 11 06:30:42 jn4_73_128 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [jsvc:11508]
> Aug 11 06:30:42 jn4_73_128 kernel: Modules linked in: bridge stp llc iptable_filter ip_tables mptctl mptbase xfs exportfs power_meter microcode dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core sg bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif wmi mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
> Aug 11 06:30:42 jn4_73_128 kernel: CPU 1
> Aug 11 06:30:42 jn4_73_128 kernel: Modules linked in: bridge stp llc iptable_filter ip_tables mptctl mptbase xfs exportfs power_meter microcode dcdbas serio_raw iTCO_wdt iTCO_vendor_support i7core_edac edac_core sg bnx2 ext4 mbcache jbd2 sd_mod crc_t10dif wmi mpt2sas scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
> Aug 11 06:30:42 jn4_73_128 kernel:
> Aug 11 06:30:42 jn4_73_128 kernel: Pid: 11508, comm: jsvc Tainted: G W --------------- 2.6.32-279.el6.x86_64 #1 Dell Inc. PowerEdge R510/084YMW
> Aug 11 06:30:42 jn4_73_128 kernel: RIP: 0010:[<ffffffff8104d088>] [<ffffffff8104d088>] wait_for_rqlock+0x28/0x40
> Aug 11 06:30:42 jn4_73_128 kernel: RSP: 0018:ffff8807786c3ee8 EFLAGS: 00000202
> Aug 11 06:30:42 jn4_73_128 kernel: RAX: 00000000f6e9f6e1 RBX: ffff8807786c3ee8 RCX: ffff880028216680
> Aug 11 06:30:42 jn4_73_128 kernel: RDX: 00000000fffff6e9 RSI: ffff88061cd29370 RDI: 0000000000000286
> Aug 11 06:30:42 jn4_73_128 kernel: RBP: ffffffff8100bc0e R08: 0000000000000001 R09: 0000000000000001
> Aug 11 06:30:42 jn4_73_128 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000286
> Aug 11 06:30:42 jn4_73_128 kernel: R13: ffff8807786c3eb8 R14: ffffffff810e0f6e R15: ffff8807786c3e48
> Aug 11 06:30:42 jn4_73_128 kernel: FS: 0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
> Aug 11 06:30:42 jn4_73_128 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Aug 11 06:30:42 jn4_73_128 kernel: CR2: 0000000000e5bd70 CR3: 0000000001a85000 CR4: 00000000000006e0
> Aug 11 06:30:42 jn4_73_128 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Aug 11 06:30:42 jn4_73_128 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Aug 11 06:30:42 jn4_73_128 kernel: Process jsvc (pid: 11508, threadinfo ffff8807786c2000, task ffff880c1def3500)
> Aug 11 06:30:42 jn4_73_128 kernel: Stack:
> Aug 11 06:30:42 jn4_73_128 kernel: ffff8807786c3f68 ffffffff8107091b 0000000000000000 ffff8807786c3f28
> Aug 11 06:30:42 jn4_73_128 kernel: <d> ffff880701735260 ffff880c1def39c8 ffff880c1def39c8 0000000000000000
> Aug 11 06:30:42 jn4_73_128 kernel: <d> ffff8807786c3f28 ffff8807786c3f28 ffff8807786c3f78 00007f092d0ad700
> Aug 11 06:30:42 jn4_73_128 kernel: Call Trace:
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8107091b>] ? do_exit+0x5ab/0x870
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff81070ce7>] ? sys_exit+0x17/0x20
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
> Aug 11 06:30:42 jn4_73_128 kernel: Code: ff ff 90 55 48 89 e5 0f 1f 44 00 00 48 c7 c0 80 66 01 00 65 48 8b 0c 25 b0 e0 00 00 0f ae f0 48 01 c1 eb 09 0f 1f 80 00 00 00 00 <f3> 90 8b 01 89 c2 c1 fa 10 66 39 c2 75 f2 c9 c3 0f 1f 84 00 00
> Aug 11 06:30:42 jn4_73_128 kernel: Call Trace:
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8107091b>] ? do_exit+0x5ab/0x870
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff81070ce7>] ? sys_exit+0x17/0x20
> Aug 11 06:30:42 jn4_73_128 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
> Eventually the machines crashed. Opening a vmcore with the crash utility shows:
> crash /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux /opt/crash/127.0.0.1-2014-08-10-09\:47\:38/vmcore
> crash 6.1.0-5.el6
> Copyright (C) 2002-2012 Red Hat, Inc.
> Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
> Copyright (C) 1999-2006 Hewlett-Packard Co
> Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
> Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
> Copyright (C) 2005, 2011 NEC Corporation
> Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
> Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
> This program is free software, covered by the GNU General Public License,
> and you are welcome to change it and/or distribute copies of it under
> certain conditions. Enter "help copying" to see the conditions.
> This program has absolutely no warranty. Enter "help warranty" for details.
> GNU gdb (GDB) 7.3.1
> Copyright (C) 2011 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-unknown-linux-gnu"...
> please wait... (determining panic task)
> WARNING: active task ffff881071850040 on cpu 12 not found in PID hash
> KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux
> DUMPFILE: /opt/crash/127.0.0.1-2014-08-10-09:47:38/vmcore [PARTIAL DUMP]
> CPUS: 24
> DATE: Sun Aug 10 09:47:32 2014
> UPTIME: 7 days, 16:00:19
> LOAD AVERAGE: 11.01, 3.11, 1.08
> TASKS: 724
> NODENAME: master1.otocyon.com
> RELEASE: 2.6.32-431.5.1.el6.x86_64
> VERSION: #1 SMP Fri Jan 10 14:46:43 EST 2014
> MACHINE: x86_64 (1895 Mhz)
> MEMORY: 64 GB
> PANIC: "Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0"
> PID: 23976
> COMMAND: "sh"
> TASK: ffff881071850aa0 [THREAD_INFO: ffff880a05c80000]
> CPU: 0
> STATE: TASK_INTERRUPTIBLE (PANIC)
> crash> bt
> PID: 23976 TASK: ffff881071850aa0 CPU: 0 COMMAND: "sh"
> #0 [ffff880028207b50] machine_kexec at ffffffff81038f3b
> #1 [ffff880028207bb0] crash_kexec at ffffffff810c5d82
> #2 [ffff880028207c80] panic at ffffffff8152751a
> #3 [ffff880028207d00] watchdog_overflow_callback at ffffffff810e696d
> #4 [ffff880028207d20] __perf_event_overflow at ffffffff8111c847
> #5 [ffff880028207da0] perf_event_overflow at ffffffff8111ce14
> #6 [ffff880028207db0] intel_pmu_handle_irq at ffffffff81022d87
> #7 [ffff880028207e90] perf_event_nmi_handler at ffffffff8152bd69
> #8 [ffff880028207ea0] notifier_call_chain at ffffffff8152d825
> #9 [ffff880028207ee0] atomic_notifier_call_chain at ffffffff8152d88a
> #10 [ffff880028207ef0] notify_die at ffffffff810a153e
> #11 [ffff880028207f20] do_nmi at ffffffff8152b4eb
> It happened on machines from different vendors, and I have tried updating to the latest kernel from Red Hat. Can anyone with the same experience help?
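
As a triage aid, the repeated soft lockup reports in /var/log/messages can be counted per process and CPU, to check whether they always involve the same task. This is only a minimal sketch, assuming the "BUG: soft lockup" message format quoted in the report. The function name summarize_soft_lockups is hypothetical and not part of Hadoop or any kernel tooling.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "... kernel: BUG: soft lockup - CPU#1 stuck for 67s! [jsvc:11508]"
# Assumption: the format is the one quoted in this report (RHEL 6 kernels).
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+?):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(lines):
    """Count soft lockup reports per (command name, CPU number) pair."""
    counts = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            counts[(match.group("comm"), int(match.group("cpu")))] += 1
    return counts


if __name__ == "__main__":
    # Sample lines copied from the report; in practice, pass an open
    # handle to /var/log/messages instead of this list.
    sample = [
        "Aug 11 06:30:42 jn4_73_128 kernel: BUG: soft lockup - CPU#1 stuck for 67s! [jsvc:11508]",
        "Aug 11 06:30:42 jn4_73_128 kernel: CPU 1",
    ]
    for (comm, cpu), n in summarize_soft_lockups(sample).items():
        print(f"{comm} on CPU {cpu}: {n} report(s)")
```

If the counts spread across unrelated processes rather than only jsvc, that would be consistent with the resolution above that the cause is a driver or kernel problem rather than Hadoop itself.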