You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by Patai Sangbutsarakum <si...@gmail.com> on 2013/12/11 23:25:53 UTC

secondary namenode is hang at post

It just happens without changing anything in the cluster. Secondary
namenode node has been working fine until today i notice that in second
namenode log file stop at.

2013-12-11 13:17:41,083 INFO org.apache.hadoop.hdfs.server.common.Storage:
Image file of size 3941631662 saved in 61 seconds.
2013-12-11 13:19:15,446 INFO org.apache.hadoop.hdfs.server.common.Storage:
Image file of size 3941631662 saved in 94 seconds.
2013-12-11 13:19:29,760 INFO
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL
primarynn:50070putimage=1&port=50090&machine=snn2&token=-19:1944437056:1306462484827:1386562944000:1386559939508&newChecksum=2ad3e008becbd2cd5196b2b6ba1498ac


#date
Wed Dec 11 14:16:30 PST 2013

even after 1 hour passed it's not finish doing the checkpoint. looking at
timestamp of current/fsimage.ckpt at primary namenode; it doesn't show
progress in size and timestamp of the file is days ago.

already tried to clean the current in snn and restart secondarynamenode
process, but SNN still stop at the same spot even thought the snn process
is still exist.


-P

Re: secondary namenode is hang at post

Posted by Patai Sangbutsarakum <si...@gmail.com>.
Thanks adam,

2013-12-11 14:43:07
Full thread dump Java HotSpot(TM) 64-Bit Server VM (20.7-b02 mixed mode):

"Attach Listener" daemon prio=10 tid=0x0000000041057000 nid=0x6b95 waiting
on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"DestroyJavaVM" prio=10 tid=0x00007fb6e0009800 nid=0x1b6f waiting on
condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"SecondaryNameNode Status
Name Node Address    : pnn/x.x.x.x:9000
Start Time           : Wed Dec 11 13:04:22 PST 2013
Last Checkpoint Time : --
Checkpoint Period    : 3600 seconds
Checkpoint Size      : 67108864 MB
Checkpoint Dirs      : [/hadoop/namesecondary, /hadoop/nfs_namesecondary]
Checkpoint Edits Dirs: [/hadoop/namesecondary, /hadoop/nfs_namesecondary]"
daemon prio=10 tid=0x00007fb6e0471800 nid=0x1b95 runnable
[0x00007fad08f43000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
- locked <0x00007fb4077b5108> (a java.io.BufferedInputStream)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:690)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
- locked <0x00007fb4077a9218> (a
sun.net.www.protocol.http.HttpURLConnection)
at
sun.net.www.protocol.http.HttpURLConnection.getHeaderField(HttpURLConnection.java:2300)
at
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:192)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.putFSImage(SecondaryNameNode.java:404)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:473)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:333)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:297)
at java.lang.Thread.run(Thread.java:662)

"Timer-0" daemon prio=10 tid=0x00007fb6e0484000 nid=0x1b94 in Object.wait()
[0x00007fad09044000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007fad3e671138> (a java.util.TaskQueue)
at java.util.TimerThread.mainLoop(Timer.java:509)
- locked <0x00007fad3e671138> (a java.util.TaskQueue)
at java.util.TimerThread.run(Timer.java:462)

"1408732008@qtp-421539177-1 - Acceptor0 SelectChannelConnector@snn:50090"
prio=10 tid=0x00007fb6e0478800 nid=0x1b93 runnable [0x00007fad09145000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
- locked <0x00007fad3e6c8f48> (a sun.nio.ch.Util$2)
- locked <0x00007fad3e6c8f60> (a java.util.Collections$UnmodifiableSet)
- locked <0x00007fad3e6c8ed0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
at
org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:511)
at org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:193)
at
org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
at
org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:708)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

"162291475@qtp-421539177-0" prio=10 tid=0x00007fb6e0477800 nid=0x1b92 in
Object.wait() [0x00007fad09246000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007fad3e6711a0> (a
org.mortbay.thread.QueuedThreadPool$PoolThread)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:626)
- locked <0x00007fad3e6711a0> (a
org.mortbay.thread.QueuedThreadPool$PoolThread)

"RMI TCP Accept-0" daemon prio=10 tid=0x00007fb6e0322000 nid=0x1b8e
runnable [0x00007fad09cd1000]
   java.lang.Thread.State: RUNNABLE
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
- locked <0x00007fad3e671330> (a java.net.SocksSocketImpl)
at java.net.ServerSocket.implAccept(ServerSocket.java:462)
at java.net.ServerSocket.accept(ServerSocket.java:430)
at
sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(LocalRMIServerSocketFactory.java:34)
at
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
at java.lang.Thread.run(Thread.java:662)

"Low Memory Detector" daemon prio=10 tid=0x00007fb6e00b8800 nid=0x1b8d
runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x00007fb6e00b6800 nid=0x1b8c
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x00007fb6e00b3800 nid=0x1b8b
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007fb6e00b1800 nid=0x1b8a
runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007fb6e0094800 nid=0x1b89 in
Object.wait() [0x00007fad0aa8d000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007fad3e671898> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
- locked <0x00007fad3e671898> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x00007fb6e0092800 nid=0x1b88 in
Object.wait() [0x00007fad0ab8e000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007fad3e669458> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:485)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
- locked <0x00007fad3e669458> (a java.lang.ref.Reference$Lock)

"VM Thread" prio=10 tid=0x00007fb6e008c000 nid=0x1b87 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007fb6e001c800 nid=0x1b70
runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007fb6e001e800 nid=0x1b71
runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007fb6e0020000 nid=0x1b72
runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007fb6e0022000 nid=0x1b73
runnable

"GC task thread#4 (ParallelGC)" prio=10 tid=0x00007fb6e0024000 nid=0x1b74
runnable

"GC task thread#5 (ParallelGC)" prio=10 tid=0x00007fb6e0025800 nid=0x1b75
runnable

"GC task thread#6 (ParallelGC)" prio=10 tid=0x00007fb6e0027800 nid=0x1b76
runnable

"GC task thread#7 (ParallelGC)" prio=10 tid=0x00007fb6e0029800 nid=0x1b77
runnable

"GC task thread#8 (ParallelGC)" prio=10 tid=0x00007fb6e002b000 nid=0x1b78
runnable

"GC task thread#9 (ParallelGC)" prio=10 tid=0x00007fb6e002d000 nid=0x1b79
runnable

"GC task thread#10 (ParallelGC)" prio=10 tid=0x00007fb6e002f000 nid=0x1b7a
runnable

"GC task thread#11 (ParallelGC)" prio=10 tid=0x00007fb6e0030800 nid=0x1b7b
runnable

"GC task thread#12 (ParallelGC)" prio=10 tid=0x00007fb6e0032800 nid=0x1b7c
runnable

"GC task thread#13 (ParallelGC)" prio=10 tid=0x00007fb6e0034800 nid=0x1b7d
runnable

"GC task thread#14 (ParallelGC)" prio=10 tid=0x00007fb6e0036000 nid=0x1b7e
runnable

"GC task thread#15 (ParallelGC)" prio=10 tid=0x00007fb6e0038000 nid=0x1b7f
runnable

"GC task thread#16 (ParallelGC)" prio=10 tid=0x00007fb6e003a000 nid=0x1b80
runnable

"GC task thread#17 (ParallelGC)" prio=10 tid=0x00007fb6e003b800 nid=0x1b81
runnable

"GC task thread#18 (ParallelGC)" prio=10 tid=0x00007fb6e003d800 nid=0x1b82
runnable

"GC task thread#19 (ParallelGC)" prio=10 tid=0x00007fb6e003f800 nid=0x1b83
runnable

"GC task thread#20 (ParallelGC)" prio=10 tid=0x00007fb6e0041000 nid=0x1b84
runnable

"GC task thread#21 (ParallelGC)" prio=10 tid=0x00007fb6e0043000 nid=0x1b85
runnable

"GC task thread#22 (ParallelGC)" prio=10 tid=0x00007fb6e0045000 nid=0x1b86
runnable

"VM Periodic Task Thread" prio=10 tid=0x00007fb6e0335800 nid=0x1b8f waiting
on condition

JNI global references: 1363



On Wed, Dec 11, 2013 at 2:40 PM, Adam Kawa <ka...@gmail.com> wrote:

> It looks that it can not copy the new checkpoint into the NameNode. Can
> you copy-past what jstack says?
>
> $ sudo -u hdfs jstack <snn-pid>
>
>
>
> 2013/12/11 Patai Sangbutsarakum <si...@gmail.com>
>
>> It just happens without changing anything in the cluster. Secondary
>> namenode node has been working fine until today i notice that in second
>> namenode log file stop at.
>>
>> 2013-12-11 13:17:41,083 INFO
>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size 3941631662
>> saved in 61 seconds.
>> 2013-12-11 13:19:15,446 INFO
>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size 3941631662
>> saved in 94 seconds.
>> 2013-12-11 13:19:29,760 INFO
>> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL
>> primarynn:50070putimage=1&port=50090&machine=snn2&token=-19:1944437056:1306462484827:1386562944000:1386559939508&newChecksum=2ad3e008becbd2cd5196b2b6ba1498ac
>>
>>
>> #date
>> Wed Dec 11 14:16:30 PST 2013
>>
>> even after 1 hour passed it's not finish doing the checkpoint. looking at
>> timestamp of current/fsimage.ckpt at primary namenode; it doesn't show
>> progress in size and timestamp of the file is days ago.
>>
>> already tried to clean the current in snn and restart secondarynamenode
>> process, but SNN still stop at the same spot even thought the snn process
>> is still exist.
>>
>>
>> -P
>>
>
>

Re: secondary namenode is hang at post

Posted by Patai Sangbutsarakum <si...@gmail.com>.
Thanks adam,

2013-12-11 14:43:07
Full thread dump Java HotSpot(TM) 64-Bit Server VM (20.7-b02 mixed mode):

"Attach Listener" daemon prio=10 tid=0x0000000041057000 nid=0x6b95 waiting
on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"DestroyJavaVM" prio=10 tid=0x00007fb6e0009800 nid=0x1b6f waiting on
condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"SecondaryNameNode Status
Name Node Address    : pnn/x.x.x.x:9000
Start Time           : Wed Dec 11 13:04:22 PST 2013
Last Checkpoint Time : --
Checkpoint Period    : 3600 seconds
Checkpoint Size      : 67108864 MB
Checkpoint Dirs      : [/hadoop/namesecondary, /hadoop/nfs_namesecondary]
Checkpoint Edits Dirs: [/hadoop/namesecondary, /hadoop/nfs_namesecondary]"
daemon prio=10 tid=0x00007fb6e0471800 nid=0x1b95 runnable
[0x00007fad08f43000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
- locked <0x00007fb4077b5108> (a java.io.BufferedInputStream)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:690)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
- locked <0x00007fb4077a9218> (a
sun.net.www.protocol.http.HttpURLConnection)
at
sun.net.www.protocol.http.HttpURLConnection.getHeaderField(HttpURLConnection.java:2300)
at
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:192)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.putFSImage(SecondaryNameNode.java:404)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:473)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:333)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:297)
at java.lang.Thread.run(Thread.java:662)

"Timer-0" daemon prio=10 tid=0x00007fb6e0484000 nid=0x1b94 in Object.wait()
[0x00007fad09044000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007fad3e671138> (a java.util.TaskQueue)
at java.util.TimerThread.mainLoop(Timer.java:509)
- locked <0x00007fad3e671138> (a java.util.TaskQueue)
at java.util.TimerThread.run(Timer.java:462)

"1408732008@qtp-421539177-1 - Acceptor0 SelectChannelConnector@snn:50090"
prio=10 tid=0x00007fb6e0478800 nid=0x1b93 runnable [0x00007fad09145000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
- locked <0x00007fad3e6c8f48> (a sun.nio.ch.Util$2)
- locked <0x00007fad3e6c8f60> (a java.util.Collections$UnmodifiableSet)
- locked <0x00007fad3e6c8ed0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
at
org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:511)
at org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:193)
at
org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
at
org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:708)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

"162291475@qtp-421539177-0" prio=10 tid=0x00007fb6e0477800 nid=0x1b92 in
Object.wait() [0x00007fad09246000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007fad3e6711a0> (a
org.mortbay.thread.QueuedThreadPool$PoolThread)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:626)
- locked <0x00007fad3e6711a0> (a
org.mortbay.thread.QueuedThreadPool$PoolThread)

"RMI TCP Accept-0" daemon prio=10 tid=0x00007fb6e0322000 nid=0x1b8e
runnable [0x00007fad09cd1000]
   java.lang.Thread.State: RUNNABLE
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
- locked <0x00007fad3e671330> (a java.net.SocksSocketImpl)
at java.net.ServerSocket.implAccept(ServerSocket.java:462)
at java.net.ServerSocket.accept(ServerSocket.java:430)
at
sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(LocalRMIServerSocketFactory.java:34)
at
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
at java.lang.Thread.run(Thread.java:662)

"Low Memory Detector" daemon prio=10 tid=0x00007fb6e00b8800 nid=0x1b8d
runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x00007fb6e00b6800 nid=0x1b8c
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x00007fb6e00b3800 nid=0x1b8b
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007fb6e00b1800 nid=0x1b8a
runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007fb6e0094800 nid=0x1b89 in
Object.wait() [0x00007fad0aa8d000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007fad3e671898> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
- locked <0x00007fad3e671898> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x00007fb6e0092800 nid=0x1b88 in
Object.wait() [0x00007fad0ab8e000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007fad3e669458> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:485)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
- locked <0x00007fad3e669458> (a java.lang.ref.Reference$Lock)

"VM Thread" prio=10 tid=0x00007fb6e008c000 nid=0x1b87 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007fb6e001c800 nid=0x1b70
runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007fb6e001e800 nid=0x1b71
runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007fb6e0020000 nid=0x1b72
runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007fb6e0022000 nid=0x1b73
runnable

"GC task thread#4 (ParallelGC)" prio=10 tid=0x00007fb6e0024000 nid=0x1b74
runnable

"GC task thread#5 (ParallelGC)" prio=10 tid=0x00007fb6e0025800 nid=0x1b75
runnable

"GC task thread#6 (ParallelGC)" prio=10 tid=0x00007fb6e0027800 nid=0x1b76
runnable

"GC task thread#7 (ParallelGC)" prio=10 tid=0x00007fb6e0029800 nid=0x1b77
runnable

"GC task thread#8 (ParallelGC)" prio=10 tid=0x00007fb6e002b000 nid=0x1b78
runnable

"GC task thread#9 (ParallelGC)" prio=10 tid=0x00007fb6e002d000 nid=0x1b79
runnable

"GC task thread#10 (ParallelGC)" prio=10 tid=0x00007fb6e002f000 nid=0x1b7a
runnable

"GC task thread#11 (ParallelGC)" prio=10 tid=0x00007fb6e0030800 nid=0x1b7b
runnable

"GC task thread#12 (ParallelGC)" prio=10 tid=0x00007fb6e0032800 nid=0x1b7c
runnable

"GC task thread#13 (ParallelGC)" prio=10 tid=0x00007fb6e0034800 nid=0x1b7d
runnable

"GC task thread#14 (ParallelGC)" prio=10 tid=0x00007fb6e0036000 nid=0x1b7e
runnable

"GC task thread#15 (ParallelGC)" prio=10 tid=0x00007fb6e0038000 nid=0x1b7f
runnable

"GC task thread#16 (ParallelGC)" prio=10 tid=0x00007fb6e003a000 nid=0x1b80
runnable

"GC task thread#17 (ParallelGC)" prio=10 tid=0x00007fb6e003b800 nid=0x1b81
runnable

"GC task thread#18 (ParallelGC)" prio=10 tid=0x00007fb6e003d800 nid=0x1b82
runnable

"GC task thread#19 (ParallelGC)" prio=10 tid=0x00007fb6e003f800 nid=0x1b83
runnable

"GC task thread#20 (ParallelGC)" prio=10 tid=0x00007fb6e0041000 nid=0x1b84
runnable

"GC task thread#21 (ParallelGC)" prio=10 tid=0x00007fb6e0043000 nid=0x1b85
runnable

"GC task thread#22 (ParallelGC)" prio=10 tid=0x00007fb6e0045000 nid=0x1b86
runnable

"VM Periodic Task Thread" prio=10 tid=0x00007fb6e0335800 nid=0x1b8f waiting
on condition

JNI global references: 1363



On Wed, Dec 11, 2013 at 2:40 PM, Adam Kawa <ka...@gmail.com> wrote:

> It looks that it can not copy the new checkpoint into the NameNode. Can
> you copy-past what jstack says?
>
> $ sudo -u hdfs jstack <snn-pid>
>
>
>
> 2013/12/11 Patai Sangbutsarakum <si...@gmail.com>
>
>> It just happens without changing anything in the cluster. Secondary
>> namenode node has been working fine until today i notice that in second
>> namenode log file stop at.
>>
>> 2013-12-11 13:17:41,083 INFO
>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size 3941631662
>> saved in 61 seconds.
>> 2013-12-11 13:19:15,446 INFO
>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size 3941631662
>> saved in 94 seconds.
>> 2013-12-11 13:19:29,760 INFO
>> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL
>> primarynn:50070putimage=1&port=50090&machine=snn2&token=-19:1944437056:1306462484827:1386562944000:1386559939508&newChecksum=2ad3e008becbd2cd5196b2b6ba1498ac
>>
>>
>> #date
>> Wed Dec 11 14:16:30 PST 2013
>>
>> even after 1 hour passed it's not finish doing the checkpoint. looking at
>> timestamp of current/fsimage.ckpt at primary namenode; it doesn't show
>> progress in size and timestamp of the file is days ago.
>>
>> already tried to clean the current in snn and restart secondarynamenode
>> process, but SNN still stop at the same spot even thought the snn process
>> is still exist.
>>
>>
>> -P
>>
>
>

Re: secondary namenode is hang at post

Posted by Patai Sangbutsarakum <si...@gmail.com>.
Thanks adam,

2013-12-11 14:43:07
Full thread dump Java HotSpot(TM) 64-Bit Server VM (20.7-b02 mixed mode):

"Attach Listener" daemon prio=10 tid=0x0000000041057000 nid=0x6b95 waiting
on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"DestroyJavaVM" prio=10 tid=0x00007fb6e0009800 nid=0x1b6f waiting on
condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"SecondaryNameNode Status
Name Node Address    : pnn/x.x.x.x:9000
Start Time           : Wed Dec 11 13:04:22 PST 2013
Last Checkpoint Time : --
Checkpoint Period    : 3600 seconds
Checkpoint Size      : 67108864 MB
Checkpoint Dirs      : [/hadoop/namesecondary, /hadoop/nfs_namesecondary]
Checkpoint Edits Dirs: [/hadoop/namesecondary, /hadoop/nfs_namesecondary]"
daemon prio=10 tid=0x00007fb6e0471800 nid=0x1b95 runnable
[0x00007fad08f43000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
- locked <0x00007fb4077b5108> (a java.io.BufferedInputStream)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:690)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
- locked <0x00007fb4077a9218> (a
sun.net.www.protocol.http.HttpURLConnection)
at
sun.net.www.protocol.http.HttpURLConnection.getHeaderField(HttpURLConnection.java:2300)
at
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:192)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.putFSImage(SecondaryNameNode.java:404)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:473)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:333)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:297)
at java.lang.Thread.run(Thread.java:662)

"Timer-0" daemon prio=10 tid=0x00007fb6e0484000 nid=0x1b94 in Object.wait()
[0x00007fad09044000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007fad3e671138> (a java.util.TaskQueue)
at java.util.TimerThread.mainLoop(Timer.java:509)
- locked <0x00007fad3e671138> (a java.util.TaskQueue)
at java.util.TimerThread.run(Timer.java:462)

"1408732008@qtp-421539177-1 - Acceptor0 SelectChannelConnector@snn:50090"
prio=10 tid=0x00007fb6e0478800 nid=0x1b93 runnable [0x00007fad09145000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
- locked <0x00007fad3e6c8f48> (a sun.nio.ch.Util$2)
- locked <0x00007fad3e6c8f60> (a java.util.Collections$UnmodifiableSet)
- locked <0x00007fad3e6c8ed0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
at
org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:511)
at org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:193)
at
org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
at
org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:708)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

"162291475@qtp-421539177-0" prio=10 tid=0x00007fb6e0477800 nid=0x1b92 in
Object.wait() [0x00007fad09246000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007fad3e6711a0> (a
org.mortbay.thread.QueuedThreadPool$PoolThread)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:626)
- locked <0x00007fad3e6711a0> (a
org.mortbay.thread.QueuedThreadPool$PoolThread)

"RMI TCP Accept-0" daemon prio=10 tid=0x00007fb6e0322000 nid=0x1b8e
runnable [0x00007fad09cd1000]
   java.lang.Thread.State: RUNNABLE
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
- locked <0x00007fad3e671330> (a java.net.SocksSocketImpl)
at java.net.ServerSocket.implAccept(ServerSocket.java:462)
at java.net.ServerSocket.accept(ServerSocket.java:430)
at
sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(LocalRMIServerSocketFactory.java:34)
at
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
at java.lang.Thread.run(Thread.java:662)

"Low Memory Detector" daemon prio=10 tid=0x00007fb6e00b8800 nid=0x1b8d
runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x00007fb6e00b6800 nid=0x1b8c
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x00007fb6e00b3800 nid=0x1b8b
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007fb6e00b1800 nid=0x1b8a
runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007fb6e0094800 nid=0x1b89 in
Object.wait() [0x00007fad0aa8d000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007fad3e671898> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
- locked <0x00007fad3e671898> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x00007fb6e0092800 nid=0x1b88 in
Object.wait() [0x00007fad0ab8e000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007fad3e669458> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:485)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
- locked <0x00007fad3e669458> (a java.lang.ref.Reference$Lock)

"VM Thread" prio=10 tid=0x00007fb6e008c000 nid=0x1b87 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007fb6e001c800 nid=0x1b70
runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007fb6e001e800 nid=0x1b71
runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007fb6e0020000 nid=0x1b72
runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007fb6e0022000 nid=0x1b73
runnable

"GC task thread#4 (ParallelGC)" prio=10 tid=0x00007fb6e0024000 nid=0x1b74
runnable

"GC task thread#5 (ParallelGC)" prio=10 tid=0x00007fb6e0025800 nid=0x1b75
runnable

"GC task thread#6 (ParallelGC)" prio=10 tid=0x00007fb6e0027800 nid=0x1b76
runnable

"GC task thread#7 (ParallelGC)" prio=10 tid=0x00007fb6e0029800 nid=0x1b77
runnable

"GC task thread#8 (ParallelGC)" prio=10 tid=0x00007fb6e002b000 nid=0x1b78
runnable

"GC task thread#9 (ParallelGC)" prio=10 tid=0x00007fb6e002d000 nid=0x1b79
runnable

"GC task thread#10 (ParallelGC)" prio=10 tid=0x00007fb6e002f000 nid=0x1b7a
runnable

"GC task thread#11 (ParallelGC)" prio=10 tid=0x00007fb6e0030800 nid=0x1b7b
runnable

"GC task thread#12 (ParallelGC)" prio=10 tid=0x00007fb6e0032800 nid=0x1b7c
runnable

"GC task thread#13 (ParallelGC)" prio=10 tid=0x00007fb6e0034800 nid=0x1b7d
runnable

"GC task thread#14 (ParallelGC)" prio=10 tid=0x00007fb6e0036000 nid=0x1b7e
runnable

"GC task thread#15 (ParallelGC)" prio=10 tid=0x00007fb6e0038000 nid=0x1b7f
runnable

"GC task thread#16 (ParallelGC)" prio=10 tid=0x00007fb6e003a000 nid=0x1b80
runnable

"GC task thread#17 (ParallelGC)" prio=10 tid=0x00007fb6e003b800 nid=0x1b81
runnable

"GC task thread#18 (ParallelGC)" prio=10 tid=0x00007fb6e003d800 nid=0x1b82
runnable

"GC task thread#19 (ParallelGC)" prio=10 tid=0x00007fb6e003f800 nid=0x1b83
runnable

"GC task thread#20 (ParallelGC)" prio=10 tid=0x00007fb6e0041000 nid=0x1b84
runnable

"GC task thread#21 (ParallelGC)" prio=10 tid=0x00007fb6e0043000 nid=0x1b85
runnable

"GC task thread#22 (ParallelGC)" prio=10 tid=0x00007fb6e0045000 nid=0x1b86
runnable

"VM Periodic Task Thread" prio=10 tid=0x00007fb6e0335800 nid=0x1b8f waiting
on condition

JNI global references: 1363



On Wed, Dec 11, 2013 at 2:40 PM, Adam Kawa <ka...@gmail.com> wrote:

> It looks that it can not copy the new checkpoint into the NameNode. Can
> you copy-past what jstack says?
>
> $ sudo -u hdfs jstack <snn-pid>
>
>
>
> 2013/12/11 Patai Sangbutsarakum <si...@gmail.com>
>
>> It just happens without changing anything in the cluster. Secondary
>> namenode node has been working fine until today i notice that in second
>> namenode log file stop at.
>>
>> 2013-12-11 13:17:41,083 INFO
>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size 3941631662
>> saved in 61 seconds.
>> 2013-12-11 13:19:15,446 INFO
>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size 3941631662
>> saved in 94 seconds.
>> 2013-12-11 13:19:29,760 INFO
>> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL
>> primarynn:50070putimage=1&port=50090&machine=snn2&token=-19:1944437056:1306462484827:1386562944000:1386559939508&newChecksum=2ad3e008becbd2cd5196b2b6ba1498ac
>>
>>
>> #date
>> Wed Dec 11 14:16:30 PST 2013
>>
>> even after 1 hour passed it's not finish doing the checkpoint. looking at
>> timestamp of current/fsimage.ckpt at primary namenode; it doesn't show
>> progress in size and timestamp of the file is days ago.
>>
>> already tried to clean the current in snn and restart secondarynamenode
>> process, but SNN still stop at the same spot even thought the snn process
>> is still exist.
>>
>>
>> -P
>>
>
>

Re: secondary namenode is hang at post

Posted by Patai Sangbutsarakum <si...@gmail.com>.
Thanks adam,

2013-12-11 14:43:07
Full thread dump Java HotSpot(TM) 64-Bit Server VM (20.7-b02 mixed mode):

"Attach Listener" daemon prio=10 tid=0x0000000041057000 nid=0x6b95 waiting
on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"DestroyJavaVM" prio=10 tid=0x00007fb6e0009800 nid=0x1b6f waiting on
condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"SecondaryNameNode Status
Name Node Address    : pnn/x.x.x.x:9000
Start Time           : Wed Dec 11 13:04:22 PST 2013
Last Checkpoint Time : --
Checkpoint Period    : 3600 seconds
Checkpoint Size      : 67108864 MB
Checkpoint Dirs      : [/hadoop/namesecondary, /hadoop/nfs_namesecondary]
Checkpoint Edits Dirs: [/hadoop/namesecondary, /hadoop/nfs_namesecondary]"
daemon prio=10 tid=0x00007fb6e0471800 nid=0x1b95 runnable
[0x00007fad08f43000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
- locked <0x00007fb4077b5108> (a java.io.BufferedInputStream)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:690)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
- locked <0x00007fb4077a9218> (a
sun.net.www.protocol.http.HttpURLConnection)
at
sun.net.www.protocol.http.HttpURLConnection.getHeaderField(HttpURLConnection.java:2300)
at
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:192)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.putFSImage(SecondaryNameNode.java:404)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:473)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:333)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:297)
at java.lang.Thread.run(Thread.java:662)

"Timer-0" daemon prio=10 tid=0x00007fb6e0484000 nid=0x1b94 in Object.wait()
[0x00007fad09044000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007fad3e671138> (a java.util.TaskQueue)
at java.util.TimerThread.mainLoop(Timer.java:509)
- locked <0x00007fad3e671138> (a java.util.TaskQueue)
at java.util.TimerThread.run(Timer.java:462)

"1408732008@qtp-421539177-1 - Acceptor0 SelectChannelConnector@snn:50090"
prio=10 tid=0x00007fb6e0478800 nid=0x1b93 runnable [0x00007fad09145000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
- locked <0x00007fad3e6c8f48> (a sun.nio.ch.Util$2)
- locked <0x00007fad3e6c8f60> (a java.util.Collections$UnmodifiableSet)
- locked <0x00007fad3e6c8ed0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
at
org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:511)
at org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:193)
at
org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
at
org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:708)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

"162291475@qtp-421539177-0" prio=10 tid=0x00007fb6e0477800 nid=0x1b92 in
Object.wait() [0x00007fad09246000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007fad3e6711a0> (a
org.mortbay.thread.QueuedThreadPool$PoolThread)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:626)
- locked <0x00007fad3e6711a0> (a
org.mortbay.thread.QueuedThreadPool$PoolThread)

"RMI TCP Accept-0" daemon prio=10 tid=0x00007fb6e0322000 nid=0x1b8e
runnable [0x00007fad09cd1000]
   java.lang.Thread.State: RUNNABLE
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
- locked <0x00007fad3e671330> (a java.net.SocksSocketImpl)
at java.net.ServerSocket.implAccept(ServerSocket.java:462)
at java.net.ServerSocket.accept(ServerSocket.java:430)
at
sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(LocalRMIServerSocketFactory.java:34)
at
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
at java.lang.Thread.run(Thread.java:662)

"Low Memory Detector" daemon prio=10 tid=0x00007fb6e00b8800 nid=0x1b8d
runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x00007fb6e00b6800 nid=0x1b8c
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x00007fb6e00b3800 nid=0x1b8b
waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007fb6e00b1800 nid=0x1b8a
runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007fb6e0094800 nid=0x1b89 in
Object.wait() [0x00007fad0aa8d000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007fad3e671898> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
- locked <0x00007fad3e671898> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x00007fb6e0092800 nid=0x1b88 in
Object.wait() [0x00007fad0ab8e000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007fad3e669458> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:485)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
- locked <0x00007fad3e669458> (a java.lang.ref.Reference$Lock)

"VM Thread" prio=10 tid=0x00007fb6e008c000 nid=0x1b87 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007fb6e001c800 nid=0x1b70
runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007fb6e001e800 nid=0x1b71
runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007fb6e0020000 nid=0x1b72
runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007fb6e0022000 nid=0x1b73
runnable

"GC task thread#4 (ParallelGC)" prio=10 tid=0x00007fb6e0024000 nid=0x1b74
runnable

"GC task thread#5 (ParallelGC)" prio=10 tid=0x00007fb6e0025800 nid=0x1b75
runnable

"GC task thread#6 (ParallelGC)" prio=10 tid=0x00007fb6e0027800 nid=0x1b76
runnable

"GC task thread#7 (ParallelGC)" prio=10 tid=0x00007fb6e0029800 nid=0x1b77
runnable

"GC task thread#8 (ParallelGC)" prio=10 tid=0x00007fb6e002b000 nid=0x1b78
runnable

"GC task thread#9 (ParallelGC)" prio=10 tid=0x00007fb6e002d000 nid=0x1b79
runnable

"GC task thread#10 (ParallelGC)" prio=10 tid=0x00007fb6e002f000 nid=0x1b7a
runnable

"GC task thread#11 (ParallelGC)" prio=10 tid=0x00007fb6e0030800 nid=0x1b7b
runnable

"GC task thread#12 (ParallelGC)" prio=10 tid=0x00007fb6e0032800 nid=0x1b7c
runnable

"GC task thread#13 (ParallelGC)" prio=10 tid=0x00007fb6e0034800 nid=0x1b7d
runnable

"GC task thread#14 (ParallelGC)" prio=10 tid=0x00007fb6e0036000 nid=0x1b7e
runnable

"GC task thread#15 (ParallelGC)" prio=10 tid=0x00007fb6e0038000 nid=0x1b7f
runnable

"GC task thread#16 (ParallelGC)" prio=10 tid=0x00007fb6e003a000 nid=0x1b80
runnable

"GC task thread#17 (ParallelGC)" prio=10 tid=0x00007fb6e003b800 nid=0x1b81
runnable

"GC task thread#18 (ParallelGC)" prio=10 tid=0x00007fb6e003d800 nid=0x1b82
runnable

"GC task thread#19 (ParallelGC)" prio=10 tid=0x00007fb6e003f800 nid=0x1b83
runnable

"GC task thread#20 (ParallelGC)" prio=10 tid=0x00007fb6e0041000 nid=0x1b84
runnable

"GC task thread#21 (ParallelGC)" prio=10 tid=0x00007fb6e0043000 nid=0x1b85
runnable

"GC task thread#22 (ParallelGC)" prio=10 tid=0x00007fb6e0045000 nid=0x1b86
runnable

"VM Periodic Task Thread" prio=10 tid=0x00007fb6e0335800 nid=0x1b8f waiting
on condition

JNI global references: 1363



On Wed, Dec 11, 2013 at 2:40 PM, Adam Kawa <ka...@gmail.com> wrote:

> It looks that it can not copy the new checkpoint into the NameNode. Can
> you copy-past what jstack says?
>
> $ sudo -u hdfs jstack <snn-pid>
>
>
>
> 2013/12/11 Patai Sangbutsarakum <si...@gmail.com>
>
>> It just happens without changing anything in the cluster. Secondary
>> namenode node has been working fine until today i notice that in second
>> namenode log file stop at.
>>
>> 2013-12-11 13:17:41,083 INFO
>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size 3941631662
>> saved in 61 seconds.
>> 2013-12-11 13:19:15,446 INFO
>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size 3941631662
>> saved in 94 seconds.
>> 2013-12-11 13:19:29,760 INFO
>> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL
>> primarynn:50070putimage=1&port=50090&machine=snn2&token=-19:1944437056:1306462484827:1386562944000:1386559939508&newChecksum=2ad3e008becbd2cd5196b2b6ba1498ac
>>
>>
>> #date
>> Wed Dec 11 14:16:30 PST 2013
>>
>> even after 1 hour passed it's not finish doing the checkpoint. looking at
>> timestamp of current/fsimage.ckpt at primary namenode; it doesn't show
>> progress in size and timestamp of the file is days ago.
>>
>> already tried to clean the current in snn and restart secondarynamenode
>> process, but SNN still stop at the same spot even thought the snn process
>> is still exist.
>>
>>
>> -P
>>
>
>

Re: secondary namenode is hang at post

Posted by Adam Kawa <ka...@gmail.com>.
It looks that it can not copy the new checkpoint into the NameNode. Can you
copy-past what jstack says?

$ sudo -u hdfs jstack <snn-pid>



2013/12/11 Patai Sangbutsarakum <si...@gmail.com>

> It just happens without changing anything in the cluster. Secondary
> namenode node has been working fine until today i notice that in second
> namenode log file stop at.
>
> 2013-12-11 13:17:41,083 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Image file of size 3941631662 saved in 61 seconds.
> 2013-12-11 13:19:15,446 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Image file of size 3941631662 saved in 94 seconds.
> 2013-12-11 13:19:29,760 INFO
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL
> primarynn:50070putimage=1&port=50090&machine=snn2&token=-19:1944437056:1306462484827:1386562944000:1386559939508&newChecksum=2ad3e008becbd2cd5196b2b6ba1498ac
>
>
> #date
> Wed Dec 11 14:16:30 PST 2013
>
> even after 1 hour passed it's not finish doing the checkpoint. looking at
> timestamp of current/fsimage.ckpt at primary namenode; it doesn't show
> progress in size and timestamp of the file is days ago.
>
> already tried to clean the current in snn and restart secondarynamenode
> process, but SNN still stop at the same spot even thought the snn process
> is still exist.
>
>
> -P
>

Re: secondary namenode is hang at post

Posted by Adam Kawa <ka...@gmail.com>.
It looks that it can not copy the new checkpoint into the NameNode. Can you
copy-past what jstack says?

$ sudo -u hdfs jstack <snn-pid>



2013/12/11 Patai Sangbutsarakum <si...@gmail.com>

> It just happens without changing anything in the cluster. Secondary
> namenode node has been working fine until today i notice that in second
> namenode log file stop at.
>
> 2013-12-11 13:17:41,083 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Image file of size 3941631662 saved in 61 seconds.
> 2013-12-11 13:19:15,446 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Image file of size 3941631662 saved in 94 seconds.
> 2013-12-11 13:19:29,760 INFO
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL
> primarynn:50070putimage=1&port=50090&machine=snn2&token=-19:1944437056:1306462484827:1386562944000:1386559939508&newChecksum=2ad3e008becbd2cd5196b2b6ba1498ac
>
>
> #date
> Wed Dec 11 14:16:30 PST 2013
>
> even after 1 hour passed it's not finish doing the checkpoint. looking at
> timestamp of current/fsimage.ckpt at primary namenode; it doesn't show
> progress in size and timestamp of the file is days ago.
>
> already tried to clean the current in snn and restart secondarynamenode
> process, but SNN still stop at the same spot even thought the snn process
> is still exist.
>
>
> -P
>

Re: secondary namenode is hang at post

Posted by Adam Kawa <ka...@gmail.com>.
It looks that it can not copy the new checkpoint into the NameNode. Can you
copy-past what jstack says?

$ sudo -u hdfs jstack <snn-pid>



2013/12/11 Patai Sangbutsarakum <si...@gmail.com>

> It just happens without changing anything in the cluster. Secondary
> namenode node has been working fine until today i notice that in second
> namenode log file stop at.
>
> 2013-12-11 13:17:41,083 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Image file of size 3941631662 saved in 61 seconds.
> 2013-12-11 13:19:15,446 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Image file of size 3941631662 saved in 94 seconds.
> 2013-12-11 13:19:29,760 INFO
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL
> primarynn:50070putimage=1&port=50090&machine=snn2&token=-19:1944437056:1306462484827:1386562944000:1386559939508&newChecksum=2ad3e008becbd2cd5196b2b6ba1498ac
>
>
> #date
> Wed Dec 11 14:16:30 PST 2013
>
> even after 1 hour passed it's not finish doing the checkpoint. looking at
> timestamp of current/fsimage.ckpt at primary namenode; it doesn't show
> progress in size and timestamp of the file is days ago.
>
> already tried to clean the current in snn and restart secondarynamenode
> process, but SNN still stop at the same spot even thought the snn process
> is still exist.
>
>
> -P
>

Re: secondary namenode is hang at post

Posted by Adam Kawa <ka...@gmail.com>.
It looks that it can not copy the new checkpoint into the NameNode. Can you
copy-past what jstack says?

$ sudo -u hdfs jstack <snn-pid>



2013/12/11 Patai Sangbutsarakum <si...@gmail.com>

> It just happens without changing anything in the cluster. Secondary
> namenode node has been working fine until today i notice that in second
> namenode log file stop at.
>
> 2013-12-11 13:17:41,083 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Image file of size 3941631662 saved in 61 seconds.
> 2013-12-11 13:19:15,446 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Image file of size 3941631662 saved in 94 seconds.
> 2013-12-11 13:19:29,760 INFO
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL
> primarynn:50070putimage=1&port=50090&machine=snn2&token=-19:1944437056:1306462484827:1386562944000:1386559939508&newChecksum=2ad3e008becbd2cd5196b2b6ba1498ac
>
>
> #date
> Wed Dec 11 14:16:30 PST 2013
>
> even after 1 hour passed it's not finish doing the checkpoint. looking at
> timestamp of current/fsimage.ckpt at primary namenode; it doesn't show
> progress in size and timestamp of the file is days ago.
>
> already tried to clean the current in snn and restart secondarynamenode
> process, but SNN still stop at the same spot even thought the snn process
> is still exist.
>
>
> -P
>