Posted to user@hadoop.apache.org by gurmukh singh <gu...@yahoo.com.INVALID> on 2022/03/01 03:01:12 UTC

Re: Client to namenode Socket timeout exception - connection-pending

 Are you sure that there are no JVM pauses?
Hopefully, you are not using CMS, as the JVM heap is much larger than 32 GB.
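
A quick way to confirm which collector and heap size the NameNode JVM is actually running with, assuming jcmd from the same JDK is on the PATH and <nn-pid> is the NameNode process id, could be:

$ jcmd <nn-pid> VM.flags | tr ' ' '\n' | grep -E 'UseConcMarkSweepGC|UseG1GC|MaxHeapSize'
$ ps -o args= -p <nn-pid> | tr ' ' '\n' | grep -E '^-Xmx|^-XX:\+Use'   # same information from the launch command line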

    On Monday, 28 February, 2022, 10:20:56 pm GMT+11, Tale Hive <ta...@gmail.com> wrote:  
 
 Re.
As a note, I managed to get one jstack from the standby namenode when the problem occurred.
Here was the state of the listener thread for port 8020:

"IPC Server listener on 8020" #122 daemon prio=5 os_prio=0 tid=0x00007f8a0e84d000 nid=0x6a07e waiting on condition [0x00007f7647ae5000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f7a81525618> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:350)
        at org.apache.hadoop.ipc.Server$Listener$Reader.addConnection(Server.java:1135)
        at org.apache.hadoop.ipc.Server$Listener.doAccept(Server.java:1236)
        at org.apache.hadoop.ipc.Server$Listener.run(Server.java:1167)

   Locked ownable synchronizers:
        - None
At this time, the listenQueue was full and the number of "SYNs to LISTEN sockets dropped" increased by 6,000:

25-02-2022 10:18:06:        568780747 SYNs to LISTEN sockets dropped
(...)
25-02-2022 10:18:24:        568786673 SYNs to LISTEN sockets dropped
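
For reference, a simple loop that produces this kind of timestamped sample of the kernel counter, assuming netstat -s is available on the namenode host, might look like:

$ while true; do printf '%s: ' "$(date '+%d-%m-%Y %H:%M:%S')"; netstat -s | grep 'SYNs to LISTEN sockets dropped'; sleep 2; done   # sample every 2 seconds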

I don't find anything about 0x00007f7a81525618 in the jstack, unfortunately.
It seems the listener thread is waiting for something, but I don't know for what at the moment.
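
Since addConnection() does a put() into one reader's pending-connections queue, the "Socket Reader" threads in the same dump are probably what the listener was waiting on; assuming the dump was saved as namenode.jstack, something like this would show their state:

$ grep -c '"Socket Reader' namenode.jstack            # how many reader threads the server has
$ grep -A 12 '"Socket Reader' namenode.jstack | less  # what each reader was doing at that moment
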
Best regards.
T@le

 




On Mon, 28 Feb 2022 at 11:52, Tale Hive <ta...@gmail.com> wrote:

Hello Gurmukh Singh.
Thank you for your answers.


Why 75GB heap size for NN? Are you running a very large cluster?
50 GB of heap used? Can you check whether you are talking about the NN heap itself or about the total mem used on the server?
50GB approx means 200 million blocks? Do you have that many?

I have ~150 million blocks/files and I set up this heap following the recommendations here:
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.3.0/bk_installing_manually_book/content/ref-80953924-1cbf-4655-9953-1e744290a6c3.1.html

The formula is 20 X log base2(n), where n is the number of nodes.
So, if you have a thousand nodes we keep it to 200 (20 X log2(1024) = 200), and then approx 20 threads per thousand nodes.

I have 600 datanodes, which would normally put me at 20 * log2(600) ≈ 185 threads for the ClientRPC server (the one which listens on port 8020).
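
For the record, the same arithmetic can be checked on the command line (this just reproduces the formula above for 600 nodes), and hdfs getconf shows the value the configuration actually resolves to, assuming hdfs-site.xml on that host carries it:

$ awk 'BEGIN { printf "%.0f\n", 20 * log(600) / log(2) }'
185
$ hdfs getconf -confKey dfs.namenode.handler.count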


$ sysctl -n net.core.somaxconn

$ sysctl -n net.ipv4.tcp_max_syn_backlog

$ sysctl -n net.core.netdev_max_backlog




net.core.somaxconn= 8432
net.ipv4.tcp_max_syn_backlog = 4096
net.core.netdev_max_backlog = 2000
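
Note that net.core.somaxconn only caps the backlog the server asks for; what the 8020 listener actually got can be read directly from the socket, since for a LISTEN socket ss reports the current accept-queue fill in Recv-Q and the configured backlog in Send-Q:

$ ss -lnt '( sport = :8020 )'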



$ netstat -an | grep -c SYN_RECV


$ netstat -an | egrep -v "MYIP.(PORTS|IN|LISTEN)"  | wc -l


I'll check again and get you more information.


What do you see in the JN logs? And what about ZK logs? Any logs in NN or ZK along the lines of "Slow sync"?





Didn't check these logs, going to check them and get back to you.


What is the ZK heap?



Zookeeper heap is 4 GB.





Disk latency
Heap
maxClientCnxns=800 (at least; as you have 600 nodes, you are expecting a high job workload)
jute.maxbuffer=1GB (by default it is very low, especially in a kerberized env, it must be bumped up). This setting is not there in HDP by default; you have to put it under custom-zoo.cfg.


I'm going to check this also.


If you can send me the NN, JN, ZK logs; more than happy to look into it.

Yes, I can; I just need time to anonymize everything.


Thanks again for your help.

Best regards.

T@le





On Thu, 24 Feb 2022 at 21:28, gurmukh singh <gu...@yahoo.com> wrote:

 Also, as you are using hive/beeline, you can fetch all the config as:

beeline -u "JDBC URL to connect to HS2 " --outputformat=tsv2 -e 'set -v' > /tmp/BeelineSet.out

Please attach the BeelineSet.out

    On Friday, 25 February, 2022, 07:15:51 am GMT+11, gurmukh singh <gu...@yahoo.com.invalid> wrote:  
 
  on ZK side
Important things:
Disk latency
Heap
maxClientCnxns=800 (at least; as you have 600 nodes, you are expecting a high job workload)
jute.maxbuffer=1GB (by default it is very low, especially in a kerberized env, it must be bumped up). This setting is not there in HDP by default; you have to put it under custom-zoo.cfg.
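
One low-impact way to watch the first two items and the connection count against maxClientCnxns is the ZooKeeper four-letter words, assuming they are whitelisted (4lw.commands.whitelist on newer ZK releases) and nc is installed:

$ echo stat | nc <zk-host> 2181          # latency min/avg/max, outstanding requests, connection count
$ echo cons | nc <zk-host> 2181 | wc -l  # rough number of client connections on this server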


If you can send me the NN, JN, ZK logs; more than happy to look into it.


   On Friday, 25 February, 2022, 06:59:17 am GMT+11, gurmukh singh <gu...@yahoo.com.invalid> wrote:  
 
  @Tale Hive, you provided the details in the first email; I missed it.
Can you provide me the output of the commands below from the Namenode:

$ sysctl -n net.core.somaxconn

$ sysctl -n net.ipv4.tcp_max_syn_backlog

$ sysctl -n net.core.netdev_max_backlog


$ netstat -an | grep -c SYN_RECV

$ netstat -an | egrep -v "MYIP.(PORTS|IN|LISTEN)"  | wc -l


What do you see in the JN logs? And what about ZK logs? Any logs in NN or ZK along the lines of "Slow sync"?
What is the ZK heap?



    On Friday, 25 February, 2022, 06:42:31 am GMT+11, gurmukh singh <gu...@yahoo.com> wrote:  
 
  I checked the heap of the namenode and there is no problem (I have 75 GB of max heap, I'm around 50 used GB).    
    Why 75GB heap size for NN? Are you running a very large cluster?
    50 GB of heap used? Can you check whether you are talking about the NN heap itself or about the total mem used on the server?
    50GB approx means 200 million blocks? Do you have that many?

I checked the threads of the clientRPC for the namenode and I'm at 200, which respects the recommendations from the hadoop operations book.
    The formula is 20 X log base2(n), where n is the number of nodes.
    So, if you have a thousand nodes we keep it to 200 (20 X log2(1024) = 200), and then approx 20 threads per thousand nodes.
I have serviceRPC enabled to prevent any problem which could be coming from datanodes or ZKFC.

    On Thursday, 24 February, 2022, 12:19:51 am GMT+11, Tale Hive <ta...@gmail.com> wrote:  
 
 Hello.
According to what I saw this morning, I can see that I am in fact in the first situation:
   - Client sends one packet with flag SYN to namenode
   - Namenode sends one packet with flags SYN, ACK to the client
   - Client sends n packets with flags PSH, ACK to the namenode, for each subfolder
   - Namenode sends n packets PSH, ACK to the client, for the content of each subfolder
So the number of (PSH, ACK) packets from the tcpdump pcap file is not what is filling the accept queue of port 8020 ClientRPC server on Namenode.
I'm going to focus on checking the packets with the SYN flag which arrive at the namenode.
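
Hedged tcpdump filters of roughly this shape should separate the connection-opening SYNs from the SYN-ACK replies, assuming the capture was saved as nn8020.pcap:

$ tcpdump -nn -r nn8020.pcap 'tcp dst port 8020 and tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn' | wc -l             # SYNs arriving at the namenode
$ tcpdump -nn -r nn8020.pcap 'tcp src port 8020 and tcp[tcpflags] & (tcp-syn|tcp-ack) == (tcp-syn|tcp-ack)' | wc -l   # SYN-ACKs the namenode sent back
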
After that, because the jstack provokes an active namenode failover, I don't have many more leads to follow, except increasing the listenQueue again, which would mitigate the problem, not solve it.
Best regards.
T@le


On Wed, 23 Feb 2022 at 11:46, Tale Hive <ta...@gmail.com> wrote:

Hello guys.
Still investigating the tcpdump. I don't see a lot of packets with the flag SYN when the listenQueue is full. 
What I see is a lot of packets with the flag "PSH, ACK" with data inside like this:
getListing.org.apache.hadoop.hdfs.protocol.ClientProtocol
/apps/hive/warehouse/<mydb>.db/<mytable>/<mypartition>
It makes me wonder, when a client performs an hdfs dfs -ls -R <HDFS_PATH>, how many SYN packets will it send to the namenode? One in total, or one per subfolder?
Let's say I have "n" subfolders inside <HDFS_PATH>. Will we have this situation:
- Client sends one SYN packet to the Namenode
- Namenode sends one SYN-ACK packet to the client
- Client sends n ACK or (PSH, ACK) packets to the Namenode
Or this situation:
- Client sends n SYN packets to the Namenode
- Namenode sends n SYN-ACK packets to the client
- Client sends n ACK or (PSH, ACK) packets

Would it mean that an hdfs recursive listing on a path with a lot of subfolders could harm the other clients by sending too many packets to the namenode?

About the jstack, I tried it on the namenode JVM but it provoked a failover, as the namenode was not answering at all (in particular, no answer to ZKFC), and the jstack never ended; I had to kill it.
I don't know if a kill -3 or a jstack -F could help, but at least jstack -F contains less valuable information.
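
For what it's worth, two slightly gentler ways to get the same dump, assuming <nn-pid> is the NameNode pid (both still need the JVM to reach a safepoint, so they may also hang while the problem is ongoing):

$ kill -3 <nn-pid>                                  # SIGQUIT: the JVM writes the thread dump to its own stdout (.out file), nothing attaches from outside
$ jcmd <nn-pid> Thread.print > /tmp/nn-threads.txt  # same content as jstack, via the attach mechanism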

T@le

On Tue, 22 Feb 2022 at 10:29, Amith sha <am...@gmail.com> wrote:

If TCP errors occur, then you need to check the network metrics. Yes, tcpdump can help you.


Thanks & Regards
Amithsha

On Tue, Feb 22, 2022 at 1:29 PM Tale Hive <ta...@gmail.com> wrote:

Hello !

@Amith sha 
I also checked the system metrics; nothing wrong in CPU, RAM or IO.
The only thing I found was these TCP errors (ListenDrop).
@HK
I'm monitoring a lot of JVM metrics like this one: "UnderReplicatedBlocks" in the bean "Hadoop:service=NameNode,name=FSNamesystem". And I found no under-replicated blocks when the timeout problem occurs, unfortunately.
Thanks for your advice; in addition to the tcpdump, I'll perform some jstacks to see if I can find what the ipc handlers are doing.
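
Those FSNamesystem values can also be sampled over HTTP without touching the JVM from outside, assuming the NameNode web UI listens on port 50070 and the /jmx servlet is reachable without SPNEGO:

$ curl -s 'http://<active-namenode-hostname>:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' | grep -E 'UnderReplicatedBlocks|BlocksTotal|FilesTotal'
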
Best regards.
T@le






On Tue, 22 Feb 2022 at 04:30, HK <he...@gmail.com> wrote:

Hi Tale,
Could you please take a thread dump of the namenode process? Could you please check what the ipc handlers are doing?
We faced a similar issue when under-replication was high in the cluster, due to the filesystem writeLock.
On Tue, 22 Feb 2022, 8:37 am Amith sha, <am...@gmail.com> wrote:

Check your system metrics too. 
On Mon, Feb 21, 2022, 10:52 PM Tale Hive <ta...@gmail.com> wrote:

Yeah, the next step is for me to perform a tcpdump just when the problem occurs. I want to know if my namenode does not accept connections because it freezes for some reason, or because there are too many connections at a time.
My delay is far worse than 2s; sometimes an hdfs dfs -ls -d /user/<my-user> takes 20s or 43s, and rarely it is even longer than 1 minute. And during this time, CallQueue is OK, Heap is OK; I don't find any metric which could show me a problem inside the namenode JVM.
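
A crude way to record how long each listing takes, every 10 seconds, assuming the hdfs CLI is configured on the client host, could be:

$ while true; do start=$(date +%s); hdfs dfs -ls -d /user/<my-user> > /dev/null; echo "$(date '+%F %T') took $(( $(date +%s) - start ))s"; sleep 10; done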

Best regards.
T@le

On Mon, 21 Feb 2022 at 16:32, Amith sha <am...@gmail.com> wrote:

If you are still concerned about the delay of > 2 s, then you need to benchmark with and without load. It will help to find the root cause of the problem.
On Mon, Feb 21, 2022, 1:52 PM Tale Hive <ta...@gmail.com> wrote:

Hello Amith.
Hm, not a bad idea. If I increase the size of the listenQueue and if I increase the timeout, the combination of both may mitigate the problem more than just increasing the listenQueue size. It won't solve the problem of acceptance speed, but it could help.
Thanks for the suggestion !
T@le
On Mon, 21 Feb 2022 at 02:33, Amith sha <am...@gmail.com> wrote:

org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. 
The connection timed out after 20000 ms. I suspect this value is very low for a namenode with 75 GB of heap usage. Can you increase the value to 5 sec and check the connection? To increase the value, modify this property:
ipc.client.rpc-timeout.ms - core-site.xml (if not present, add it to core-site.xml)
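
To check what the client currently resolves for that key, and to try a larger value on a single command without editing core-site.xml, hedged examples could be (note the 20000 ms in the stack trace matches the default of ipc.client.connect.timeout, so that key may be the one that actually needs raising):

$ hdfs getconf -confKey ipc.client.rpc-timeout.ms
$ hdfs dfs -D ipc.client.rpc-timeout.ms=60000 -ls -d /user/<my-user>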


Thanks & Regards
Amithsha

On Fri, Feb 18, 2022 at 9:17 PM Tale Hive <ta...@gmail.com> wrote:

Hello Tom.
Sorry for my absence of answers; I don't know why Gmail puts your mail into spam -_-.
To answer you :   
   - The metrics callQueueLength, avgQueueTime, avgProcessingTime and GC metric are all OK
   - Threads are plenty sufficient (I can see the metrics for them too, and I am below 200, the number I have for the 8020 RPC server)
Did you see my other answers about this problem? I would be interested to have your opinion about that!
Best regards.
T@le


On Tue, 15 Feb 2022 at 02:16, tom lee <to...@gmail.com> wrote:

It might be helpful to analyze namenode metrics and logs.

What about some key metrics? Examples are callQueueLength, avgQueueTime, avgProcessingTime and GC metrics.

In addition, is the number of threads(dfs.namenode.service.handler.count) in the namenode sufficient? 

Hopefully this will help.

Best regards.
Tom

Tale Hive <ta...@gmail.com> wrote on Mon, 14 Feb 2022 at 23:57:

Hello.
I encounter a strange problem with my namenode. I have the following architecture:
- Two namenodes in HA
- 600 datanodes
- HDP 3.1.4
- 150 million files and folders

Sometimes, when I query the namenode with the hdfs client, I get a timeout error like this:
hdfs dfs -ls -d /user/myuser
22/02/14 15:07:44 INFO retry.RetryInvocationHandler: org.apache.hadoop.net.ConnectTimeoutException: Call From <my-client-hostname>/<my-client-ip> to <active-namenode-hostname>:8020 failed on socket timeout exception: 
  org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=<active-namenode-hostname>/<active-namenode-ip>:8020]; 
  For more details see:  http://wiki.apache.org/hadoop/SocketTimeout, 
while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over <active-namenode-hostname>/<active-namenode-ip>:8020 after 2 failover attempts. Trying to failover after sleeping for 2694ms. 

I checked the heap of the namenode and there is no problem (I have 75 GB of max heap, and around 50 GB used).
I checked the threads of the clientRPC for the namenode and I'm at 200, which respects the recommendations from the hadoop operations book.
I have serviceRPC enabled to prevent any problem which could be coming from datanodes or ZKFC.
General resources seem OK; CPU usage is pretty fine, same for memory, network and IO.
No firewall is enabled on my namenodes or my client.

I was wondering what could cause this problem, please ?
Thank you in advance for your help !

Best regards.
T@le












        

  

Re: Client to namenode Socket timeout exception - connection-pending

Posted by gurmukh singh <gu...@yahoo.com.INVALID>.
 Increasing the listen queue will delay the problem but will not solve it. We need to find the root cause. Did you raise a case with Cloudera Support?

    On Thursday, 17 March, 2022, 07:35:03 pm GMT+11, Tale Hive <ta...@gmail.com> wrote:  
 
 Hello everyone.
The increase to 8192 for the listen queue makes the number of TCP listenDrops decrease, but I can still see them in my monitoring tool.
Furthermore, I can see sometimes that my hdfs dfs -ls command executed every 10s is still slow (it can still go from 2s to 2 min).

I checked again the number of open connections on port 8020 and I can see it begins to decrease before the listenDrops occur, which confirms again that the problem comes from the namenode JVM.

Apart from jstack, is there a way to see in real time what is going on inside the JVM, without impacting its performance?
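
One option that does not pause the JVM at all is to poll the RPC-layer metrics the NameNode already exposes through its /jmx servlet; assuming the web UI is on port 50070, the bean for the client port is named after the port number:

$ curl -s 'http://<namenode-host>:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020' | grep -E 'CallQueueLength|NumOpenConnections|RpcQueueTimeAvgTime|RpcProcessingTimeAvgTime'
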
best regards.
T@le


On Thu, 10 Mar 2022 at 12:05, Tale Hive <ta...@gmail.com> wrote:

Hello Gurmukh Singh.
Hm, there is this "Technical Preview" notice in the documentation you mentioned.



In addition, I analyzed some TCPDUMP pcap files (I took "1 minute" samples when the problem occurred and when everything was OK) and I saw that the problem is not linked to the number of connection requests.
This problem occurs sometimes even when the number of connection requests is not big. I see SYN packets without SYN-ACK answers when this problem occurs.

Furthermore, it is impossible to analyze thread dumps, because when we perform a jstack on the standby namenode, the jstack does not respond until the peak of listenDrops is over. So the previous jstack I got didn't help me to understand what is going on from the Namenode side.

Because of this information, as it is impossible to know what the namenode is doing when the problem occurs, and as we have a lot of free memory on the namenode host, we increased the listenQueue again from 4096 to 8192 and we are monitoring the result.
We still see some listenDrop peaks, but they are a lot less frequent.
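
For completeness, the accept backlog of the 8020 listener itself comes from ipc.server.listen.queue.size (default 128) and is clamped by net.core.somaxconn, so both have to be raised together; a quick check, assuming core-site.xml on the NameNode host carries the value:

$ hdfs getconf -confKey ipc.server.listen.queue.size
$ sysctl -n net.core.somaxconn
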
Best regards.
T@le


On Tue, 8 Mar 2022 at 05:23, gurmukh singh <gu...@yahoo.com> wrote:

 "G1GC is still technical preview with namenode, isn't it ? No -> not for HDP 3.x

"We have been using it with HBase on HDP and Cloudera clusters for some time and it works well.

It is not in preview for HDP 3.1.X:
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/data-storage/content/recommended_settings_for_g1gc.html
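
A minimal sketch of what switching the NameNode to G1 could look like, assuming the HADOOP_NAMENODE_OPTS hook in hadoop-env.sh is still where NameNode JVM flags are set; these flags are illustrative only (not taken from that document), and the pause target and thread counts would need tuning for a ~75 GB heap:

export HADOOP_NAMENODE_OPTS="-Xms75g -Xmx75g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5 ${HADOOP_NAMENODE_OPTS}"
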
    On Tuesday, 1 March, 2022, 08:15:12 pm GMT+11, Tale Hive <ta...@gmail.com> wrote:  
 
 Hello.

Are you sure that there are no JVM pauses? 


Yes, I check multiple metrics (used heap and max heap, GcTotalExtraSleepTime, and also the logs from the namenode). None of these metrics indicates a JVM pause when the problem occurs.

Here is an example from this morning; you can see the listenDrops peak on the active namenode (yellow) from 09:27:00 to 09:30:00.


And here are the GC logs at this period:

2022-03-01T09:26:57.647+0100: 512535.780: [GC (CMS Initial Mark) [1 CMS-initial-mark: 50350809K(70391808K)] 58394222K(79441920K), 0.3132453 secs] [Times: user=4.92 sys=0.00, real=0.31 secs]
2022-03-01T09:26:57.961+0100: 512536.093: [CMS-concurrent-mark-start]
2022-03-01T09:26:58.262+0100: 512536.395: [GC (Allocation Failure) 2022-03-01T09:26:58.262+0100: 512536.395: [ParNew: 8094551K->47062K(9050112K), 0.1147194 secs] 58445361K->50399530K(79441920K), 0.1151125 secs] [Times: user=1.69 sys=0.00, real=0.12 secs]
2022-03-01T09:27:33.903+0100: 512572.036: [CMS-concurrent-mark: 35.668/35.942 secs] [Times: user=159.03 sys=19.73, real=35.94 secs]
2022-03-01T09:27:33.903+0100: 512572.036: [CMS-concurrent-preclean-start]
2022-03-01T09:27:34.414+0100: 512572.547: [CMS-concurrent-preclean: 0.495/0.511 secs] [Times: user=0.61 sys=0.36, real=0.51 secs]
2022-03-01T09:27:34.414+0100: 512572.547: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 2022-03-01T09:27:39.517+0100: 512577.650: [CMS-concurrent-abortable-preclean: 5.083/5.103 secs] [Times: user=6.99 sys=2.78, real=5.11 secs]
2022-03-01T09:27:39.526+0100: 512577.659: [GC (CMS Final Remark) [YG occupancy: 4405873 K (9050112 K)]2022-03-01T09:27:39.526+0100: 512577.659: [Rescan (parallel) , 0.3355048 secs]2022-03-01T09:27:39.862+0100: 512577.995: [weak refs processing, 0.0000413 secs]2022-03-01T09:27:39.862+0100: 512577.995: [class unloading, 0.0482024 secs]2022-03-01T09:27:39.910+0100: 512578.043: [scrub symbol table, 0.0147892 secs]2022-03-01T09:27:39.925+0100: 512578.058: [scrub string table, 0.0018140 secs][1 CMS-remark: 50352467K(70391808K)] 54758340K(79441920K), 0.4007414 secs] [Times: user=2.50 sys=2.90, real=0.40 secs]
2022-03-01T09:27:39.927+0100: 512578.060: [CMS-concurrent-sweep-start]
2022-03-01T09:27:57.615+0100: 512595.748: [CMS-concurrent-sweep: 17.672/17.688 secs] [Times: user=23.00 sys=11.53, real=17.68 secs]
2022-03-01T09:27:57.615+0100: 512595.748: [CMS-concurrent-reset-start]
2022-03-01T09:27:57.802+0100: 512595.935: [CMS-concurrent-reset: 0.187/0.187 secs] [Times: user=0.37 sys=0.00, real=0.19 secs]
2022-03-01T09:27:59.811+0100: 512597.944: [GC (CMS Initial Mark) [1 CMS-initial-mark: 50343447K(70391808K)] 56591863K(79441920K), 0.2328587 secs] [Times: user=3.62 sys=0.00, real=0.23 secs]
2022-03-01T09:28:00.044+0100: 512598.177: [CMS-concurrent-mark-start]
2022-03-01T09:28:18.515+0100: 512616.648: [GC (Allocation Failure) 2022-03-01T09:28:18.516+0100: 512616.648: [ParNew: 8091606K->44561K(9050112K), 0.1131251 secs] 58435054K->50390162K(79441920K), 0.1135414 secs] [Times: user=1.63 sys=0.00, real=0.12 secs]
2022-03-01T09:28:35.363+0100: 512633.496: [CMS-concurrent-mark: 35.063/35.319 secs] [Times: user=177.63 sys=5.00, real=35.32 secs]
2022-03-01T09:28:35.363+0100: 512633.496: [CMS-concurrent-preclean-start]
2022-03-01T09:28:35.919+0100: 512634.052: [CMS-concurrent-preclean: 0.536/0.555 secs] [Times: user=1.00 sys=0.66, real=0.56 secs]
2022-03-01T09:28:35.919+0100: 512634.052: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 2022-03-01T09:28:41.059+0100: 512639.191: [CMS-concurrent-abortable-preclean: 5.123/5.140 secs] [Times: user=8.80 sys=4.16, real=5.14 secs]
2022-03-01T09:28:41.067+0100: 512639.200: [GC (CMS Final Remark) [YG occupancy: 3604963 K (9050112 K)]2022-03-01T09:28:41.068+0100: 512639.200: [Rescan (parallel) , 0.1591429 secs]2022-03-01T09:28:41.227+0100: 512639.360: [weak refs processing, 0.0000823 secs]2022-03-01T09:28:41.227+0100: 512639.360: [class unloading, 0.0551753 secs]2022-03-01T09:28:41.282+0100: 512639.415: [scrub symbol table, 0.0148688 secs]2022-03-01T09:28:41.297+0100: 512639.430: [scrub string table, 0.0015506 secs][1 CMS-remark: 50345600K(70391808K)] 53950564K(79441920K), 0.2311903 secs] [Times: user=2.05 sys=0.55, real=0.23 secs]
2022-03-01T09:28:41.299+0100: 512639.432: [CMS-concurrent-sweep-start]
2022-03-01T09:28:57.951+0100: 512656.083: [GC (Allocation Failure) 2022-03-01T09:28:57.951+0100: 512656.084: [ParNew: 8089105K->57885K(9050112K), 0.0995224 secs] 58427961K->50399315K(79441920K), 0.0999603 secs] [Times: user=0.83 sys=0.56, real=0.10 secs]
2022-03-01T09:29:07.938+0100: 512666.071: [GC (Allocation Failure) 2022-03-01T09:29:07.939+0100: 512666.071: [ParNew: 8102429K->87814K(9050112K), 0.1961157 secs] 58440735K->50428503K(79441920K), 0.1965228 secs] [Times: user=2.94 sys=0.00, real=0.19 secs]
2022-03-01T09:29:08.370+0100: 512666.503: [CMS-concurrent-sweep: 26.577/27.071 secs] [Times: user=85.98 sys=29.07, real=27.07 secs]
2022-03-01T09:29:08.370+0100: 512666.503: [CMS-concurrent-reset-start]
2022-03-01T09:29:08.712+0100: 512666.845: [CMS-concurrent-reset: 0.342/0.342 secs] [Times: user=2.13 sys=0.00, real=0.34 secs]
2022-03-01T09:29:10.721+0100: 512668.854: [GC (CMS Initial Mark) [1 CMS-initial-mark: 50340545K(70391808K)] 52645683K(79441920K), 0.0927209 secs] [Times: user=1.42 sys=0.00, real=0.09 secs]
2022-03-01T09:29:10.814+0100: 512668.947: [CMS-concurrent-mark-start]
2022-03-01T09:29:16.975+0100: 512675.108: [GC (Allocation Failure) 2022-03-01T09:29:16.975+0100: 512675.108: [ParNew: 8132358K->90327K(9050112K), 0.0810929 secs] 58472904K->50433071K(79441920K), 0.0815880 secs] [Times: user=1.15 sys=0.00, real=0.09 secs]
2022-03-01T09:29:24.290+0100: 512682.423: [GC (Allocation Failure) 2022-03-01T09:29:24.290+0100: 512682.423: [ParNew: 8134871K->101007K(9050112K), 0.0819188 secs] 58477615K->50446660K(79441920K), 0.0823128 secs] [Times: user=0.79 sys=0.33, real=0.08 secs]
2022-03-01T09:29:31.967+0100: 512690.100: [GC (Allocation Failure) 2022-03-01T09:29:31.968+0100: 512690.100: [ParNew: 8145551K->85497K(9050112K), 0.0818475 secs] 58491204K->50434672K(79441920K), 0.0822342 secs] [Times: user=0.75 sys=0.36, real=0.08 secs]
2022-03-01T09:29:38.683+0100: 512696.815: [GC (Allocation Failure) 2022-03-01T09:29:38.683+0100: 512696.816: [ParNew: 8130041K->96086K(9050112K), 0.0877875 secs] 58479216K->50447377K(79441920K), 0.0881567 secs] [Times: user=1.26 sys=0.00, real=0.09 secs]
2022-03-01T09:29:48.960+0100: 512707.092: [CMS-concurrent-mark: 37.712/38.145 secs] [Times: user=320.76 sys=50.98, real=38.15 secs]
2022-03-01T09:29:48.960+0100: 512707.093: [CMS-concurrent-preclean-start]
2022-03-01T09:29:49.271+0100: 512707.404: [GC (Allocation Failure) 2022-03-01T09:29:49.271+0100: 512707.404: [ParNew: 8140630K->83188K(9050112K), 0.2629305 secs] 58491921K->50446168K(79441920K), 0.2633249 secs] [Times: user=0.44 sys=3.02, real=0.26 secs]
2022-03-01T09:29:50.420+0100: 512708.552: [CMS-concurrent-preclean: 1.089/1.460 secs] [Times: user=5.27 sys=5.95, real=1.46 secs]
2022-03-01T09:29:50.420+0100: 512708.553: [CMS-concurrent-abortable-preclean-start]
2022-03-01T09:29:54.266+0100: 512712.398: [CMS-concurrent-abortable-preclean: 3.784/3.846 secs] [Times: user=17.07 sys=11.10, real=3.85 secs]
2022-03-01T09:29:54.290+0100: 512712.422: [GC (CMS Final Remark) [YG occupancy: 4363915 K (9050112 K)]2022-03-01T09:29:54.290+0100: 512712.422: [Rescan (parallel) , 0.2017093 secs]2022-03-01T09:29:54.491+0100: 512712.624: [weak refs processing, 0.0000442 secs]2022-03-01T09:29:54.491+0100: 512712.624: [class unloading, 0.0467518 secs]2022-03-01T09:29:54.538+0100: 512712.671: [scrub symbol table, 0.0152955 secs]2022-03-01T09:29:54.554+0100: 512712.686: [scrub string table, 0.0014710 secs][1 CMS-remark: 50362980K(70391808K)] 54726895K(79441920K), 0.2655851 secs] [Times: user=2.69 sys=0.60, real=0.26 secs]
2022-03-01T09:29:54.556+0100: 512712.688: [CMS-concurrent-sweep-start]
2022-03-01T09:29:59.109+0100: 512717.242: [GC (Allocation Failure) 2022-03-01T09:29:59.109+0100: 512717.242: [ParNew: 8127732K->87947K(9050112K), 0.0986180 secs] 58486527K->50450553K(79441920K), 0.0990293 secs] [Times: user=0.97 sys=0.35, real=0.10 secs]
2022-03-01T09:30:10.092+0100: 512728.225: [GC (Allocation Failure) 2022-03-01T09:30:10.092+0100: 512728.225: [ParNew: 8132491K->90112K(9050112K), 0.0823441 secs] 58490044K->50450110K(79441920K), 0.0828723 secs] [Times: user=1.11 sys=0.00, real=0.08 secs]
2022-03-01T09:30:18.262+0100: 512736.395: [GC (Allocation Failure) 2022-03-01T09:30:18.262+0100: 512736.395: [ParNew: 8134656K->97101K(9050112K), 0.0817780 secs] 58490119K->50455026K(79441920K), 0.0821940 secs] [Times: user=1.13 sys=0.00, real=0.08 secs]
2022-03-01T09:30:21.981+0100: 512740.114: [CMS-concurrent-sweep: 27.023/27.425 secs] [Times: user=134.40 sys=34.43, real=27.42 secs]
2022-03-01T09:30:21.981+0100: 512740.114: [CMS-concurrent-reset-start]
2022-03-01T09:30:22.164+0100: 512740.297: [CMS-concurrent-reset: 0.183/0.183 secs] [Times: user=0.50 sys=0.00, real=0.18 secs]
2022-03-01T09:30:24.171+0100: 512742.304: [GC (CMS Initial Mark) [1 CMS-initial-mark: 50356777K(70391808K)] 54936817K(79441920K), 0.2049809 secs] [Times: user=3.17 sys=0.00, real=0.21 secs]

Knowing that the following events are stop-the-world with CMS (initial mark and remark):
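
A hedged way to pull just those stop-the-world phases (and the ParNew young collections) out of the log above, assuming it was saved as namenode-gc.log, and sort them by real pause time:

$ grep -E 'CMS Initial Mark|CMS Final Remark|ParNew' namenode-gc.log | grep -o 'real=[0-9.]* secs' | sort -t= -k2 -rn | head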


Hopefully, you are not using CMS, as the JVM heap is much larger than 32 GB


I am indeed using CMS, because G1GC is still a technical preview with the namenode, isn't it?

Best regards.
T@le


On Tue, 1 Mar 2022 at 04:01, gurmukh singh <gu...@yahoo.com> wrote:

 Are you sure that there are no JVM pauses?
Hopefully, you are not using CMS, as the JVM heap is much larger than 32 GB.


Re: Client to namenode Socket timeout exception - connection-pending

Posted by Tale Hive <ta...@gmail.com>.
Hello everyone.

The increase to 8192 for the listen queue makes the number of TCP
listenDrops decrease, but I can still see them in my monitoring tool.
Furthermore, I can see sometimes that my hdfs dfs -ls command executed
every 10s is still slow (it can still go from 2s to 2 min).

I checked again the number of open connections on port 8020 and I can see
it begins to decrease before the listenDrops occur, which confirms again
that the problem comes from the namenode JVM.

Apart from jstack, is there a way to see in real time what is going on
inside the JVM, without impacting its performance?

Best regards.

T@le


>> like this :
>> getListing.org.apache.hadoop.hdfs.protocol.ClientProtocol
>> /apps/hive/warehouse/<mydb>.db/<mytable>/<mypartition>
>>
>> It makes me wonder, when a client perform an hdfs dfs -ls -R <HDFS_PATH>,
>> how many SYN packets will it send to the namenode ? One in total or one by
>> subfolder ?
>> Let's say I have "n" subfolders inside <HDFS_PATH>. Will we have this
>> situation :
>> - Client sends one SYN packet to Namenode
>> - Namenode sends one SYN-ACK packets to client
>> - Client sends n ACK or (PSH, ACK) packets to Namenode
>>
>> Or this situation :
>> - Client sends n SYN packet to Namenode
>> - Namenode sends n SYN-ACK packets to client
>> - Client sends n ACK or (PSH, ACK)
>>
>> It would mean an hdfs recursive listing on a path with a lot of
>> subfolders could harm the other clients by sending too many packets to the
>> namenode ?
>>
>> About the jstack, I tried it on the namenode JVM but it provoked a
>> failover, as the namenode was not answering at all (in particular, no
>> answer to ZKFC), and the jstack never ended, I had to kill it.
>> I don't know if a kill -3 or a jstack -F could help, but at least jstack
>> -F contains less valuable information.
>>
>> T@le
>>
>> Le mar. 22 févr. 2022 à 10:29, Amith sha <am...@gmail.com> a écrit :
>>
>> If TCP error occurs then you need to check the network metrics. Yes, TCP
>> DUMP can help you.
>>
>>
>> Thanks & Regards
>> Amithsha
>>
>>
>> On Tue, Feb 22, 2022 at 1:29 PM Tale Hive <ta...@gmail.com> wrote:
>>
>> Hello !
>>
>> @Amith sha <am...@gmail.com>
>> I checked also the system metrics, nothing wrong in CPU, RAM or IO.
>> The only thing I found was these TCP errors (ListenDrop).
>>
>> @HK
>> I'm monitoring a lot of JVM metrics like this one :
>> "UnderReplicatedBlocks" in the bean
>> "Hadoop:service=NameNode,name=FSNamesystem".
>> And I found no under replicated blocks when the problem of timeout
>> occurs, unfortunately.
>> Thanks for you advice, in addition to the tcpdump, I'll perform some
>> jstacks to see if I can find what ipc handlers are doing.
>>
>> Best regards.
>>
>> T@le
>>
>>
>>
>>
>>
>>
>> Le mar. 22 févr. 2022 à 04:30, HK <he...@gmail.com> a écrit :
>>
>> Hi Tape,
>> Could you please thread dump of namenode process. Could you please check
>> what ipc handlers are doing.
>>
>> We faced similar issue when the under replication is high in the cluster
>> due to filesystem wirteLock.
>>
>> On Tue, 22 Feb 2022, 8:37 am Amith sha, <am...@gmail.com> wrote:
>>
>> Check your system metrics too.
>>
>> On Mon, Feb 21, 2022, 10:52 PM Tale Hive <ta...@gmail.com> wrote:
>>
>> Yeah, next step is for me to perform a tcpdump just when the problem
>> occurs.
>> I want to know if my namenode does not accept connections because it
>> freezes for some reasons or because there is too many connections at a time.
>>
>> My delay if far worse than 2s, sometimes, an hdfs dfs -ls -d
>> /user/<my-user> takes 20s, 43s and rarely it is even bigger than 1 minut.
>> And during this time, CallQueue is OK, Heap is OK, I don't find any
>> metrics which could show me a problem inside the namenode JVM.
>>
>> Best regards.
>>
>> T@le
>>
>> Le lun. 21 févr. 2022 à 16:32, Amith sha <am...@gmail.com> a écrit :
>>
>> If you still concerned about the delay of > 2 s then you need to do
>> benchmark with and without load. To find the root cause of the problem it
>> will help.
>>
>> On Mon, Feb 21, 2022, 1:52 PM Tale Hive <ta...@gmail.com> wrote:
>>
>> Hello Amith.
>>
>> Hm, not a bad idea. If I increase the size of the listenQueue and if I
>> increase timeout, the combination of both may mitigate more the problem
>> than just increasing listenQueue size.
>> It won't solve the problem of acceptance speed, but it could help.
>>
>> Thanks for the suggestion !
>>
>> T@le
>>
>> Le lun. 21 févr. 2022 à 02:33, Amith sha <am...@gmail.com> a écrit :
>>
>> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while
>> waiting for channel to be ready for connect.
>> Connection timed out after 20000 milli sec i suspect this value is very
>> low for a namenode with 75Gb of heap usage. Can you increase the value to
>> 5sec and check the connection. To increase the value modify this property
>> ipc.client.rpc-timeout.ms - core-site.xml (If not present then add to
>> the core-site.xml)
>>
>>
>> Thanks & Regards
>> Amithsha
>>
>>
>> On Fri, Feb 18, 2022 at 9:17 PM Tale Hive <ta...@gmail.com> wrote:
>>
>> Hello Tom.
>>
>> Sorry for my absence of answers, I don't know why gmail puts your mail
>> into spam -_-.
>>
>> To answer you :
>>
>>    - The metrics callQueueLength, avgQueueTime, avgProcessingTime and GC
>>    metric are all OK
>>    - Threads are plenty sufficient (I can see the metrics also for them
>>    and I  am below 200, the number I have for 8020 RPC server)
>>
>> Did you see my other answers about this problem ?
>> I would be interested to have your opinion about that !
>>
>> Best regards.
>>
>> T@le
>>
>>
>> Le mar. 15 févr. 2022 à 02:16, tom lee <to...@gmail.com> a écrit :
>>
>> It might be helpful to analyze namenode metrics and logs.
>>
>> What about some key metrics? Examples are callQueueLength, avgQueueTime,
>> avgProcessingTime and GC metrics.
>>
>> In addition, is the number of threads(dfs.namenode.service.handler.count)
>> in the namenode sufficient?
>>
>> Hopefully this will help.
>>
>> Best regards.
>> Tom
>>
>> Tale Hive <ta...@gmail.com> 于2022年2月14日周一 23:57写道:
>>
>> Hello.
>>
>> I encounter a strange problem with my namenode. I have the following
>> architecture :
>> - Two namenodes in HA
>> - 600 datanodes
>> - HDP 3.1.4
>> - 150 millions of files and folders
>>
>> Sometimes, when I query the namenode with the hdfs client, I got a
>> timeout error like this :
>> hdfs dfs -ls -d /user/myuser
>>
>> 22/02/14 15:07:44 INFO retry.RetryInvocationHandler:
>> org.apache.hadoop.net.ConnectTimeoutException: Call From
>> <my-client-hostname>/<my-client-ip> to <active-namenode-hostname>:8020
>> failed on socket timeout exception:
>>   org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout
>> while waiting for channel to be ready for connect. ch :
>> java.nio.channels.SocketChannel[connection-pending
>> remote=<active-namenode-hostname>/<active-namenode-ip>:8020];
>>   For more details see:  http://wiki.apache.org/hadoop/SocketTimeout,
>> while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over
>> <active-namenode-hostname>/<active-namenode-ip>:8020 after 2 failover
>> attempts. Trying to failover after sleeping for 2694ms.
>>
>> I checked the heap of the namenode and there is no problem (I have 75 GB
>> of max heap, I'm around 50 used GB).
>> I checked the threads of the clientRPC for the namenode and I'm at 200
>> which respects the recommandations from hadoop operations book.
>> I have serviceRPC enabled to prevent any problem which could be coming
>> from datanodes or ZKFC.
>> General resources seems OK, CPU usage is pretty fine, same for memory,
>> network or IO.
>> No firewall is enabled on my namenodes nor my client.
>>
>> I was wondering what could cause this problem, please ?
>>
>> Thank you in advance for your help !
>>
>> Best regards.
>>
>> T@le
>>
>>

Re: Client to namenode Socket timeout exception - connection-pending

Posted by Tale Hive <ta...@gmail.com>.
Hello Gurmukh Singh.

Hm, there is a "Technical Preview" notice in the documentation you
mentioned.

[image: image.png]


In addition, I analyzed some TCPDUMP pcap files (I took "1 minute" samples
when the problem occurred and when everything was OK) and I saw that the
problem is not linked to the number of connection requests.
The problem sometimes occurs even when the number of connection requests
is low. I see SYN packets without SYN-ACK answers when the problem
occurs.
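
To separate the two sides of the handshake in the capture, here is a rough
tcpdump sketch (the "any" interface and port 8020 are assumptions to adapt):

$ # pure SYN (SYN set, ACK not set) arriving at the ClientRPC port
$ tcpdump -nn -i any 'dst port 8020 and tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn'
$ # SYN-ACK replies leaving the namenode
$ tcpdump -nn -i any 'src port 8020 and tcp[tcpflags] & (tcp-syn|tcp-ack) == (tcp-syn|tcp-ack)'

Comparing the two streams during a listenDrop peak shows which SYNs never
got a SYN-ACK.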

Furthermore, it is impossible to analyze thread dumps, because when we
run jstack on the standby namenode, it does not respond until the
listenDrop peak is over.
So the previous jstack I got didn't help me understand what is going on
on the Namenode side.
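
One thing worth trying is SIGQUIT instead of jstack, since it does not go
through the dynamic-attach path (the pgrep pattern is an assumption about
how the process was launched):

$ # hadoop scripts usually add -Dproc_namenode to the namenode command line
$ NN_PID=$(pgrep -f 'proc_namenode' | head -1)
$ # the JVM prints the thread dump to its stdout, i.e. the namenode .out file
$ kill -3 "$NN_PID"

It can still stall if the JVM itself is wedged, but it is lighter than a
full jstack attach.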

Given this, as it is impossible to know what the namenode is doing when the
problem occurs, and as we have a lot of free memory on the namenode host,
we increased the listenQueue again from 4096 to 8192 and we are monitoring
the result.
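
To check that the new value is really in effect and to keep sampling the
drop counter, a sketch (as far as I know the Java-side backlog comes from
ipc.server.listen.queue.size in core-site.xml and is capped by
net.core.somaxconn, so please double-check both on your version):

$ # Send-Q of a LISTEN socket is the configured accept-queue size, Recv-Q its current depth
$ ss -lnt 'sport = :8020'
$ # sample the kernel drop counter with a timestamp every 2 seconds
$ while true; do echo "$(date '+%d-%m-%Y %H:%M:%S'): $(netstat -s | grep 'SYNs to LISTEN sockets dropped')"; sleep 2; done >> /tmp/syn_drops.log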

We still see some listenDrop peaks, but they are a lot less frequent.

Best regards.

T@le


Le mar. 8 mars 2022 à 05:23, gurmukh singh <gu...@yahoo.com> a
écrit :

> "G1GC is still technical preview with namenode, isn't it ? No -> not for
> HDP 3.x
>
> "We have been using it with HBase on HDP and Cloudera clusters for some
> time and it works well.
>
> It is not in preview for HDP 3.1.X
>
> https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/data-storage/content/recommended_settings_for_g1gc.html
>
> On Tuesday, 1 March, 2022, 08:15:12 pm GMT+11, Tale Hive <
> tale2.hive@gmail.com> wrote:
>
>
> Hello.
>
> Are you sure that there are no JVM pauses?
>
>
> Yes, I check multiple metrics (used heap and max heap,
> GcTotalExtraSleepTime and also the logs from the namenode). None of these
> metrics indicates a JVM pause when the problem occures.
>
> Here is an example of this morning, you can see listenDrops peak on active
> namenode (yellow) from 09:27:00 to 09:30:00.
> [image: image.png]
>
> And here are the GC logs at this period :
> 2022-03-01T09:26:57.647+0100: 512535.780: [GC (CMS Initial Mark) [1
> CMS-initial-mark: 50350809K(70391808K)] 58394222K(79441920K), 0.3132453
> secs] [Times: user=4.92 sys=0.00, real=0.31 secs]
> 2022-03-01T09:26:57.961+0100: 512536.093: [CMS-concurrent-mark-start]
> 2022-03-01T09:26:58.262+0100: 512536.395: [GC (Allocation Failure)
> 2022-03-01T09:26:58.262+0100: 512536.395: [ParNew:
> 8094551K->47062K(9050112K), 0.1147194 secs]
> 58445361K->50399530K(79441920K), 0.1151125 secs] [Times: user=1.69
> sys=0.00, real=0.12 secs]
> 2022-03-01T09:27:33.903+0100: 512572.036: [CMS-concurrent-mark:
> 35.668/35.942 secs] [Times: user=159.03 sys=19.73, real=35.94 secs]
> 2022-03-01T09:27:33.903+0100: 512572.036: [CMS-concurrent-preclean-start]
> 2022-03-01T09:27:34.414+0100: 512572.547: [CMS-concurrent-preclean:
> 0.495/0.511 secs] [Times: user=0.61 sys=0.36, real=0.51 secs]
> 2022-03-01T09:27:34.414+0100: 512572.547:
> [CMS-concurrent-abortable-preclean-start]
>  CMS: abort preclean due to time 2022-03-01T09:27:39.517+0100: 512577.650:
> [CMS-concurrent-abortable-preclean: 5.083/5.103 secs] [Times: user=6.99
> sys=2.78, real=5.11 secs]
> 2022-03-01T09:27:39.526+0100: 512577.659: [GC (CMS Final Remark) [YG
> occupancy: 4405873 K (9050112 K)]2022-03-01T09:27:39.526+0100: 512577.659:
> [Rescan (parallel) , 0.3355048 secs]2022-03-01T09:27:39.862+0100:
> 512577.995: [weak refs processing, 0.0000413
> secs]2022-03-01T09:27:39.862+0100: 512577.995: [class unloading, 0.0482024
> secs]2022-03-01T09:27:39.910+0100: 512578.043: [scrub symbol table,
> 0.0147892 secs]2022-03-01T09:27:39.925+0100: 512578.058: [scrub string
> table, 0.0018140 secs][1 CMS-remark: 50352467K(70391808K)]
> 54758340K(79441920K), 0.4007414 secs] [Times: user=2.50 sys=2.90, real=0.40
> secs]
> 2022-03-01T09:27:39.927+0100: 512578.060: [CMS-concurrent-sweep-start]
> 2022-03-01T09:27:57.615+0100: 512595.748: [CMS-concurrent-sweep:
> 17.672/17.688 secs] [Times: user=23.00 sys=11.53, real=17.68 secs]
> 2022-03-01T09:27:57.615+0100: 512595.748: [CMS-concurrent-reset-start]
> 2022-03-01T09:27:57.802+0100: 512595.935: [CMS-concurrent-reset:
> 0.187/0.187 secs] [Times: user=0.37 sys=0.00, real=0.19 secs]
> 2022-03-01T09:27:59.811+0100: 512597.944: [GC (CMS Initial Mark) [1
> CMS-initial-mark: 50343447K(70391808K)] 56591863K(79441920K), 0.2328587
> secs] [Times: user=3.62 sys=0.00, real=0.23 secs]
> 2022-03-01T09:28:00.044+0100: 512598.177: [CMS-concurrent-mark-start]
> 2022-03-01T09:28:18.515+0100: 512616.648: [GC (Allocation Failure)
> 2022-03-01T09:28:18.516+0100: 512616.648: [ParNew:
> 8091606K->44561K(9050112K), 0.1131251 secs]
> 58435054K->50390162K(79441920K), 0.1135414 secs] [Times: user=1.63
> sys=0.00, real=0.12 secs]
> 2022-03-01T09:28:35.363+0100: 512633.496: [CMS-concurrent-mark:
> 35.063/35.319 secs] [Times: user=177.63 sys=5.00, real=35.32 secs]
> 2022-03-01T09:28:35.363+0100: 512633.496: [CMS-concurrent-preclean-start]
> 2022-03-01T09:28:35.919+0100: 512634.052: [CMS-concurrent-preclean:
> 0.536/0.555 secs] [Times: user=1.00 sys=0.66, real=0.56 secs]
> 2022-03-01T09:28:35.919+0100: 512634.052:
> [CMS-concurrent-abortable-preclean-start]
>  CMS: abort preclean due to time 2022-03-01T09:28:41.059+0100: 512639.191:
> [CMS-concurrent-abortable-preclean: 5.123/5.140 secs] [Times: user=8.80
> sys=4.16, real=5.14 secs]
> 2022-03-01T09:28:41.067+0100: 512639.200: [GC (CMS Final Remark) [YG
> occupancy: 3604963 K (9050112 K)]2022-03-01T09:28:41.068+0100: 512639.200:
> [Rescan (parallel) , 0.1591429 secs]2022-03-01T09:28:41.227+0100:
> 512639.360: [weak refs processing, 0.0000823
> secs]2022-03-01T09:28:41.227+0100: 512639.360: [class unloading, 0.0551753
> secs]2022-03-01T09:28:41.282+0100: 512639.415: [scrub symbol table,
> 0.0148688 secs]2022-03-01T09:28:41.297+0100: 512639.430: [scrub string
> table, 0.0015506 secs][1 CMS-remark: 50345600K(70391808K)]
> 53950564K(79441920K), 0.2311903 secs] [Times: user=2.05 sys=0.55, real=0.23
> secs]
> 2022-03-01T09:28:41.299+0100: 512639.432: [CMS-concurrent-sweep-start]
> 2022-03-01T09:28:57.951+0100: 512656.083: [GC (Allocation Failure)
> 2022-03-01T09:28:57.951+0100: 512656.084: [ParNew:
> 8089105K->57885K(9050112K), 0.0995224 secs]
> 58427961K->50399315K(79441920K), 0.0999603 secs] [Times: user=0.83
> sys=0.56, real=0.10 secs]
> 2022-03-01T09:29:07.938+0100: 512666.071: [GC (Allocation Failure)
> 2022-03-01T09:29:07.939+0100: 512666.071: [ParNew:
> 8102429K->87814K(9050112K), 0.1961157 secs]
> 58440735K->50428503K(79441920K), 0.1965228 secs] [Times: user=2.94
> sys=0.00, real=0.19 secs]
> 2022-03-01T09:29:08.370+0100: 512666.503: [CMS-concurrent-sweep:
> 26.577/27.071 secs] [Times: user=85.98 sys=29.07, real=27.07 secs]
> 2022-03-01T09:29:08.370+0100: 512666.503: [CMS-concurrent-reset-start]
> 2022-03-01T09:29:08.712+0100: 512666.845: [CMS-concurrent-reset:
> 0.342/0.342 secs] [Times: user=2.13 sys=0.00, real=0.34 secs]
> 2022-03-01T09:29:10.721+0100: 512668.854: [GC (CMS Initial Mark) [1
> CMS-initial-mark: 50340545K(70391808K)] 52645683K(79441920K), 0.0927209
> secs] [Times: user=1.42 sys=0.00, real=0.09 secs]
> 2022-03-01T09:29:10.814+0100: 512668.947: [CMS-concurrent-mark-start]
> 2022-03-01T09:29:16.975+0100: 512675.108: [GC (Allocation Failure)
> 2022-03-01T09:29:16.975+0100: 512675.108: [ParNew:
> 8132358K->90327K(9050112K), 0.0810929 secs]
> 58472904K->50433071K(79441920K), 0.0815880 secs] [Times: user=1.15
> sys=0.00, real=0.09 secs]
> 2022-03-01T09:29:24.290+0100: 512682.423: [GC (Allocation Failure)
> 2022-03-01T09:29:24.290+0100: 512682.423: [ParNew:
> 8134871K->101007K(9050112K), 0.0819188 secs]
> 58477615K->50446660K(79441920K), 0.0823128 secs] [Times: user=0.79
> sys=0.33, real=0.08 secs]
> 2022-03-01T09:29:31.967+0100: 512690.100: [GC (Allocation Failure)
> 2022-03-01T09:29:31.968+0100: 512690.100: [ParNew:
> 8145551K->85497K(9050112K), 0.0818475 secs]
> 58491204K->50434672K(79441920K), 0.0822342 secs] [Times: user=0.75
> sys=0.36, real=0.08 secs]
> 2022-03-01T09:29:38.683+0100: 512696.815: [GC (Allocation Failure)
> 2022-03-01T09:29:38.683+0100: 512696.816: [ParNew:
> 8130041K->96086K(9050112K), 0.0877875 secs]
> 58479216K->50447377K(79441920K), 0.0881567 secs] [Times: user=1.26
> sys=0.00, real=0.09 secs]
> 2022-03-01T09:29:48.960+0100: 512707.092: [CMS-concurrent-mark:
> 37.712/38.145 secs] [Times: user=320.76 sys=50.98, real=38.15 secs]
> 2022-03-01T09:29:48.960+0100: 512707.093: [CMS-concurrent-preclean-start]
> 2022-03-01T09:29:49.271+0100: 512707.404: [GC (Allocation Failure)
> 2022-03-01T09:29:49.271+0100: 512707.404: [ParNew:
> 8140630K->83188K(9050112K), 0.2629305 secs]
> 58491921K->50446168K(79441920K), 0.2633249 secs] [Times: user=0.44
> sys=3.02, real=0.26 secs]
> 2022-03-01T09:29:50.420+0100: 512708.552: [CMS-concurrent-preclean:
> 1.089/1.460 secs] [Times: user=5.27 sys=5.95, real=1.46 secs]
> 2022-03-01T09:29:50.420+0100: 512708.553:
> [CMS-concurrent-abortable-preclean-start]
> 2022-03-01T09:29:54.266+0100: 512712.398:
> [CMS-concurrent-abortable-preclean: 3.784/3.846 secs] [Times: user=17.07
> sys=11.10, real=3.85 secs]
> 2022-03-01T09:29:54.290+0100: 512712.422: [GC (CMS Final Remark) [YG
> occupancy: 4363915 K (9050112 K)]2022-03-01T09:29:54.290+0100: 512712.422:
> [Rescan (parallel) , 0.2017093 secs]2022-03-01T09:29:54.491+0100:
> 512712.624: [weak refs processing, 0.0000442
> secs]2022-03-01T09:29:54.491+0100: 512712.624: [class unloading, 0.0467518
> secs]2022-03-01T09:29:54.538+0100: 512712.671: [scrub symbol table,
> 0.0152955 secs]2022-03-01T09:29:54.554+0100: 512712.686: [scrub string
> table, 0.0014710 secs][1 CMS-remark: 50362980K(70391808K)]
> 54726895K(79441920K), 0.2655851 secs] [Times: user=2.69 sys=0.60, real=0.26
> secs]
> 2022-03-01T09:29:54.556+0100: 512712.688: [CMS-concurrent-sweep-start]
> 2022-03-01T09:29:59.109+0100: 512717.242: [GC (Allocation Failure)
> 2022-03-01T09:29:59.109+0100: 512717.242: [ParNew:
> 8127732K->87947K(9050112K), 0.0986180 secs]
> 58486527K->50450553K(79441920K), 0.0990293 secs] [Times: user=0.97
> sys=0.35, real=0.10 secs]
> 2022-03-01T09:30:10.092+0100: 512728.225: [GC (Allocation Failure)
> 2022-03-01T09:30:10.092+0100: 512728.225: [ParNew:
> 8132491K->90112K(9050112K), 0.0823441 secs]
> 58490044K->50450110K(79441920K), 0.0828723 secs] [Times: user=1.11
> sys=0.00, real=0.08 secs]
> 2022-03-01T09:30:18.262+0100: 512736.395: [GC (Allocation Failure)
> 2022-03-01T09:30:18.262+0100: 512736.395: [ParNew:
> 8134656K->97101K(9050112K), 0.0817780 secs]
> 58490119K->50455026K(79441920K), 0.0821940 secs] [Times: user=1.13
> sys=0.00, real=0.08 secs]
> 2022-03-01T09:30:21.981+0100: 512740.114: [CMS-concurrent-sweep:
> 27.023/27.425 secs] [Times: user=134.40 sys=34.43, real=27.42 secs]
> 2022-03-01T09:30:21.981+0100: 512740.114: [CMS-concurrent-reset-start]
> 2022-03-01T09:30:22.164+0100: 512740.297: [CMS-concurrent-reset:
> 0.183/0.183 secs] [Times: user=0.50 sys=0.00, real=0.18 secs]
> 2022-03-01T09:30:24.171+0100: 512742.304: [GC (CMS Initial Mark) [1
> CMS-initial-mark: 50356777K(70391808K)] 54936817K(79441920K), 0.2049809
> secs] [Times: user=3.17 sys=0.00, real=0.21 secs]
>
> Knowing that the following event are stop-the-world with CMS (initial mark
> and remark) :
> [image: image.png]
>
> Hopefully, you are not using CMS, as JVM heap is much larger then 32GB
>
>
> I am using CMS indeed, because G1GC is still technical preview with
> namenode, isn't it ?
>
> Best regards.
>
> T@le
>
>
> Le mar. 1 mars 2022 à 04:01, gurmukh singh <gu...@yahoo.com> a
> écrit :
>
> Are you sure that there are no JVM pauses?
>
> Hopefully, you are not using CMS, as JVM heap is much larger then 32GB
>
> On Monday, 28 February, 2022, 10:20:56 pm GMT+11, Tale Hive <
> tale2.hive@gmail.com> wrote:
>
>
> Re.
>
> As a note, I managed to got one jstack from the standby namenode when the
> problem occured.
> Here was the state of the listener thread for port 8020 :
> "IPC Server listener on 8020" #122 daemon prio=5 os_prio=0
> tid=0x00007f8a0e84d000 nid=0x6a07e waiting on condition [0x00007f7647ae5000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00007f7a81525618> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>         at
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:350)
>         at
> org.apache.hadoop.ipc.Server$Listener$Reader.addConnection(Server.java:1135)
>         at org.apache.hadoop.ipc.Server$Listener.doAccept(Server.java:1236)
>         at org.apache.hadoop.ipc.Server$Listener.run(Server.java:1167)
>
>    Locked ownable synchronizers:
>         - None
>
> At this time, listenQueue was full and the number of "SYNs to LISTEN
> sockets dropped" increased by 6 000 :
> 25-02-2022 10:18:06:        568780747 SYNs to LISTEN sockets dropped
> (...)
> 25-02-2022 10:18:24:        568786673 SYNs to LISTEN sockets dropped
>
> I don't find anything about 0x00007f7a81525618 in the jstack
> unfortunately.
> It seems listener thread is waiting for something, but I don't know what
> for the moment.
>
> Best regards.
>
> T@le
>
>
>
>
>
>
>
> Le lun. 28 févr. 2022 à 11:52, Tale Hive <ta...@gmail.com> a écrit :
>
> Hello Gurmuck Singh.
>
> Thank you for your answers.
>
> Why 75GB heap size for NN? are you running a very large cluster?
> 50 GB of heap used? Can you check are talking about the NN heap itself or
> are you saying about the total mem used on the server?
> 50GB approx means 200 million blocks? do you have that many.
>
>
> I have ~150 millions of blocks/files and I set up this heap following the
> recommandations here :
>
> https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.3.0/bk_installing_manually_book/content/ref-80953924-1cbf-4655-9953-1e744290a6c3.1.html
>
> The formula is 20 X log base2(n); where n is the number of nodes.
> So, if you have a thousand nodes we keep it to 200 (20 X log2(1024)=200)
> and then approx 20 threads per thousand nodes.
>
>
> I have 600 datanodes, which makes me normally at 20 * log2'(600) = 185
> threads for the ClientRPC server (the one which listens on port 8020)
>
> $ sysctl -n net.core.somaxconn
>
> $ sysctl -n net.ipv4.tcp_max_syn_backlog
>
> $ sysctl -n net.core.netdev_max_backlog
>
>
> net.core.somaxconn= 8432
> net.ipv4.tcp_max_syn_backlog = 4096
> net.core.netdev_max_backlog = 2000
>
>
> $ netstat -an | grep -c SYN_RECV
>
> $ netstat -an | egrep -v "MYIP.(PORTS|IN|LISTEN)"  | wc -l
>
>
> I'll check again and get you more information.
>
> What do you see in the JN logs? and what about ZK logs?
> any logs in NN, ZK on the lines of "Slow sync'
>
> Didn't check these logs, going to check them and get back to you.
>
> What is the ZK heap?
>
> Zookeeper heap is 4 GB.
>
> Disk latency
> Heap
> maxClientCnxns=800 (At least) As you have 600 nodes, so you are expecting
> a high job workload)
> jute.maxbuffer=1GB (By default it is very low, especially in a kerberozied
> env, it must be bumped up). This setting is not there in HDP by default,
> you have to put under custom-zoo.cfg
>
>
> I'm going to check this also.
>
> If you can send me the NN, JN, ZK logs; more than happy to look into it.
>
> I can yes, I just need time to anonymize everything.
>
> Thanks again for your help.
>
> Best regards.
>
> T@le
>
>
>
> Le jeu. 24 févr. 2022 à 21:28, gurmukh singh <gu...@yahoo.com>
> a écrit :
>
> Also, as you are using hive/beeline. You can fetch all the config as:
>
> beeline -u "JDBC URL to connect to HS2 " --outputformat=tsv2 -e 'set -v' >
> /tmp/BeelineSet.out
>
> Please attach the BeelineSet.out
>
> On Friday, 25 February, 2022, 07:15:51 am GMT+11, gurmukh singh
> <gu...@yahoo.com.invalid> wrote:
>
>
> on ZK side
>
> Important things:
>
> Disk latency
> Heap
> maxClientCnxns=800 (At least) As you have 600 nodes, so you are expecting
> a high job workload)
> jute.maxbuffer=1GB (By default it is very low, especially in a kerberozied
> env, it must be bumped up). This setting is not there in HDP by default,
> you have to put under custom-zoo.cfg
>
>
> If you can send me the NN, JN, ZK logs; more than happy to look into it.
>
>
>
> On Friday, 25 February, 2022, 06:59:17 am GMT+11, gurmukh singh
> <gu...@yahoo.com.invalid> wrote:
>
>
> @Tale Hive you provided the details in the first email, missed it.
>
> Can you provide me the output of below from Namenode:
>
> $ sysctl -n net.core.somaxconn
>
> $ sysctl -n net.ipv4.tcp_max_syn_backlog
>
> $ sysctl -n net.core.netdev_max_backlog
>
> $ netstat -an | grep -c SYN_RECV
>
> $ netstat -an | egrep -v "MYIP.(PORTS|IN|LISTEN)"  | wc -l
>
>
> What do you see in the JN logs? and what about ZK logs?
> any logs in NN, ZK on the lines of "Slow sync'
> What is the ZK heap?
>
>
>
> On Friday, 25 February, 2022, 06:42:31 am GMT+11, gurmukh singh <
> gurmukh.dhillon@yahoo.com> wrote:
>
>
> I checked the heap of the namenode and there is no problem (I have 75 GB
> of max heap, I'm around 50 used GB).
>
>     Why 75GB heap size for NN? are you running a very large cluster?
>     50 GB of heap used? Can you check are talking about the NN heap itself
> or are you saying about the total mem used on the server?
>     50GB approx means 200 million blocks? do you have that many.
>
> I checked the threads of the clientRPC for the namenode and I'm at 200
> which respects the recommandations from hadoop operations book.
>     The formula is 20 X log base2(n); where n is the number of nodes.
>     So, if you have a thousand nodes we keep it to 200 (20 X
> log2(1024)=200) and then approx 20 threads per thousand nodes.
>
> I have serviceRPC enabled to prevent any problem which could be coming
> from datanodes or ZKFC.
>
>
> On Thursday, 24 February, 2022, 12:19:51 am GMT+11, Tale Hive <
> tale2.hive@gmail.com> wrote:
>
>
> Hello.
>
> According to what I saw this morning, I can see that I am in the first
> situation in fact :
>
>    - Client sends one packet with flag SYN to namenode
>    - Namenode sends one packet with flags SYN, ACK to the client
>    - Client sends n packets with flags PSH, ACK to the namenode, for each
>    subfolder
>    - Namenode sends n packets PSH, ACK to the client, for the content of
>    each subfolder
>
> So the number of (PSH, ACK) packets from the tcpdump pcap file is not what
> is filling the accept queue of port 8020 ClientRPC server on Namenode.
>
> I'm going to focus on checking the packets with SYN flag which arrive to
> the namenode.
> After that, because the jstack provokes active namenode failover, I don't
> have many more tracks to follow except increase again the listenQueue, to
> mitigate the problem, not to solve it.
>
> Best regards.
>
> T@le
>
>
>
> Le mer. 23 févr. 2022 à 11:46, Tale Hive <ta...@gmail.com> a écrit :
>
> Hello guys.
>
> Still investigating the tcpdump. I don't see a lot of packets with the
> flag SYN when the listenQueue is full.
> What I see is a lot of packets with the flag "PSH, ACK" with data inside
> like this :
> getListing.org.apache.hadoop.hdfs.protocol.ClientProtocol
> /apps/hive/warehouse/<mydb>.db/<mytable>/<mypartition>
>
> It makes me wonder, when a client perform an hdfs dfs -ls -R <HDFS_PATH>,
> how many SYN packets will it send to the namenode ? One in total or one by
> subfolder ?
> Let's say I have "n" subfolders inside <HDFS_PATH>. Will we have this
> situation :
> - Client sends one SYN packet to Namenode
> - Namenode sends one SYN-ACK packets to client
> - Client sends n ACK or (PSH, ACK) packets to Namenode
>
> Or this situation :
> - Client sends n SYN packet to Namenode
> - Namenode sends n SYN-ACK packets to client
> - Client sends n ACK or (PSH, ACK)
>
> It would mean an hdfs recursive listing on a path with a lot of subfolders
> could harm the other clients by sending too many packets to the namenode ?
>
> About the jstack, I tried it on the namenode JVM but it provoked a
> failover, as the namenode was not answering at all (in particular, no
> answer to ZKFC), and the jstack never ended, I had to kill it.
> I don't know if a kill -3 or a jstack -F could help, but at least jstack
> -F contains less valuable information.
>
> T@le
>
> Le mar. 22 févr. 2022 à 10:29, Amith sha <am...@gmail.com> a écrit :
>
> If TCP error occurs then you need to check the network metrics. Yes, TCP
> DUMP can help you.
>
>
> Thanks & Regards
> Amithsha
>
>
> On Tue, Feb 22, 2022 at 1:29 PM Tale Hive <ta...@gmail.com> wrote:
>
> Hello !
>
> @Amith sha <am...@gmail.com>
> I checked also the system metrics, nothing wrong in CPU, RAM or IO.
> The only thing I found was these TCP errors (ListenDrop).
>
> @HK
> I'm monitoring a lot of JVM metrics like this one :
> "UnderReplicatedBlocks" in the bean
> "Hadoop:service=NameNode,name=FSNamesystem".
> And I found no under replicated blocks when the problem of timeout occurs,
> unfortunately.
> Thanks for you advice, in addition to the tcpdump, I'll perform some
> jstacks to see if I can find what ipc handlers are doing.
>
> Best regards.
>
> T@le
>
>
>
>
>
>
> Le mar. 22 févr. 2022 à 04:30, HK <he...@gmail.com> a écrit :
>
> Hi Tape,
> Could you please thread dump of namenode process. Could you please check
> what ipc handlers are doing.
>
> We faced similar issue when the under replication is high in the cluster
> due to filesystem wirteLock.
>
> On Tue, 22 Feb 2022, 8:37 am Amith sha, <am...@gmail.com> wrote:
>
> Check your system metrics too.
>
> On Mon, Feb 21, 2022, 10:52 PM Tale Hive <ta...@gmail.com> wrote:
>
> Yeah, next step is for me to perform a tcpdump just when the problem
> occurs.
> I want to know if my namenode does not accept connections because it
> freezes for some reasons or because there is too many connections at a time.
>
> My delay if far worse than 2s, sometimes, an hdfs dfs -ls -d
> /user/<my-user> takes 20s, 43s and rarely it is even bigger than 1 minut.
> And during this time, CallQueue is OK, Heap is OK, I don't find any
> metrics which could show me a problem inside the namenode JVM.
>
> Best regards.
>
> T@le
>
> Le lun. 21 févr. 2022 à 16:32, Amith sha <am...@gmail.com> a écrit :
>
> If you still concerned about the delay of > 2 s then you need to do
> benchmark with and without load. To find the root cause of the problem it
> will help.
>
> On Mon, Feb 21, 2022, 1:52 PM Tale Hive <ta...@gmail.com> wrote:
>
> Hello Amith.
>
> Hm, not a bad idea. If I increase the size of the listenQueue and if I
> increase timeout, the combination of both may mitigate more the problem
> than just increasing listenQueue size.
> It won't solve the problem of acceptance speed, but it could help.
>
> Thanks for the suggestion !
>
> T@le
>
> Le lun. 21 févr. 2022 à 02:33, Amith sha <am...@gmail.com> a écrit :
>
> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while
> waiting for channel to be ready for connect.
> Connection timed out after 20000 milli sec i suspect this value is very
> low for a namenode with 75Gb of heap usage. Can you increase the value to
> 5sec and check the connection. To increase the value modify this property
> ipc.client.rpc-timeout.ms - core-site.xml (If not present then add to the
> core-site.xml)
>
>
> Thanks & Regards
> Amithsha
>
>
> On Fri, Feb 18, 2022 at 9:17 PM Tale Hive <ta...@gmail.com> wrote:
>
> Hello Tom.
>
> Sorry for my absence of answers, I don't know why gmail puts your mail
> into spam -_-.
>
> To answer you :
>
>    - The metrics callQueueLength, avgQueueTime, avgProcessingTime and GC
>    metric are all OK
>    - Threads are plenty sufficient (I can see the metrics also for them
>    and I  am below 200, the number I have for 8020 RPC server)
>
> Did you see my other answers about this problem ?
> I would be interested to have your opinion about that !
>
> Best regards.
>
> T@le
>
>
> Le mar. 15 févr. 2022 à 02:16, tom lee <to...@gmail.com> a écrit :
>
> It might be helpful to analyze namenode metrics and logs.
>
> What about some key metrics? Examples are callQueueLength, avgQueueTime,
> avgProcessingTime and GC metrics.
>
> In addition, is the number of threads(dfs.namenode.service.handler.count)
> in the namenode sufficient?
>
> Hopefully this will help.
>
> Best regards.
> Tom
>
> Tale Hive <ta...@gmail.com> 于2022年2月14日周一 23:57写道:
>
> Hello.
>
> I encounter a strange problem with my namenode. I have the following
> architecture :
> - Two namenodes in HA
> - 600 datanodes
> - HDP 3.1.4
> - 150 millions of files and folders
>
> Sometimes, when I query the namenode with the hdfs client, I got a timeout
> error like this :
> hdfs dfs -ls -d /user/myuser
>
> 22/02/14 15:07:44 INFO retry.RetryInvocationHandler:
> org.apache.hadoop.net.ConnectTimeoutException: Call From
> <my-client-hostname>/<my-client-ip> to <active-namenode-hostname>:8020
> failed on socket timeout exception:
>   org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout
> while waiting for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending
> remote=<active-namenode-hostname>/<active-namenode-ip>:8020];
>   For more details see:  http://wiki.apache.org/hadoop/SocketTimeout,
> while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over
> <active-namenode-hostname>/<active-namenode-ip>:8020 after 2 failover
> attempts. Trying to failover after sleeping for 2694ms.
>
> I checked the heap of the namenode and there is no problem (I have 75 GB
> of max heap, I'm around 50 used GB).
> I checked the threads of the clientRPC for the namenode and I'm at 200
> which respects the recommandations from hadoop operations book.
> I have serviceRPC enabled to prevent any problem which could be coming
> from datanodes or ZKFC.
> General resources seems OK, CPU usage is pretty fine, same for memory,
> network or IO.
> No firewall is enabled on my namenodes nor my client.
>
> I was wondering what could cause this problem, please ?
>
> Thank you in advance for your help !
>
> Best regards.
>
> T@le
>
>

Re: Client to namenode Socket timeout exception - connection-pending

Posted by gurmukh singh <gu...@yahoo.com.INVALID>.
 "G1GC is still technical preview with namenode, isn't it ? No -> not for HDP 3.x

"We have been using it with HBase on HDP and Cloudera clusters for some time and it works well.

It is not in preview for HDP 3.1.X:
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/data-storage/content/recommended_settings_for_g1gc.html
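
An illustrative hadoop-env.sh sketch for moving the NameNode to G1GC; the
values are placeholders, the HDP doc above stays the reference, and the
existing CMS flags (-XX:+UseConcMarkSweepGC, -XX:+UseParNewGC) must be
removed at the same time (if Ambari manages hadoop-env, adjust its template
instead):

export HADOOP_NAMENODE_OPTS="-Xms75g -Xmx75g \
  -XX:+UseG1GC -XX:MaxGCPauseMillis=200 \
  -XX:InitiatingHeapOccupancyPercent=45 -XX:+ParallelRefProcEnabled \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  ${HADOOP_NAMENODE_OPTS}"
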
    On Tuesday, 1 March, 2022, 08:15:12 pm GMT+11, Tale Hive <ta...@gmail.com> wrote:  
 
 Hello.

Are you sure that there are no JVM pauses? 


Yes, I check multiple metrics (used heap and max heap, GcTotalExtraSleepTime and also the logs from the namenode). None of these metrics indicates a JVM pause when the problem occures.

Here is an example of this morning, you can see listenDrops peak on active namenode (yellow) from 09:27:00 to 09:30:00.


And here are the GC logs at this period :2022-03-01T09:26:57.647+0100: 512535.780: [GC (CMS Initial Mark) [1 CMS-initial-mark: 50350809K(70391808K)] 58394222K(79441920K), 0.3132453 secs] [Times: user=4.92 sys=0.00, real=0.31 secs]2022-03-01T09:26:57.961+0100: 512536.093: [CMS-concurrent-mark-start]
2022-03-01T09:26:58.262+0100: 512536.395: [GC (Allocation Failure) 2022-03-01T09:26:58.262+0100: 512536.395: [ParNew: 8094551K->47062K(9050112K), 0.1147194 secs] 58445361K->50399530K(79441920K), 0.1151125 secs] [Times: user=1.69 sys=0.00, real=0.12 secs]
2022-03-01T09:27:33.903+0100: 512572.036: [CMS-concurrent-mark: 35.668/35.942 secs] [Times: user=159.03 sys=19.73, real=35.94 secs]
2022-03-01T09:27:33.903+0100: 512572.036: [CMS-concurrent-preclean-start]
2022-03-01T09:27:34.414+0100: 512572.547: [CMS-concurrent-preclean: 0.495/0.511 secs] [Times: user=0.61 sys=0.36, real=0.51 secs]
2022-03-01T09:27:34.414+0100: 512572.547: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 2022-03-01T09:27:39.517+0100: 512577.650: [CMS-concurrent-abortable-preclean: 5.083/5.103 secs] [Times: user=6.99 sys=2.78, real=5.11 secs]
2022-03-01T09:27:39.526+0100: 512577.659: [GC (CMS Final Remark) [YG occupancy: 4405873 K (9050112 K)]2022-03-01T09:27:39.526+0100: 512577.659: [Rescan (parallel) , 0.3355048 secs]2022-03-01T09:27:39.862+0100: 512577.995: [weak refs processing, 0.0000413 secs]2022-03-01T09:27:39.862+0100: 512577.995: [class unloading, 0.0482024 secs]2022-03-01T09:27:39.910+0100: 512578.043: [scrub symbol table, 0.0147892 secs]2022-03-01T09:27:39.925+0100: 512578.058: [scrub string table, 0.0018140 secs][1 CMS-remark: 50352467K(70391808K)] 54758340K(79441920K), 0.4007414 secs] [Times: user=2.50 sys=2.90, real=0.40 secs]
2022-03-01T09:27:39.927+0100: 512578.060: [CMS-concurrent-sweep-start]
2022-03-01T09:27:57.615+0100: 512595.748: [CMS-concurrent-sweep: 17.672/17.688 secs] [Times: user=23.00 sys=11.53, real=17.68 secs]
2022-03-01T09:27:57.615+0100: 512595.748: [CMS-concurrent-reset-start]
2022-03-01T09:27:57.802+0100: 512595.935: [CMS-concurrent-reset: 0.187/0.187 secs] [Times: user=0.37 sys=0.00, real=0.19 secs]
2022-03-01T09:27:59.811+0100: 512597.944: [GC (CMS Initial Mark) [1 CMS-initial-mark: 50343447K(70391808K)] 56591863K(79441920K), 0.2328587 secs] [Times: user=3.62 sys=0.00, real=0.23 secs]
2022-03-01T09:28:00.044+0100: 512598.177: [CMS-concurrent-mark-start]
2022-03-01T09:28:18.515+0100: 512616.648: [GC (Allocation Failure) 2022-03-01T09:28:18.516+0100: 512616.648: [ParNew: 8091606K->44561K(9050112K), 0.1131251 secs] 58435054K->50390162K(79441920K), 0.1135414 secs] [Times: user=1.63 sys=0.00, real=0.12 secs]
2022-03-01T09:28:35.363+0100: 512633.496: [CMS-concurrent-mark: 35.063/35.319 secs] [Times: user=177.63 sys=5.00, real=35.32 secs]
2022-03-01T09:28:35.363+0100: 512633.496: [CMS-concurrent-preclean-start]
2022-03-01T09:28:35.919+0100: 512634.052: [CMS-concurrent-preclean: 0.536/0.555 secs] [Times: user=1.00 sys=0.66, real=0.56 secs]
2022-03-01T09:28:35.919+0100: 512634.052: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 2022-03-01T09:28:41.059+0100: 512639.191: [CMS-concurrent-abortable-preclean: 5.123/5.140 secs] [Times: user=8.80 sys=4.16, real=5.14 secs]
2022-03-01T09:28:41.067+0100: 512639.200: [GC (CMS Final Remark) [YG occupancy: 3604963 K (9050112 K)]2022-03-01T09:28:41.068+0100: 512639.200: [Rescan (parallel) , 0.1591429 secs]2022-03-01T09:28:41.227+0100: 512639.360: [weak refs processing, 0.0000823 secs]2022-03-01T09:28:41.227+0100: 512639.360: [class unloading, 0.0551753 secs]2022-03-01T09:28:41.282+0100: 512639.415: [scrub symbol table, 0.0148688 secs]2022-03-01T09:28:41.297+0100: 512639.430: [scrub string table, 0.0015506 secs][1 CMS-remark: 50345600K(70391808K)] 53950564K(79441920K), 0.2311903 secs] [Times: user=2.05 sys=0.55, real=0.23 secs]
2022-03-01T09:28:41.299+0100: 512639.432: [CMS-concurrent-sweep-start]
2022-03-01T09:28:57.951+0100: 512656.083: [GC (Allocation Failure) 2022-03-01T09:28:57.951+0100: 512656.084: [ParNew: 8089105K->57885K(9050112K), 0.0995224 secs] 58427961K->50399315K(79441920K), 0.0999603 secs] [Times: user=0.83 sys=0.56, real=0.10 secs]
2022-03-01T09:29:07.938+0100: 512666.071: [GC (Allocation Failure) 2022-03-01T09:29:07.939+0100: 512666.071: [ParNew: 8102429K->87814K(9050112K), 0.1961157 secs] 58440735K->50428503K(79441920K), 0.1965228 secs] [Times: user=2.94 sys=0.00, real=0.19 secs]
2022-03-01T09:29:08.370+0100: 512666.503: [CMS-concurrent-sweep: 26.577/27.071 secs] [Times: user=85.98 sys=29.07, real=27.07 secs]
2022-03-01T09:29:08.370+0100: 512666.503: [CMS-concurrent-reset-start]
2022-03-01T09:29:08.712+0100: 512666.845: [CMS-concurrent-reset: 0.342/0.342 secs] [Times: user=2.13 sys=0.00, real=0.34 secs]
2022-03-01T09:29:10.721+0100: 512668.854: [GC (CMS Initial Mark) [1 CMS-initial-mark: 50340545K(70391808K)] 52645683K(79441920K), 0.0927209 secs] [Times: user=1.42 sys=0.00, real=0.09 secs]
2022-03-01T09:29:10.814+0100: 512668.947: [CMS-concurrent-mark-start]
2022-03-01T09:29:16.975+0100: 512675.108: [GC (Allocation Failure) 2022-03-01T09:29:16.975+0100: 512675.108: [ParNew: 8132358K->90327K(9050112K), 0.0810929 secs] 58472904K->50433071K(79441920K), 0.0815880 secs] [Times: user=1.15 sys=0.00, real=0.09 secs]
2022-03-01T09:29:24.290+0100: 512682.423: [GC (Allocation Failure) 2022-03-01T09:29:24.290+0100: 512682.423: [ParNew: 8134871K->101007K(9050112K), 0.0819188 secs] 58477615K->50446660K(79441920K), 0.0823128 secs] [Times: user=0.79 sys=0.33, real=0.08 secs]
2022-03-01T09:29:31.967+0100: 512690.100: [GC (Allocation Failure) 2022-03-01T09:29:31.968+0100: 512690.100: [ParNew: 8145551K->85497K(9050112K), 0.0818475 secs] 58491204K->50434672K(79441920K), 0.0822342 secs] [Times: user=0.75 sys=0.36, real=0.08 secs]
2022-03-01T09:29:38.683+0100: 512696.815: [GC (Allocation Failure) 2022-03-01T09:29:38.683+0100: 512696.816: [ParNew: 8130041K->96086K(9050112K), 0.0877875 secs] 58479216K->50447377K(79441920K), 0.0881567 secs] [Times: user=1.26 sys=0.00, real=0.09 secs]
2022-03-01T09:29:48.960+0100: 512707.092: [CMS-concurrent-mark: 37.712/38.145 secs] [Times: user=320.76 sys=50.98, real=38.15 secs]
2022-03-01T09:29:48.960+0100: 512707.093: [CMS-concurrent-preclean-start]
2022-03-01T09:29:49.271+0100: 512707.404: [GC (Allocation Failure) 2022-03-01T09:29:49.271+0100: 512707.404: [ParNew: 8140630K->83188K(9050112K), 0.2629305 secs] 58491921K->50446168K(79441920K), 0.2633249 secs] [Times: user=0.44 sys=3.02, real=0.26 secs]
2022-03-01T09:29:50.420+0100: 512708.552: [CMS-concurrent-preclean: 1.089/1.460 secs] [Times: user=5.27 sys=5.95, real=1.46 secs]
2022-03-01T09:29:50.420+0100: 512708.553: [CMS-concurrent-abortable-preclean-start]
2022-03-01T09:29:54.266+0100: 512712.398: [CMS-concurrent-abortable-preclean: 3.784/3.846 secs] [Times: user=17.07 sys=11.10, real=3.85 secs]
2022-03-01T09:29:54.290+0100: 512712.422: [GC (CMS Final Remark) [YG occupancy: 4363915 K (9050112 K)]2022-03-01T09:29:54.290+0100: 512712.422: [Rescan (parallel) , 0.2017093 secs]2022-03-01T09:29:54.491+0100: 512712.624: [weak refs processing, 0.0000442 secs]2022-03-01T09:29:54.491+0100: 512712.624: [class unloading, 0.0467518 secs]2022-03-01T09:29:54.538+0100: 512712.671: [scrub symbol table, 0.0152955 secs]2022-03-01T09:29:54.554+0100: 512712.686: [scrub string table, 0.0014710 secs][1 CMS-remark: 50362980K(70391808K)] 54726895K(79441920K), 0.2655851 secs] [Times: user=2.69 sys=0.60, real=0.26 secs]
2022-03-01T09:29:54.556+0100: 512712.688: [CMS-concurrent-sweep-start]
2022-03-01T09:29:59.109+0100: 512717.242: [GC (Allocation Failure) 2022-03-01T09:29:59.109+0100: 512717.242: [ParNew: 8127732K->87947K(9050112K), 0.0986180 secs] 58486527K->50450553K(79441920K), 0.0990293 secs] [Times: user=0.97 sys=0.35, real=0.10 secs]
2022-03-01T09:30:10.092+0100: 512728.225: [GC (Allocation Failure) 2022-03-01T09:30:10.092+0100: 512728.225: [ParNew: 8132491K->90112K(9050112K), 0.0823441 secs] 58490044K->50450110K(79441920K), 0.0828723 secs] [Times: user=1.11 sys=0.00, real=0.08 secs]
2022-03-01T09:30:18.262+0100: 512736.395: [GC (Allocation Failure) 2022-03-01T09:30:18.262+0100: 512736.395: [ParNew: 8134656K->97101K(9050112K), 0.0817780 secs] 58490119K->50455026K(79441920K), 0.0821940 secs] [Times: user=1.13 sys=0.00, real=0.08 secs]
2022-03-01T09:30:21.981+0100: 512740.114: [CMS-concurrent-sweep: 27.023/27.425 secs] [Times: user=134.40 sys=34.43, real=27.42 secs]
2022-03-01T09:30:21.981+0100: 512740.114: [CMS-concurrent-reset-start]
2022-03-01T09:30:22.164+0100: 512740.297: [CMS-concurrent-reset: 0.183/0.183 secs] [Times: user=0.50 sys=0.00, real=0.18 secs]
2022-03-01T09:30:24.171+0100: 512742.304: [GC (CMS Initial Mark) [1 CMS-initial-mark: 50356777K(70391808K)] 54936817K(79441920K), 0.2049809 secs] [Times: user=3.17 sys=0.00, real=0.21 secs]

Knowing that the following event are stop-the-world with CMS (initial mark and remark) :


Hopefully, you are not using CMS, as JVM heap is much larger then 32GB 


I am using CMS indeed, because G1GC is still technical preview with namenode, isn't it ?

Best regards.
T@le


Le mar. 1 mars 2022 à 04:01, gurmukh singh <gu...@yahoo.com> a écrit :

 Are you sure that there are no JVM pauses?
Hopefully, you are not using CMS, as JVM heap is much larger then 32GB

    On Monday, 28 February, 2022, 10:20:56 pm GMT+11, Tale Hive <ta...@gmail.com> wrote:  
 
 Re.
As a note, I managed to got one jstack from the standby namenode when the problem occured.Here was the state of the listener thread for port 8020 :"IPC Server listener on 8020" #122 daemon prio=5 os_prio=0 tid=0x00007f8a0e84d000 nid=0x6a07e waiting on condition [0x00007f7647ae5000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f7a81525618> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:350)
        at org.apache.hadoop.ipc.Server$Listener$Reader.addConnection(Server.java:1135)
        at org.apache.hadoop.ipc.Server$Listener.doAccept(Server.java:1236)
        at org.apache.hadoop.ipc.Server$Listener.run(Server.java:1167)

   Locked ownable synchronizers:
        - None
At this time, listenQueue was full and the number of "SYNs to LISTEN sockets dropped" increased by 6 000 :25-02-2022 10:18:06:        568780747 SYNs to LISTEN sockets dropped(...)
25-02-2022 10:18:24:        568786673 SYNs to LISTEN sockets dropped

I don't find anything about 0x00007f7a81525618 in the jstack unfortunately.It seems listener thread is waiting for something, but I don't know what for the moment.
Best regards.
T@le

 




Le lun. 28 févr. 2022 à 11:52, Tale Hive <ta...@gmail.com> a écrit :

Hello Gurmuck Singh.
Thank you for your answers.


Why 75GB heap size for NN? are you running a very large cluster?50 GB of heap used? Can you check are talking about the NN heap itself or are you saying about the total mem used on the server?50GB approx means 200 million blocks? do you have that many.

I have ~150 millions of blocks/files and I set up this heap following the recommandations here :https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.3.0/bk_installing_manually_book/content/ref-80953924-1cbf-4655-9953-1e744290a6c3.1.html

The formula is 20 X log base2(n); where n is the number of nodes.So, if you have a thousand nodes we keep it to 200 (20 X log2(1024)=200) and then approx 20 threads per thousand nodes.

I have 600 datanodes, which makes me normally at 20 * log2'(600) = 185 threads for the ClientRPC server (the one which listens on port 8020)


$ sysctl -n net.core.somaxconn

$ sysctl -n net.ipv4.tcp_max_syn_backlog

$ sysctl -n net.core.netdev_max_backlog




net.core.somaxconn= 8432
net.ipv4.tcp_max_syn_backlog = 4096
net.core.netdev_max_backlog = 2000



$ netstat -an | grep -c SYN_RECV


$ netstat -an | egrep -v "MYIP.(PORTS|IN|LISTEN)"  | wc -l


I'll check again and get you more information.


What do you see in the JN logs? and what about ZK logs?any logs in NN, ZK on the lines of "Slow sync'





Didn't check these logs, going to check them and get back to you.


What is the ZK heap?



Zookeeper heap is 4 GB.





Disk latency
Heap
maxClientCnxns=800 (At least) As you have 600 nodes, so you are expecting a high job workload)
jute.maxbuffer=1GB (By default it is very low, especially in a kerberozied env, it must be bumped up). This setting is not there in HDP by default, you have to put under custom-zoo.cfg


I'm going to check this also.


If you can send me the NN, JN, ZK logs; more than happy to look into it.

I can yes, I just need time to anonymize everything.


Thanks again for your help.

Best regards.

T@le





Le jeu. 24 févr. 2022 à 21:28, gurmukh singh <gu...@yahoo.com> a écrit :

 Also, as you are using hive/beeline. You can fetch all the config as:

beeline -u "JDBC URL to connect to HS2 " --outputformat=tsv2 -e 'set -v' > /tmp/BeelineSet.out

Please attach the BeelineSet.out

    On Friday, 25 February, 2022, 07:15:51 am GMT+11, gurmukh singh <gu...@yahoo.com.invalid> wrote:  
 
  on ZK side
Important things:
Disk latency
Heap
maxClientCnxns=800 (At least) As you have 600 nodes, so you are expecting a high job workload)
jute.maxbuffer=1GB (By default it is very low, especially in a kerberozied env, it must be bumped up). This setting is not there in HDP by default, you have to put under custom-zoo.cfg


If you can send me the NN, JN, ZK logs; more than happy to look into it.


   On Friday, 25 February, 2022, 06:59:17 am GMT+11, gurmukh singh <gu...@yahoo.com.invalid> wrote:  
 
  @Tale Hive you provided the details in the first email, missed it.
Can you provide me the output of below from Namenode:

$ sysctl -n net.core.somaxconn

$ sysctl -n net.ipv4.tcp_max_syn_backlog

$ sysctl -n net.core.netdev_max_backlog


$ netstat -an | grep -c SYN_RECV

$ netstat -an | egrep -v "MYIP.(PORTS|IN|LISTEN)"  | wc -l


What do you see in the JN logs? and what about ZK logs?any logs in NN, ZK on the lines of "Slow sync'What is the ZK heap?



    On Friday, 25 February, 2022, 06:42:31 am GMT+11, gurmukh singh <gu...@yahoo.com> wrote:  
 
  I checked the heap of the namenode and there is no problem (I have 75 GB of max heap, I'm around 50 used GB).    
    Why 75GB heap size for NN? are you running a very large cluster?    50 GB of heap used? Can you check are talking about the NN heap itself or are you saying about the total mem used on the server?    50GB approx means 200 million blocks? do you have that many.

I checked the threads of the clientRPC for the namenode and I'm at 200 which respects the recommandations from hadoop operations book.    The formula is 20 X log base2(n); where n is the number of nodes.    So, if you have a thousand nodes we keep it to 200 (20 X log2(1024)=200) and then approx 20 threads per thousand nodes.
I have serviceRPC enabled to prevent any problem which could be coming from datanodes or ZKFC.

    On Thursday, 24 February, 2022, 12:19:51 am GMT+11, Tale Hive <ta...@gmail.com> wrote:  
 
 Hello.
According to what I saw this morning, I can see that I am in the first situation in fact :   
   - Client sends one packet with flag SYN to namenode
   - Namenode sends one packet with flags SYN, ACK to the client
   - Client sends n packets with flags PSH, ACK to the namenode, for each subfolder
   - Namenode sends n packets PSH, ACK to the client, for the content of each subfolder
So the number of (PSH, ACK) packets from the tcpdump pcap file is not what is filling the accept queue of port 8020 ClientRPC server on Namenode.
I'm going to focus on checking the packets with SYN flag which arrive to the namenode.
After that, because the jstack provokes active namenode failover, I don't have many more tracks to follow except increase again the listenQueue, to mitigate the problem, not to solve it.
Best regards.
T@le


Le mer. 23 févr. 2022 à 11:46, Tale Hive <ta...@gmail.com> a écrit :

Hello guys.
Still investigating the tcpdump. I don't see a lot of packets with the flag SYN when the listenQueue is full. 
What I see is a lot of packets with the flag "PSH, ACK" with data inside like this :getListing.org.apache.hadoop.hdfs.protocol.ClientProtocol 
/apps/hive/warehouse/<mydb>.db/<mytable>/<mypartition>
It makes me wonder, when a client perform an hdfs dfs -ls -R <HDFS_PATH>, how many SYN packets will it send to the namenode ? One in total or one by subfolder ?Let's say I have "n" subfolders inside <HDFS_PATH>. Will we have this situation :- Client sends one SYN packet to Namenode- Namenode sends one SYN-ACK packets to client- Client sends n ACK or (PSH, ACK) packets to Namenode
Or this situation :- Client sends n SYN packet to Namenode- Namenode sends n SYN-ACK packets to client- Client sends n ACK or (PSH, ACK) 

It would mean an hdfs recursive listing on a path with a lot of subfolders could harm the other clients by sending too many packets to the namenode ?

About the jstack, I tried it on the namenode JVM but it provoked a failover, as the namenode was not answering at all (in particular, no answer to ZKFC), and the jstack never ended, I had to kill it.I don't know if a kill -3 or a jstack -F could help, but at least jstack -F contains less valuable information.

T@le

Le mar. 22 févr. 2022 à 10:29, Amith sha <am...@gmail.com> a écrit :

If TCP error occurs then you need to check the network metrics. Yes, TCP DUMP can help you. 


Thanks & Regards
Amithsha

On Tue, Feb 22, 2022 at 1:29 PM Tale Hive <ta...@gmail.com> wrote:

Hello !

@Amith sha 
I checked also the system metrics, nothing wrong in CPU, RAM or IO.The only thing I found was these TCP errors (ListenDrop).
@HKI'm monitoring a lot of JVM metrics like this one : "UnderReplicatedBlocks" in the bean "Hadoop:service=NameNode,name=FSNamesystem".And I found no under replicated blocks when the problem of timeout occurs, unfortunately.Thanks for you advice, in addition to the tcpdump, I'll perform some jstacks to see if I can find what ipc handlers are doing.
Best regards.
T@le






Le mar. 22 févr. 2022 à 04:30, HK <he...@gmail.com> a écrit :

Hi Tape,Could you please thread dump of namenode process. Could you please check what ipc handlers are doing. 
We faced similar issue when the under replication is high in the cluster due to filesystem wirteLock. 
On Tue, 22 Feb 2022, 8:37 am Amith sha, <am...@gmail.com> wrote:

Check your system metrics too. 
On Mon, Feb 21, 2022, 10:52 PM Tale Hive <ta...@gmail.com> wrote:

Yeah, the next step for me is to perform a tcpdump just when the problem occurs.
I want to know if my namenode does not accept connections because it freezes for some reason or because there are too many connections at a time.
My delay is far worse than 2s: sometimes an hdfs dfs -ls -d /user/<my-user> takes 20s, 43s, and rarely it is even longer than 1 minute.
And during this time, the CallQueue is OK, the heap is OK; I don't find any metric which could show me a problem inside the namenode JVM.
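When it happens again, I plan to time the command and rerun the client in debug mode to see whether the time goes into connecting or into waiting for the RPC answer (a sketch; the grep pattern is just my guess at the interesting lines):

$ time hdfs dfs -ls -d /user/<my-user>
$ HADOOP_ROOT_LOGGER=DEBUG,console hdfs dfs -ls -d /user/<my-user> 2>&1 | grep -iE 'connect|retry|timeout'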

Best regards.
T@le

Le lun. 21 févr. 2022 à 16:32, Amith sha <am...@gmail.com> a écrit :

If you are still concerned about a delay of > 2 s, then you need to benchmark with and without load. It will help to find the root cause of the problem.
On Mon, Feb 21, 2022, 1:52 PM Tale Hive <ta...@gmail.com> wrote:

Hello Amith.
Hm, not a bad idea. If I increase the size of the listenQueue and also increase the timeout, the combination of both may mitigate the problem more than just increasing the listenQueue size.
It won't solve the problem of acceptance speed, but it could help.
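If I read the error right, the 20000 ms comes from the client connect timeout, and the listenQueue should be ipc.server.listen.queue.size on the namenode side (restart needed after changing it). A quick sketch to check the current values:

$ hdfs getconf -confKey ipc.server.listen.queue.size   # accept backlog of the RPC server, default 128
$ hdfs getconf -confKey ipc.client.connect.timeout     # the 20000 ms connect timeout seen in the error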
Thanks for the suggestion !
T@le
Le lun. 21 févr. 2022 à 02:33, Amith sha <am...@gmail.com> a écrit :

org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. 
Connection timed out after 20000 ms; I suspect this value is very low for a namenode with 75 GB of heap usage. Can you increase the value to 5 sec and check the connection? To increase the value, modify this property:
ipc.client.rpc-timeout.ms in core-site.xml (if not present, add it to core-site.xml)


Thanks & Regards
Amithsha

On Fri, Feb 18, 2022 at 9:17 PM Tale Hive <ta...@gmail.com> wrote:

Hello Tom.
Sorry for the lack of answers; I don't know why Gmail puts your mail into spam -_-.
To answer you:
   - The metrics callQueueLength, avgQueueTime, avgProcessingTime and GC metric are all OK
   - Threads are plenty sufficient (I can see the metrics for them as well, and I am below 200, the number I have for the 8020 RPC server)
Did you see my other answers about this problem? I would be interested to have your opinion about them!
Best regards.
T@le


Le mar. 15 févr. 2022 à 02:16, tom lee <to...@gmail.com> a écrit :

It might be helpful to analyze namenode metrics and logs.

What about some key metrics? Examples are callQueueLength, avgQueueTime, avgProcessingTime and GC metrics.

In addition, is the number of threads(dfs.namenode.service.handler.count) in the namenode sufficient? 

Hopefully this will help.

Best regards.
Tom

Tale Hive <ta...@gmail.com> 于2022年2月14日周一 23:57写道:

Hello.
I encounter a strange problem with my namenode. I have the following architecture:
- Two namenodes in HA
- 600 datanodes
- HDP 3.1.4
- 150 million files and folders

Sometimes, when I query the namenode with the hdfs client, I got a timeout error like this :
hdfs dfs -ls -d /user/myuser
22/02/14 15:07:44 INFO retry.RetryInvocationHandler: org.apache.hadoop.net.ConnectTimeoutException: Call From <my-client-hostname>/<my-client-ip> to <active-namenode-hostname>:8020 failed on socket timeout exception: 
  org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=<active-namenode-hostname>/<active-namenode-ip>:8020]; 
  For more details see:  http://wiki.apache.org/hadoop/SocketTimeout, 
while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over <active-namenode-hostname>/<active-namenode-ip>:8020 after 2 failover attempts. Trying to failover after sleeping for 2694ms. 

I checked the heap of the namenode and there is no problem (I have 75 GB of max heap, and I'm around 50 GB used).
I checked the threads of the ClientRPC for the namenode and I'm at 200, which respects the recommendations from the Hadoop Operations book.
I have serviceRPC enabled to prevent any problem which could be coming from the datanodes or ZKFC.
General resources seem OK: CPU usage is pretty fine, same for memory, network and IO.
No firewall is enabled on my namenodes nor on my client.
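For completeness, this is roughly how I check the handler counts and the RPC metrics of port 8020 (50070 is my namenode HTTP port; adjust as needed):

$ hdfs getconf -confKey dfs.namenode.handler.count
$ hdfs getconf -confKey dfs.namenode.service.handler.count
$ curl -s 'http://<active-namenode-hostname>:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020' \
    | grep -E 'CallQueueLength|NumOpenConnections|RpcQueueTimeAvgTime|RpcProcessingTimeAvgTime'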

I was wondering what could cause this problem, please ?
Thank you in advance for your help !

Best regards.
T@le












Re: Client to namenode Socket timeout exception - connection-pending

Posted by Tale Hive <ta...@gmail.com>.
Hello.

> Are you sure that there are no JVM pauses?

Yes, I checked multiple metrics (used heap and max heap, GcTotalExtraSleepTime, and also the logs from the namenode). None of these metrics indicates a JVM pause when the problem occurs.
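Concretely, this is the kind of check I do (the log path is just an example from my install; the JvmPauseMonitor logs a "Detected pause" line when it sleeps longer than expected):

$ grep 'Detected pause in JVM or host machine' /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log
$ curl -s 'http://<active-namenode-hostname>:50070/jmx?qry=Hadoop:service=NameNode,name=JvmMetrics' \
    | grep GcTotalExtraSleepTime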

Here is an example from this morning; you can see the listenDrops peak on the active namenode (yellow) from 09:27:00 to 09:30:00.
[image: image.png]

And here are the GC logs for this period:
2022-03-01T09:26:57.647+0100: 512535.780: [GC (CMS Initial Mark) [1
CMS-initial-mark: 50350809K(70391808K)] 58394222K(79441920K), 0.3132453
secs] [Times: user=4.92 sys=0.00, real=0.31 secs]
2022-03-01T09:26:57.961+0100: 512536.093: [CMS-concurrent-mark-start]
2022-03-01T09:26:58.262+0100: 512536.395: [GC (Allocation Failure)
2022-03-01T09:26:58.262+0100: 512536.395: [ParNew:
8094551K->47062K(9050112K), 0.1147194 secs]
58445361K->50399530K(79441920K), 0.1151125 secs] [Times: user=1.69
sys=0.00, real=0.12 secs]
2022-03-01T09:27:33.903+0100: 512572.036: [CMS-concurrent-mark:
35.668/35.942 secs] [Times: user=159.03 sys=19.73, real=35.94 secs]
2022-03-01T09:27:33.903+0100: 512572.036: [CMS-concurrent-preclean-start]
2022-03-01T09:27:34.414+0100: 512572.547: [CMS-concurrent-preclean:
0.495/0.511 secs] [Times: user=0.61 sys=0.36, real=0.51 secs]
2022-03-01T09:27:34.414+0100: 512572.547:
[CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 2022-03-01T09:27:39.517+0100: 512577.650:
[CMS-concurrent-abortable-preclean: 5.083/5.103 secs] [Times: user=6.99
sys=2.78, real=5.11 secs]
2022-03-01T09:27:39.526+0100: 512577.659: [GC (CMS Final Remark) [YG
occupancy: 4405873 K (9050112 K)]2022-03-01T09:27:39.526+0100: 512577.659:
[Rescan (parallel) , 0.3355048 secs]2022-03-01T09:27:39.862+0100:
512577.995: [weak refs processing, 0.0000413
secs]2022-03-01T09:27:39.862+0100: 512577.995: [class unloading, 0.0482024
secs]2022-03-01T09:27:39.910+0100: 512578.043: [scrub symbol table,
0.0147892 secs]2022-03-01T09:27:39.925+0100: 512578.058: [scrub string
table, 0.0018140 secs][1 CMS-remark: 50352467K(70391808K)]
54758340K(79441920K), 0.4007414 secs] [Times: user=2.50 sys=2.90, real=0.40
secs]
2022-03-01T09:27:39.927+0100: 512578.060: [CMS-concurrent-sweep-start]
2022-03-01T09:27:57.615+0100: 512595.748: [CMS-concurrent-sweep:
17.672/17.688 secs] [Times: user=23.00 sys=11.53, real=17.68 secs]
2022-03-01T09:27:57.615+0100: 512595.748: [CMS-concurrent-reset-start]
2022-03-01T09:27:57.802+0100: 512595.935: [CMS-concurrent-reset:
0.187/0.187 secs] [Times: user=0.37 sys=0.00, real=0.19 secs]
2022-03-01T09:27:59.811+0100: 512597.944: [GC (CMS Initial Mark) [1
CMS-initial-mark: 50343447K(70391808K)] 56591863K(79441920K), 0.2328587
secs] [Times: user=3.62 sys=0.00, real=0.23 secs]
2022-03-01T09:28:00.044+0100: 512598.177: [CMS-concurrent-mark-start]
2022-03-01T09:28:18.515+0100: 512616.648: [GC (Allocation Failure)
2022-03-01T09:28:18.516+0100: 512616.648: [ParNew:
8091606K->44561K(9050112K), 0.1131251 secs]
58435054K->50390162K(79441920K), 0.1135414 secs] [Times: user=1.63
sys=0.00, real=0.12 secs]
2022-03-01T09:28:35.363+0100: 512633.496: [CMS-concurrent-mark:
35.063/35.319 secs] [Times: user=177.63 sys=5.00, real=35.32 secs]
2022-03-01T09:28:35.363+0100: 512633.496: [CMS-concurrent-preclean-start]
2022-03-01T09:28:35.919+0100: 512634.052: [CMS-concurrent-preclean:
0.536/0.555 secs] [Times: user=1.00 sys=0.66, real=0.56 secs]
2022-03-01T09:28:35.919+0100: 512634.052:
[CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 2022-03-01T09:28:41.059+0100: 512639.191:
[CMS-concurrent-abortable-preclean: 5.123/5.140 secs] [Times: user=8.80
sys=4.16, real=5.14 secs]
2022-03-01T09:28:41.067+0100: 512639.200: [GC (CMS Final Remark) [YG
occupancy: 3604963 K (9050112 K)]2022-03-01T09:28:41.068+0100: 512639.200:
[Rescan (parallel) , 0.1591429 secs]2022-03-01T09:28:41.227+0100:
512639.360: [weak refs processing, 0.0000823
secs]2022-03-01T09:28:41.227+0100: 512639.360: [class unloading, 0.0551753
secs]2022-03-01T09:28:41.282+0100: 512639.415: [scrub symbol table,
0.0148688 secs]2022-03-01T09:28:41.297+0100: 512639.430: [scrub string
table, 0.0015506 secs][1 CMS-remark: 50345600K(70391808K)]
53950564K(79441920K), 0.2311903 secs] [Times: user=2.05 sys=0.55, real=0.23
secs]
2022-03-01T09:28:41.299+0100: 512639.432: [CMS-concurrent-sweep-start]
2022-03-01T09:28:57.951+0100: 512656.083: [GC (Allocation Failure)
2022-03-01T09:28:57.951+0100: 512656.084: [ParNew:
8089105K->57885K(9050112K), 0.0995224 secs]
58427961K->50399315K(79441920K), 0.0999603 secs] [Times: user=0.83
sys=0.56, real=0.10 secs]
2022-03-01T09:29:07.938+0100: 512666.071: [GC (Allocation Failure)
2022-03-01T09:29:07.939+0100: 512666.071: [ParNew:
8102429K->87814K(9050112K), 0.1961157 secs]
58440735K->50428503K(79441920K), 0.1965228 secs] [Times: user=2.94
sys=0.00, real=0.19 secs]
2022-03-01T09:29:08.370+0100: 512666.503: [CMS-concurrent-sweep:
26.577/27.071 secs] [Times: user=85.98 sys=29.07, real=27.07 secs]
2022-03-01T09:29:08.370+0100: 512666.503: [CMS-concurrent-reset-start]
2022-03-01T09:29:08.712+0100: 512666.845: [CMS-concurrent-reset:
0.342/0.342 secs] [Times: user=2.13 sys=0.00, real=0.34 secs]
2022-03-01T09:29:10.721+0100: 512668.854: [GC (CMS Initial Mark) [1
CMS-initial-mark: 50340545K(70391808K)] 52645683K(79441920K), 0.0927209
secs] [Times: user=1.42 sys=0.00, real=0.09 secs]
2022-03-01T09:29:10.814+0100: 512668.947: [CMS-concurrent-mark-start]
2022-03-01T09:29:16.975+0100: 512675.108: [GC (Allocation Failure)
2022-03-01T09:29:16.975+0100: 512675.108: [ParNew:
8132358K->90327K(9050112K), 0.0810929 secs]
58472904K->50433071K(79441920K), 0.0815880 secs] [Times: user=1.15
sys=0.00, real=0.09 secs]
2022-03-01T09:29:24.290+0100: 512682.423: [GC (Allocation Failure)
2022-03-01T09:29:24.290+0100: 512682.423: [ParNew:
8134871K->101007K(9050112K), 0.0819188 secs]
58477615K->50446660K(79441920K), 0.0823128 secs] [Times: user=0.79
sys=0.33, real=0.08 secs]
2022-03-01T09:29:31.967+0100: 512690.100: [GC (Allocation Failure)
2022-03-01T09:29:31.968+0100: 512690.100: [ParNew:
8145551K->85497K(9050112K), 0.0818475 secs]
58491204K->50434672K(79441920K), 0.0822342 secs] [Times: user=0.75
sys=0.36, real=0.08 secs]
2022-03-01T09:29:38.683+0100: 512696.815: [GC (Allocation Failure)
2022-03-01T09:29:38.683+0100: 512696.816: [ParNew:
8130041K->96086K(9050112K), 0.0877875 secs]
58479216K->50447377K(79441920K), 0.0881567 secs] [Times: user=1.26
sys=0.00, real=0.09 secs]
2022-03-01T09:29:48.960+0100: 512707.092: [CMS-concurrent-mark:
37.712/38.145 secs] [Times: user=320.76 sys=50.98, real=38.15 secs]
2022-03-01T09:29:48.960+0100: 512707.093: [CMS-concurrent-preclean-start]
2022-03-01T09:29:49.271+0100: 512707.404: [GC (Allocation Failure)
2022-03-01T09:29:49.271+0100: 512707.404: [ParNew:
8140630K->83188K(9050112K), 0.2629305 secs]
58491921K->50446168K(79441920K), 0.2633249 secs] [Times: user=0.44
sys=3.02, real=0.26 secs]
2022-03-01T09:29:50.420+0100: 512708.552: [CMS-concurrent-preclean:
1.089/1.460 secs] [Times: user=5.27 sys=5.95, real=1.46 secs]
2022-03-01T09:29:50.420+0100: 512708.553:
[CMS-concurrent-abortable-preclean-start]
2022-03-01T09:29:54.266+0100: 512712.398:
[CMS-concurrent-abortable-preclean: 3.784/3.846 secs] [Times: user=17.07
sys=11.10, real=3.85 secs]
2022-03-01T09:29:54.290+0100: 512712.422: [GC (CMS Final Remark) [YG
occupancy: 4363915 K (9050112 K)]2022-03-01T09:29:54.290+0100: 512712.422:
[Rescan (parallel) , 0.2017093 secs]2022-03-01T09:29:54.491+0100:
512712.624: [weak refs processing, 0.0000442
secs]2022-03-01T09:29:54.491+0100: 512712.624: [class unloading, 0.0467518
secs]2022-03-01T09:29:54.538+0100: 512712.671: [scrub symbol table,
0.0152955 secs]2022-03-01T09:29:54.554+0100: 512712.686: [scrub string
table, 0.0014710 secs][1 CMS-remark: 50362980K(70391808K)]
54726895K(79441920K), 0.2655851 secs] [Times: user=2.69 sys=0.60, real=0.26
secs]
2022-03-01T09:29:54.556+0100: 512712.688: [CMS-concurrent-sweep-start]
2022-03-01T09:29:59.109+0100: 512717.242: [GC (Allocation Failure)
2022-03-01T09:29:59.109+0100: 512717.242: [ParNew:
8127732K->87947K(9050112K), 0.0986180 secs]
58486527K->50450553K(79441920K), 0.0990293 secs] [Times: user=0.97
sys=0.35, real=0.10 secs]
2022-03-01T09:30:10.092+0100: 512728.225: [GC (Allocation Failure)
2022-03-01T09:30:10.092+0100: 512728.225: [ParNew:
8132491K->90112K(9050112K), 0.0823441 secs]
58490044K->50450110K(79441920K), 0.0828723 secs] [Times: user=1.11
sys=0.00, real=0.08 secs]
2022-03-01T09:30:18.262+0100: 512736.395: [GC (Allocation Failure)
2022-03-01T09:30:18.262+0100: 512736.395: [ParNew:
8134656K->97101K(9050112K), 0.0817780 secs]
58490119K->50455026K(79441920K), 0.0821940 secs] [Times: user=1.13
sys=0.00, real=0.08 secs]
2022-03-01T09:30:21.981+0100: 512740.114: [CMS-concurrent-sweep:
27.023/27.425 secs] [Times: user=134.40 sys=34.43, real=27.42 secs]
2022-03-01T09:30:21.981+0100: 512740.114: [CMS-concurrent-reset-start]
2022-03-01T09:30:22.164+0100: 512740.297: [CMS-concurrent-reset:
0.183/0.183 secs] [Times: user=0.50 sys=0.00, real=0.18 secs]
2022-03-01T09:30:24.171+0100: 512742.304: [GC (CMS Initial Mark) [1
CMS-initial-mark: 50356777K(70391808K)] 54936817K(79441920K), 0.2049809
secs] [Times: user=3.17 sys=0.00, real=0.21 secs]

Knowing that the following events are stop-the-world with CMS (initial mark and final remark):
[image: image.png]
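A quick way to pull those stop-the-world pauses out of the GC log above (assuming it is written to a single gc.log file) is to keep only the initial-mark and final-remark lines and extract their wall-clock times:

$ grep -E 'CMS-initial-mark|CMS Final Remark' gc.log | grep -oE 'real=[0-9.]+ secs'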

> Hopefully, you are not using CMS, as JVM heap is much larger then 32GB

I am indeed using CMS, because G1GC is still a technical preview for the namenode, isn't it?
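To double-check which collector is actually enabled on the running process, I can do something like this (assuming <namenode-pid>):

$ jcmd <namenode-pid> VM.flags | tr ' ' '\n' | grep -E 'UseConcMarkSweepGC|UseParNewGC|UseG1GC'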

Best regards.

T@le


Le mar. 1 mars 2022 à 04:01, gurmukh singh <gu...@yahoo.com> a
écrit :

> Are you sure that there are no JVM pauses?
>
> Hopefully, you are not using CMS, as JVM heap is much larger then 32GB
>
> On Monday, 28 February, 2022, 10:20:56 pm GMT+11, Tale Hive <
> tale2.hive@gmail.com> wrote:
>
>
> Re.
>
> As a note, I managed to got one jstack from the standby namenode when the
> problem occured.
> Here was the state of the listener thread for port 8020 :
> "IPC Server listener on 8020" #122 daemon prio=5 os_prio=0
> tid=0x00007f8a0e84d000 nid=0x6a07e waiting on condition [0x00007f7647ae5000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00007f7a81525618> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>         at
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:350)
>         at
> org.apache.hadoop.ipc.Server$Listener$Reader.addConnection(Server.java:1135)
>         at org.apache.hadoop.ipc.Server$Listener.doAccept(Server.java:1236)
>         at org.apache.hadoop.ipc.Server$Listener.run(Server.java:1167)
>
>    Locked ownable synchronizers:
>         - None
>
> At this time, listenQueue was full and the number of "SYNs to LISTEN
> sockets dropped" increased by 6 000 :
> 25-02-2022 10:18:06:        568780747 SYNs to LISTEN sockets dropped
> (...)
> 25-02-2022 10:18:24:        568786673 SYNs to LISTEN sockets dropped
>
> I don't find anything about 0x00007f7a81525618 in the jstack
> unfortunately.
> It seems listener thread is waiting for something, but I don't know what
> for the moment.
>
> Best regards.
>
> T@le
>
>
>
>
>
>
>
> Le lun. 28 févr. 2022 à 11:52, Tale Hive <ta...@gmail.com> a écrit :
>
> Hello Gurmuck Singh.
>
> Thank you for your answers.
>
> Why 75GB heap size for NN? are you running a very large cluster?
> 50 GB of heap used? Can you check are talking about the NN heap itself or
> are you saying about the total mem used on the server?
> 50GB approx means 200 million blocks? do you have that many.
>
>
> I have ~150 millions of blocks/files and I set up this heap following the
> recommandations here :
>
> https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.3.0/bk_installing_manually_book/content/ref-80953924-1cbf-4655-9953-1e744290a6c3.1.html
>
> The formula is 20 X log base2(n); where n is the number of nodes.
> So, if you have a thousand nodes we keep it to 200 (20 X log2(1024)=200)
> and then approx 20 threads per thousand nodes.
>
>
> I have 600 datanodes, which makes me normally at 20 * log2'(600) = 185
> threads for the ClientRPC server (the one which listens on port 8020)
>
> $ sysctl -n net.core.somaxconn
>
> $ sysctl -n net.ipv4.tcp_max_syn_backlog
>
> $ sysctl -n net.core.netdev_max_backlog
>
>
> net.core.somaxconn= 8432
> net.ipv4.tcp_max_syn_backlog = 4096
> net.core.netdev_max_backlog = 2000
>
>
> $ netstat -an | grep -c SYN_RECV
>
> $ netstat -an | egrep -v "MYIP.(PORTS|IN|LISTEN)"  | wc -l
>
>
> I'll check again and get you more information.
>
> What do you see in the JN logs? and what about ZK logs?
> any logs in NN, ZK on the lines of "Slow sync'
>
> Didn't check these logs, going to check them and get back to you.
>
> What is the ZK heap?
>
> Zookeeper heap is 4 GB.
>
> Disk latency
> Heap
> maxClientCnxns=800 (At least) As you have 600 nodes, so you are expecting
> a high job workload)
> jute.maxbuffer=1GB (By default it is very low, especially in a kerberozied
> env, it must be bumped up). This setting is not there in HDP by default,
> you have to put under custom-zoo.cfg
>
>
> I'm going to check this also.
>
> If you can send me the NN, JN, ZK logs; more than happy to look into it.
>
> I can yes, I just need time to anonymize everything.
>
> Thanks again for your help.
>
> Best regards.
>
> T@le
>
>
>
> Le jeu. 24 févr. 2022 à 21:28, gurmukh singh <gu...@yahoo.com>
> a écrit :
>
> Also, as you are using hive/beeline. You can fetch all the config as:
>
> beeline -u "JDBC URL to connect to HS2 " --outputformat=tsv2 -e 'set -v' >
> /tmp/BeelineSet.out
>
> Please attach the BeelineSet.out
>
> On Friday, 25 February, 2022, 07:15:51 am GMT+11, gurmukh singh
> <gu...@yahoo.com.invalid> wrote:
>
>
> on ZK side
>
> Important things:
>
> Disk latency
> Heap
> maxClientCnxns=800 (At least) As you have 600 nodes, so you are expecting
> a high job workload)
> jute.maxbuffer=1GB (By default it is very low, especially in a kerberozied
> env, it must be bumped up). This setting is not there in HDP by default,
> you have to put under custom-zoo.cfg
>
>
> If you can send me the NN, JN, ZK logs; more than happy to look into it.
>
>
>
> On Friday, 25 February, 2022, 06:59:17 am GMT+11, gurmukh singh
> <gu...@yahoo.com.invalid> wrote:
>
>
> @Tale Hive you provided the details in the first email, missed it.
>
> Can you provide me the output of below from Namenode:
>
> $ sysctl -n net.core.somaxconn
>
> $ sysctl -n net.ipv4.tcp_max_syn_backlog
>
> $ sysctl -n net.core.netdev_max_backlog
>
> $ netstat -an | grep -c SYN_RECV
>
> $ netstat -an | egrep -v "MYIP.(PORTS|IN|LISTEN)"  | wc -l
>
>
> What do you see in the JN logs? and what about ZK logs?
> any logs in NN, ZK on the lines of "Slow sync'
> What is the ZK heap?
>
>
>
> On Friday, 25 February, 2022, 06:42:31 am GMT+11, gurmukh singh <
> gurmukh.dhillon@yahoo.com> wrote:
>
>
> I checked the heap of the namenode and there is no problem (I have 75 GB
> of max heap, I'm around 50 used GB).
>
>     Why 75GB heap size for NN? are you running a very large cluster?
>     50 GB of heap used? Can you check are talking about the NN heap itself
> or are you saying about the total mem used on the server?
>     50GB approx means 200 million blocks? do you have that many.
>
> I checked the threads of the clientRPC for the namenode and I'm at 200
> which respects the recommandations from hadoop operations book.
>     The formula is 20 X log base2(n); where n is the number of nodes.
>     So, if you have a thousand nodes we keep it to 200 (20 X
> log2(1024)=200) and then approx 20 threads per thousand nodes.
>
> I have serviceRPC enabled to prevent any problem which could be coming
> from datanodes or ZKFC.
>
>
> On Thursday, 24 February, 2022, 12:19:51 am GMT+11, Tale Hive <
> tale2.hive@gmail.com> wrote:
>
>
> Hello.
>
> According to what I saw this morning, I can see that I am in the first
> situation in fact :
>
>    - Client sends one packet with flag SYN to namenode
>    - Namenode sends one packet with flags SYN, ACK to the client
>    - Client sends n packets with flags PSH, ACK to the namenode, for each
>    subfolder
>    - Namenode sends n packets PSH, ACK to the client, for the content of
>    each subfolder
>
> So the number of (PSH, ACK) packets from the tcpdump pcap file is not what
> is filling the accept queue of port 8020 ClientRPC server on Namenode.
>
> I'm going to focus on checking the packets with SYN flag which arrive to
> the namenode.
> After that, because the jstack provokes active namenode failover, I don't
> have many more tracks to follow except increase again the listenQueue, to
> mitigate the problem, not to solve it.
>
> Best regards.
>
> T@le
>
>
>
> Le mer. 23 févr. 2022 à 11:46, Tale Hive <ta...@gmail.com> a écrit :
>
> Hello guys.
>
> Still investigating the tcpdump. I don't see a lot of packets with the
> flag SYN when the listenQueue is full.
> What I see is a lot of packets with the flag "PSH, ACK" with data inside
> like this :
> getListing.org.apache.hadoop.hdfs.protocol.ClientProtocol
> /apps/hive/warehouse/<mydb>.db/<mytable>/<mypartition>
>
> It makes me wonder, when a client perform an hdfs dfs -ls -R <HDFS_PATH>,
> how many SYN packets will it send to the namenode ? One in total or one by
> subfolder ?
> Let's say I have "n" subfolders inside <HDFS_PATH>. Will we have this
> situation :
> - Client sends one SYN packet to Namenode
> - Namenode sends one SYN-ACK packets to client
> - Client sends n ACK or (PSH, ACK) packets to Namenode
>
> Or this situation :
> - Client sends n SYN packet to Namenode
> - Namenode sends n SYN-ACK packets to client
> - Client sends n ACK or (PSH, ACK)
>
> It would mean an hdfs recursive listing on a path with a lot of subfolders
> could harm the other clients by sending too many packets to the namenode ?
>
> About the jstack, I tried it on the namenode JVM but it provoked a
> failover, as the namenode was not answering at all (in particular, no
> answer to ZKFC), and the jstack never ended, I had to kill it.
> I don't know if a kill -3 or a jstack -F could help, but at least jstack
> -F contains less valuable information.
>
> T@le
>
> Le mar. 22 févr. 2022 à 10:29, Amith sha <am...@gmail.com> a écrit :
>
> If TCP error occurs then you need to check the network metrics. Yes, TCP
> DUMP can help you.
>
>
> Thanks & Regards
> Amithsha
>
>
> On Tue, Feb 22, 2022 at 1:29 PM Tale Hive <ta...@gmail.com> wrote:
>
> Hello !
>
> @Amith sha <am...@gmail.com>
> I checked also the system metrics, nothing wrong in CPU, RAM or IO.
> The only thing I found was these TCP errors (ListenDrop).
>
> @HK
> I'm monitoring a lot of JVM metrics like this one :
> "UnderReplicatedBlocks" in the bean
> "Hadoop:service=NameNode,name=FSNamesystem".
> And I found no under replicated blocks when the problem of timeout occurs,
> unfortunately.
> Thanks for you advice, in addition to the tcpdump, I'll perform some
> jstacks to see if I can find what ipc handlers are doing.
>
> Best regards.
>
> T@le
>
>
>
>
>
>
> Le mar. 22 févr. 2022 à 04:30, HK <he...@gmail.com> a écrit :
>
> Hi Tape,
> Could you please thread dump of namenode process. Could you please check
> what ipc handlers are doing.
>
> We faced similar issue when the under replication is high in the cluster
> due to filesystem wirteLock.
>
> On Tue, 22 Feb 2022, 8:37 am Amith sha, <am...@gmail.com> wrote:
>
> Check your system metrics too.
>
> On Mon, Feb 21, 2022, 10:52 PM Tale Hive <ta...@gmail.com> wrote:
>
> Yeah, next step is for me to perform a tcpdump just when the problem
> occurs.
> I want to know if my namenode does not accept connections because it
> freezes for some reasons or because there is too many connections at a time.
>
> My delay if far worse than 2s, sometimes, an hdfs dfs -ls -d
> /user/<my-user> takes 20s, 43s and rarely it is even bigger than 1 minut.
> And during this time, CallQueue is OK, Heap is OK, I don't find any
> metrics which could show me a problem inside the namenode JVM.
>
> Best regards.
>
> T@le
>
> Le lun. 21 févr. 2022 à 16:32, Amith sha <am...@gmail.com> a écrit :
>
> If you still concerned about the delay of > 2 s then you need to do
> benchmark with and without load. To find the root cause of the problem it
> will help.
>
> On Mon, Feb 21, 2022, 1:52 PM Tale Hive <ta...@gmail.com> wrote:
>
> Hello Amith.
>
> Hm, not a bad idea. If I increase the size of the listenQueue and if I
> increase timeout, the combination of both may mitigate more the problem
> than just increasing listenQueue size.
> It won't solve the problem of acceptance speed, but it could help.
>
> Thanks for the suggestion !
>
> T@le
>
> Le lun. 21 févr. 2022 à 02:33, Amith sha <am...@gmail.com> a écrit :
>
> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while
> waiting for channel to be ready for connect.
> Connection timed out after 20000 milli sec i suspect this value is very
> low for a namenode with 75Gb of heap usage. Can you increase the value to
> 5sec and check the connection. To increase the value modify this property
> ipc.client.rpc-timeout.ms - core-site.xml (If not present then add to the
> core-site.xml)
>
>
> Thanks & Regards
> Amithsha
>
>
> On Fri, Feb 18, 2022 at 9:17 PM Tale Hive <ta...@gmail.com> wrote:
>
> Hello Tom.
>
> Sorry for my absence of answers, I don't know why gmail puts your mail
> into spam -_-.
>
> To answer you :
>
>    - The metrics callQueueLength, avgQueueTime, avgProcessingTime and GC
>    metric are all OK
>    - Threads are plenty sufficient (I can see the metrics also for them
>    and I  am below 200, the number I have for 8020 RPC server)
>
> Did you see my other answers about this problem ?
> I would be interested to have your opinion about that !
>
> Best regards.
>
> T@le
>
>
> Le mar. 15 févr. 2022 à 02:16, tom lee <to...@gmail.com> a écrit :
>
> It might be helpful to analyze namenode metrics and logs.
>
> What about some key metrics? Examples are callQueueLength, avgQueueTime,
> avgProcessingTime and GC metrics.
>
> In addition, is the number of threads(dfs.namenode.service.handler.count)
> in the namenode sufficient?
>
> Hopefully this will help.
>
> Best regards.
> Tom
>
> Tale Hive <ta...@gmail.com> 于2022年2月14日周一 23:57写道:
>
> Hello.
>
> I encounter a strange problem with my namenode. I have the following
> architecture :
> - Two namenodes in HA
> - 600 datanodes
> - HDP 3.1.4
> - 150 millions of files and folders
>
> Sometimes, when I query the namenode with the hdfs client, I got a timeout
> error like this :
> hdfs dfs -ls -d /user/myuser
>
> 22/02/14 15:07:44 INFO retry.RetryInvocationHandler:
> org.apache.hadoop.net.ConnectTimeoutException: Call From
> <my-client-hostname>/<my-client-ip> to <active-namenode-hostname>:8020
> failed on socket timeout exception:
>   org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout
> while waiting for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending
> remote=<active-namenode-hostname>/<active-namenode-ip>:8020];
>   For more details see:  http://wiki.apache.org/hadoop/SocketTimeout,
> while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over
> <active-namenode-hostname>/<active-namenode-ip>:8020 after 2 failover
> attempts. Trying to failover after sleeping for 2694ms.
>
> I checked the heap of the namenode and there is no problem (I have 75 GB
> of max heap, I'm around 50 used GB).
> I checked the threads of the clientRPC for the namenode and I'm at 200
> which respects the recommandations from hadoop operations book.
> I have serviceRPC enabled to prevent any problem which could be coming
> from datanodes or ZKFC.
> General resources seems OK, CPU usage is pretty fine, same for memory,
> network or IO.
> No firewall is enabled on my namenodes nor my client.
>
> I was wondering what could cause this problem, please ?
>
> Thank you in advance for your help !
>
> Best regards.
>
> T@le
>
>