Posted to user@hadoop.apache.org by Wei-Chiu Chuang <we...@cloudera.com.INVALID> on 2018/12/04 12:34:36 UTC

Re: HDFS DirectoryScanner is bothering me

Do you have a heapdump? Without a heapdump it's not easy to definitively
point to DirectoryScanner for GC issues.

That said, I did notice DirectoryScanner holding the global lock for quite a
few seconds periodically, but that's unrelated to GC.

On Thu, Nov 29, 2018 at 12:56 AM Yen-Onn Hiu <ye...@gmail.com> wrote:

> hi all,
>
> I am on Hadoop 2.6.0-cdh5.8.0. I discovered that the DirectoryScanner
> keeps causing Java GC pauses and slowing down the Hadoop nodes. Digging
> into the log file I discovered this:
>
> 2018-11-29 13:34:37,995 INFO
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool
> BP-1850109981-192.168.1.1-1413178082983 Total blocks: 3896197, missing
> metadata files:214, missing block files:214, missing blocks in memory:103,
> mismatched blocks:0
>
> From postings on the internet, this appears to come from the
> DirectoryScanner, which runs every 6 hours. This directory scan causes a
> GC hiccup on all nodes and degrades the performance of the cluster.
>
> Question: when I run hdfs dfsadmin -report, it does not report any
> corrupt files. I also ran hdfs fsck / and it does not report any
> problems. How can I find out what the missing block files, missing blocks
> in memory, and missing metadata files are?
>
>
> Thanks!
>
> --
> Hiu Yen Onn
>
>
>

Re: HDFS DirectoryScanner is bothering me

Posted by Yen-Onn Hiu <ye...@gmail.com>.
hi Wei-Chiu,

Good to hear from you!

I have gathered some facts in the hope of showing you something solid.

[image: Screen Shot 2561-12-05 at 08.41.42.png]

[image: Screen Shot 2561-12-05 at 09.21.17.png]

These are the Graphite metrics I gathered from the JvmMetrics GcTimeMillis
counter. Every 6 hours I can see a GC hiccup of 3 to 4 seconds on every
datanode, matching dfs.datanode.directoryscan.interval, which defaults to
21600 seconds. Following each hiccup, when I traced the log file I could
clearly see this line:

 INFO org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool
BP-1850109981-192.168.1.1-1413178082983 Total blocks: 3894663, missing
metadata files:122, missing block files:122, missing blocks in memory:57,
mismatched blocks:0
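
To correlate those hiccups with the graph, I poll GcTimeMillis straight
from the DataNode JMX servlet. A minimal sketch, assuming the default
DataNode HTTP port of 50075:

# JvmMetrics exposes GcTimeMillis, the cumulative GC time the graph above
# is built from; diffing successive samples gives the cost per interval.
curl -s 'http://localhost:50075/jmx?qry=Hadoop:service=DataNode,name=JvmMetrics'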

Looking at the GC log, which I have enabled, I can clearly see that a
great deal of time was spent on object copy.

[GC cleanup 4452M->2474M(8192M), 0.0124395 secs]

 [Times: user=0.05 sys=0.00, real=0.01 secs]

2018-11-29T01:53:23.204+0700: 40161.800: [GC concurrent-cleanup-start]

2018-11-29T01:53:23.214+0700: 40161.810: [GC concurrent-cleanup-end,
0.0104121 secs]

2018-11-29T01:53:29.107+0700: 40167.702: [GC pause (G1 Evacuation Pause)
(young) 40167.702: [G1Ergonomics (CSet Construction) start choosing CSet,
_pending_cards: 65846, predicted base time: 30.37 ms, remaining time:
169.63 ms, target pause time: 200.00 ms]

 40167.702: [G1Ergonomics (CSet Construction) add young regions to CSet,
eden: 131 regions, survivors: 31 regions, predicted young region time:
296.21 ms]

 40167.702: [G1Ergonomics (CSet Construction) finish choosing CSet, eden:
131 regions, survivors: 31 regions, old: 0 regions, predicted pause time:
326.58 ms, target pause time: 200.00 ms]

 40168.874: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason:
reclaimable percentage not over threshold, candidate old regions: 138
regions, reclaimable: 496648080 bytes (5.78 %), threshold: 10.00 %]

, 1.1718634 secs]

   [Parallel Time: 1166.4 ms, GC Workers: 8]

      [GC Worker Start (ms): Min: 40167703.2, Avg: 40167703.3, Max:
40167703.3, Diff: 0.1]

      [Ext Root Scanning (ms): Min: 0.6, Avg: 0.8, Max: 1.0, Diff: 0.4,
Sum: 6.2]

      [Update RS (ms): Min: 68.1, Avg: 68.2, Max: 68.3, Diff: 0.2, Sum:
545.9]

         [Processed Buffers: Min: 24, Avg: 52.6, Max: 74, Diff: 50, Sum:
421]

      [Scan RS (ms): Min: 23.9, Avg: 24.2, Max: 24.7, Diff: 0.8, Sum: 193.6]

      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
Sum: 0.0]

      *[Object Copy (ms): Min: 1013.7, Avg: 1015.8, Max: 1020.4, Diff: 6.7,
Sum: 8126.1]*

      [Termination (ms): Min: 51.8, Avg: 56.8, Max: 58.8, Diff: 7.0, Sum:
454.6]

         [Termination Attempts: Min: 53698, Avg: 56791.9, Max: 58670, Diff:
4972, Sum: 454335]

      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum:
0.5]

      [GC Worker Total (ms): Min: 1165.8, Avg: 1165.9, Max: 1166.0, Diff:
0.2, Sum: 9327.0]

      [GC Worker End (ms): Min: 40168869.1, Avg: 40168869.2, Max:
40168869.2, Diff: 0.1]

   [Code Root Fixup: 0.0 ms]

   [Code Root Purge: 0.0 ms]

   [Clear CT: 0.2 ms]

   [Other: 5.2 ms]

      [Choose CSet: 0.0 ms]

      [Ref Proc: 2.4 ms]

      [Ref Enq: 0.1 ms]

      [Redirty Cards: 0.8 ms]

      [Humongous Register: 0.2 ms]

      [Humongous Reclaim: 0.0 ms]

      [Free CSet: 0.8 ms]

   [Eden: 524.0M(284.0M)->0.0B(356.0M) Survivors: 124.0M->52.0M Heap:
2809.1M(8192.0M)->2387.1M(8192.0M)]

 [Times: user=8.90 sys=0.19, real=1.17 secs]
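
To pick out the worst pauses, I run a quick sketch over the GC log (the
path comes from the -Xloggc option in my opts below):

# Extract the wall-clock ("real") pause times and list the largest ones.
grep -oE 'real=[0-9.]+' /var/log/hadoop-hdfs/datanode-gc.log \
  | cut -d= -f2 | sort -rn | head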


These are my Java opts:


-XX:+UseG1GC -Xmx8g -Xms8g -XX:MaxGCPauseMillis=200 -XX:ParallelGCThreads=8
-XX:+UnlockExperimentalVMOptions -XX:+ParallelRefProcEnabled
-XX:InitiatingHeapOccupancyPercent=50 -XX:G1MixedGCLiveThresholdPercent=70
-XX:G1HeapWastePercent=10 -Xloggc:/var/log/hadoop-hdfs/datanode-gc.log
-XX:G1LogLevel=finest -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
-XX:+PrintTenuringDistribution -XX:+PrintHeapAtGC
-XX:+PrintAdaptiveSizePolicy
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.port=8004
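
If some of these pauses turn out to be safepoint stalls rather than GC
work itself, I may also add these standard HotSpot flags next time (not
something I have enabled yet):

# Log how long the application was stopped at safepoints vs. running.
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime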


Together with this, I have generated a heapdump. Since the dump file is
too big (357 MB) to attach to this email, I have attached some screenshots
instead. I am not sure whether my guess is correct: the big objects
referenced in the GC log might be FsDatasetImpl and ReplicaMap. I am only
speculating about those Java objects.


[image: Screen Shot 2561-12-05 at 08.53.07.png]

[image: Screen Shot 2561-12-05 at 08.53.35.png]
[image: Screen Shot 2561-12-05 at 09.08.10.png]

[image: Screen Shot 2561-12-05 at 08.58.20.png]
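
In case it helps, I can also take a class histogram instead of shipping
the full dump. A sketch, where <pid> stands for the DataNode process id:

# Print the top classes by live instance count and shallow size; this
# should confirm (or rule out) my FsDatasetImpl/ReplicaMap speculation.
jmap -histo:live <pid> | head -n 20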


Please let me know if there is anything I can do. I have gone through this
link (
https://blog.cloudera.com/blog/2016/12/hdfs-datanode-scanners-and-disk-checker-explained/)
many times to understand the behaviour of DirectoryScanner, BlockScanner,
and VolumeScanner. Here are some questions I still want answered.


1. > INFO org.apache.hadoop.hdfs.server.datanode.DirectoryScanner:
BlockPool BP-1850109981-192.168.1.1-1413178082983 Total blocks: 3894663,
missing metadata files:122, missing block files:122, missing blocks in
memory:57, mismatched blocks:0

What causes the missing metadata files, missing block files, and missing
blocks in memory reported in this line?


2. What would be the effect of increasing
dfs.datanode.directoryscan.interval from 6 hours to 12 hours? Would it do
any harm to the system? A sketch of what I would change is below.
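
For reference, this is what I would check and change, assuming the
property is read from hdfs-site.xml on each datanode (the value is in
seconds):

# Confirm the effective value on a datanode (21600 = 6 hours).
hdfs getconf -confKey dfs.datanode.directoryscan.interval

# Then set 43200 (12 hours) in hdfs-site.xml and restart the datanodes:
#   <property>
#     <name>dfs.datanode.directoryscan.interval</name>
#     <value>43200</value>
#   </property>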


3. Would block scanning fix this? It defaults to running only once every 3
weeks. Can I trigger the scan explicitly from the command line?
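
For what it's worth, the 3-week default I am referring to is
dfs.datanode.scan.period.hours (504 hours); I can at least confirm what my
cluster uses, though I have not found a command to force a scan:

hdfs getconf -confKey dfs.datanode.scan.period.hours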


Thank you so much for your attention.







On Tue, Dec 4, 2018 at 8:35 PM Wei-Chiu Chuang <we...@cloudera.com> wrote:

> Do you have a heapdump? Without a heapdump it's not easy to definitively
> point to DirectoryScanner for GC issues.
>
> That said, I did notice DirectoryScanner holding the global lock for quite a
> few seconds periodically, but that's unrelated to GC.

-- 
Hiu Yen Onn