Posted to issues@ozone.apache.org by "Rakesh Radhakrishnan (Jira)" <ji...@apache.org> on 2020/06/29 12:13:00 UTC

[jira] [Updated] (HDDS-3900) Update default value for 'ozone.om.ratis.segment.size' and 'preallocated.size' to improve OM write perf

     [ https://issues.apache.org/jira/browse/HDDS-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rakesh Radhakrishnan updated HDDS-3900:
---------------------------------------
    Description: 
Based on *OM* performance tests on HDDs with a write-heavy workload ({{Synthetic NNLoadGenerator}}) in a single-node HA setup, the default 16KB Ratis segment size becomes a bottleneck that limits OM write performance.

Below is the IOSTAT output with 16KB segment.size and 16KB segment.preallocated.size, which causes high {{w_await}} times; very little batching (batch sizes < 5) occurred in OM for most of the run.
{code:java}
sdd: RATIS DISK
Device:  rrqm/s   wrqm/s  r/s   w/s    rMB/s  wMB/s  avgrq-sz  avgqu-sz   await   r_await     w_await    svctm    %util
sdd      0.00     0.00   0.00  138.00  0.00   1.27   18.88     21.99      65.25    0.00       65.25      6.88     94.90
sdd      0.00     0.00   0.00  103.00  0.00   1.07   21.23     40.36      918.25   0.00       918.25     9.72     100.10
sdd      0.00     0.00   0.00  104.00  0.00   1.04   20.55     30.08      1388.23  0.00       1388.23    9.62     100.10
sdd      0.00     0.00   0.00  396.00  0.00   1.55   8.00      136.50     285.30   0.00       285.30     2.40      94.90
{code}
 
Below is the IOSTAT output with 16MB segment.size and 16MB segment.preallocated.size, which minimizes the {{w_await}} time. This gives a good performance improvement on traditional HDDs by allowing more sync batching.
{code:java}
sdd: RATIS DISK
Device:  rrqm/s  wrqm/s  r/s    w/s      rMB/s   wMB/s   avgrq-sz  avgqu-sz  await  r_await   w_await   svctm   %util
sdd      0.00    0.00    0.00   125.74   0.00    19.85   323.34    3.05      24.28  0.00      24.28     7.17    90.10
sdd      0.00    0.00    0.00   128.00   0.00    19.76   316.12    3.31      25.91  0.00      25.91     7.14    91.40
sdd      0.00    0.00    0.00   115.00   0.00    4.59    81.81     0.93      8.10   0.00      8.10      8.04    92.50
sdd      0.00    0.00    0.00   111.00   0.00    4.53    83.57     0.90      8.12   0.00      8.12      8.14    90.30
sdd      0.00    0.00    0.00   115.00   0.00    4.64    82.64     0.93      8.08   0.00      8.08      8.10    93.20
{code}
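As a rough sanity check on the batching claim (back-of-the-envelope arithmetic, not part of the test output): iostat reports {{avgrq-sz}} in 512-byte sectors, so the average write grows from about 9KB per request in the 16KB run to over 160KB per request in the 16MB run, i.e. far more log data is flushed per disk sync. A small sketch of that conversion:
{code:java}
// Convert avgrq-sz (reported in 512-byte sectors) to KB per write request.
// Input values are taken from the first data row of the two tables above.
public class AvgReqSize {
  public static void main(String[] args) {
    double sectorBytes = 512.0;
    double avgrq16KBRun = 18.88;   // 16KB segment.size run
    double avgrq16MBRun = 323.34;  // 16MB segment.size run
    System.out.printf("16KB run: ~%.1f KB/write%n", avgrq16KBRun * sectorBytes / 1024);  // ~9.4 KB
    System.out.printf("16MB run: ~%.1f KB/write%n", avgrq16MBRun * sectorBytes / 1024);  // ~161.7 KB
  }
}
{code}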
 

Below is the IOSTAT output with 4MB segment.size and 4MB segment.preallocated.size, which also keeps the {{w_await}} time low.
{code:java}
Device:  rrqm/s   wrqm/s  r/s    w/s       rMB/s   wMB/s  avgrq-sz  avgqu-sz  await   r_await  w_await   svctm  %util
sdd      0.00     0.00    0.00  115.00     0.00    6.08   108.34     0.99     8.57    0.00     8.57      8.10   93.20
sdd      0.00     0.00    0.00  122.00     0.00    7.81   131.13     1.48     12.15   0.00     12.15     7.80   95.10
sdd      0.00     0.00    0.00  115.00     0.00    7.81   139.04     1.05     9.09    0.00     9.09      8.10   93.20
sdd      0.00     0.00    0.00  115.00     0.00    7.85   139.78     1.04     8.95    0.00     8.95      8.03   92.30
sdd      0.00     0.00    0.00  114.00     0.00    5.83   104.70     0.97     8.57    0.00     8.57      7.97   90.90
sdd      0.00     0.00    0.00  115.00     0.00    7.80   138.92     1.05     9.10    0.00     9.10      8.11   93.30
sdd      0.00     0.00    0.00  119.00     0.00    7.93   136.47     1.72     14.41   0.00     14.41     7.81   92.90
{code}
 

The recommended config should be a value in MBs, probably *higher than 2MB or 4MB*.
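For anyone who wants to experiment before the defaults change, a minimal sketch of overriding the two keys programmatically is below. This is illustrative only: the full name {{ozone.om.ratis.segment.preallocated.size}} is inferred from the abbreviated title, and the "4MB" storage-size syntax should be verified against ozone-default.xml.
{code:java}
import org.apache.hadoop.hdds.conf.OzoneConfiguration;

public class OmRatisSegmentOverride {
  public static void main(String[] args) {
    OzoneConfiguration conf = new OzoneConfiguration();
    // Segment size key is from the issue title; the preallocated key name is inferred.
    conf.set("ozone.om.ratis.segment.size", "4MB");
    conf.set("ozone.om.ratis.segment.preallocated.size", "4MB");
    System.out.println("segment.size = " + conf.get("ozone.om.ratis.segment.size"));
  }
}
{code}
In a real deployment the same two properties would normally go into ozone-site.xml rather than being set in code.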

  was:
Based on OM performance tests with a write-heavy workload ({{Synthetic NNLoadGenerator}}) in a single-node HA setup, the default 16KB Ratis segment size becomes a bottleneck that limits OM performance.

Below is the IOSTAT output with 16KB segment.size and 16KB segment.preallocated.size, which causes high {{w_await}} times; very little batching (batch sizes < 5) occurred in OM for most of the run.
{code:java}
sdd: RATIS DISK
Device:  rrqm/s   wrqm/s  r/s   w/s    rMB/s  wMB/s  avgrq-sz  avgqu-sz   await   r_await     w_await    svctm    %util
sdd      0.00     0.00   0.00  138.00  0.00   1.27   18.88     21.99      65.25    0.00       65.25      6.88     94.90
sdd      0.00     0.00   0.00  103.00  0.00   1.07   21.23     40.36      918.25   0.00       918.25     9.72     100.10
sdd      0.00     0.00   0.00  104.00  0.00   1.04   20.55     30.08      1388.23  0.00       1388.23    9.62     100.10
sdd      0.00     0.00   0.00  396.00  0.00   1.55   8.00      136.50     285.30   0.00       285.30     2.40      94.90
{code}
 
Below is the IOSTAT output with 16MB segment.size and 16MB segment.preallocated.size, which minimizes the {{w_await}} time. This gives a good performance improvement on traditional HDDs by allowing more sync batching.
{code:java}
sdd: RATIS DISK
Device:  rrqm/s  wrqm/s  r/s    w/s      rMB/s   wMB/s   avgrq-sz  avgqu-sz  await  r_await   w_await   svctm   %util
sdd      0.00    0.00    0.00   125.74   0.00    19.85   323.34    3.05      24.28  0.00      24.28     7.17    90.10
sdd      0.00    0.00    0.00   128.00   0.00    19.76   316.12    3.31      25.91  0.00      25.91     7.14    91.40
sdd      0.00    0.00    0.00   115.00   0.00    4.59    81.81     0.93      8.10   0.00      8.10      8.04    92.50
sdd      0.00    0.00    0.00   111.00   0.00    4.53    83.57     0.90      8.12   0.00      8.12      8.14    90.30
sdd      0.00    0.00    0.00   115.00   0.00    4.64    82.64     0.93      8.08   0.00      8.08      8.10    93.20
{code}
 

Below is the IOSTAT output with 4MB segment.size and 4MB segment.preallocated.size, which minimizes the {{w_await}} time.
{code:java}
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdd               0.00     0.00    0.00  115.00     0.00     6.08   108.34     0.99    8.57    0.00    8.57   8.10  93.20
sdd               0.00     0.00    0.00  122.00     0.00     7.81   131.13     1.48   12.15    0.00   12.15   7.80  95.10
sdd               0.00     0.00    0.00  115.00     0.00     7.81   139.04     1.05    9.09    0.00    9.09   8.10  93.20
sdd               0.00     0.00    0.00  115.00     0.00     7.85   139.78     1.04    8.95    0.00    8.95   8.03  92.30
sdd               0.00     0.00    0.00  114.00     0.00     5.83   104.70     0.97    8.57    0.00    8.57   7.97  90.90
sdd               0.00     0.00    0.00  115.00     0.00     7.80   138.92     1.05    9.10    0.00    9.10   8.11  93.30
sdd               0.00     0.00    0.00  119.00     0.00     7.93   136.47     1.72   14.41    0.00   14.41   7.81  92.90
{code}
 

The recommended config should be a value in MBs, probably *higher than 4MB*.


> Update default value for 'ozone.om.ratis.segment.size' and 'preallocated.size' to improve OM write perf
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-3900
>                 URL: https://issues.apache.org/jira/browse/HDDS-3900
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: Rakesh Radhakrishnan
>            Assignee: Rakesh Radhakrishnan
>            Priority: Major
>
> Based on *OM* performance tests on HDDs with a write-heavy workload ({{Synthetic NNLoadGenerator}}) in a single-node HA setup, the default 16KB Ratis segment size becomes a bottleneck that limits OM write performance.
> Below is the IOSTAT output with 16KB segment.size and 16KB segment.preallocated.size, which causes high {{w_await}} times; very little batching (batch sizes < 5) occurred in OM for most of the run.
> {code:java}
> sdd: RATIS DISK
> Device:  rrqm/s   wrqm/s  r/s   w/s    rMB/s  wMB/s  avgrq-sz  avgqu-sz   await   r_await     w_await    svctm    %util
> sdd      0.00     0.00   0.00  138.00  0.00   1.27   18.88     21.99      65.25    0.00       65.25      6.88     94.90
> sdd      0.00     0.00   0.00  103.00  0.00   1.07   21.23     40.36      918.25   0.00       918.25     9.72     100.10
> sdd      0.00     0.00   0.00  104.00  0.00   1.04   20.55     30.08      1388.23  0.00       1388.23    9.62     100.10
> sdd      0.00     0.00   0.00  396.00  0.00   1.55   8.00      136.50     285.30   0.00       285.30     2.40      94.90
> {code}
>  
> Below is the IOSTAT output with 16MB segment.size and 16MB segment.preallocated.size, which minimizes the {{w_await}} time. This gives a good performance improvement on traditional HDDs by allowing more sync batching.
> {code:java}
> sdd: RATIS DISK
> Device:  rrqm/s  wrqm/s  r/s    w/s      rMB/s   wMB/s   avgrq-sz  avgqu-sz  await  r_await   w_await   svctm   %util
> sdd      0.00    0.00    0.00   125.74   0.00    19.85   323.34    3.05      24.28  0.00      24.28     7.17    90.10
> sdd      0.00    0.00    0.00   128.00   0.00    19.76   316.12    3.31      25.91  0.00      25.91     7.14    91.40
> sdd      0.00    0.00    0.00   115.00   0.00    4.59    81.81     0.93      8.10   0.00      8.10      8.04    92.50
> sdd      0.00    0.00    0.00   111.00   0.00    4.53    83.57     0.90      8.12   0.00      8.12      8.14    90.30
> sdd      0.00    0.00    0.00   115.00   0.00    4.64    82.64     0.93      8.08   0.00      8.08      8.10    93.20
> {code}
>  
> Below is the IOSTAT output with 4MB segment.size and 4MB segment.preallocated.size, which also keeps the {{w_await}} time low.
> {code:java}
> Device:  rrqm/s   wrqm/s  r/s    w/s       rMB/s   wMB/s  avgrq-sz  avgqu-sz  await   r_await  w_await   svctm  %util
> sdd      0.00     0.00    0.00  115.00     0.00    6.08   108.34     0.99     8.57    0.00     8.57      8.10   93.20
> sdd      0.00     0.00    0.00  122.00     0.00    7.81   131.13     1.48     12.15   0.00     12.15     7.80   95.10
> sdd      0.00     0.00    0.00  115.00     0.00    7.81   139.04     1.05     9.09    0.00     9.09      8.10   93.20
> sdd      0.00     0.00    0.00  115.00     0.00    7.85   139.78     1.04     8.95    0.00     8.95      8.03   92.30
> sdd      0.00     0.00    0.00  114.00     0.00    5.83   104.70     0.97     8.57    0.00     8.57      7.97   90.90
> sdd      0.00     0.00    0.00  115.00     0.00    7.80   138.92     1.05     9.10    0.00     9.10      8.11   93.30
> sdd      0.00     0.00    0.00  119.00     0.00    7.93   136.47     1.72     14.41   0.00     14.41     7.81   92.90
> {code}
>  
> The recommended config should be a value in MBs, probably *higher than 2MB or 4MB*.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org