Posted to common-issues@hadoop.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2009/11/05 00:55:32 UTC
[jira] Updated: (HADOOP-3205) FSInputChecker and FSOutputSummer should allow better access to user buffer

     [ https://issues.apache.org/jira/browse/HADOOP-3205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HADOOP-3205:
--------------------------------

    Attachment: hadoop-3205.txt

Here's a patch that implements the FSInputChecker side of this ticket.

Benchmark results are promising. I put a 700MB file in /dev/shm with its associated checksum and then timed "hadoop fs -cat /dev/shm/bigfile" 100 times with the patch and without the patch. Here is R output from the analysis of these times:

{noformat}
> p.user <- read.table(file="/tmp/times.patch.user")
> p.sys <- read.table(file="/tmp/times.patch.sys")
> p.wall <- read.table(file="/tmp/times.patch.wall")
> t.user <- read.table(file="/tmp/times.trunk.user")
> t.sys <- read.table(file="/tmp/times.trunk.sys")
> t.wall <- read.table(file="/tmp/times.trunk.wall")
> t.test(t.user,p.user,alternative="greater")

        Welch Two Sample t-test

data:  t.user and p.user 
t = 21.0552, df = 134.54, p-value < 2.2e-16
alternative hypothesis: true difference in means is greater than 0 
95 percent confidence interval:
 0.4654936       Inf 
sample estimates:
mean of x mean of y 
 3.713000  3.207763 

> 3.2077/3.713
[1] 0.8639106
> t.test(t.sys,p.sys,alternative="greater")

        Welch Two Sample t-test

data:  t.sys and p.sys 
t = 1.3567, df = 137.286, p-value = 0.08856
alternative hypothesis: true difference in means is greater than 0 
95 percent confidence interval:
 -0.003768599          Inf 
sample estimates:
mean of x mean of y 
 0.980500  0.963421 

> t.test(t.wall,p.wall,alternative="greater")

        Welch Two Sample t-test

data:  t.wall and p.wall 
t = 6.5711, df = 118.318, p-value = 7.034e-10
alternative hypothesis: true difference in means is greater than 0 
95 percent confidence interval:
 0.3020628       Inf 
sample estimates:
mean of x mean of y 
 7.667800  7.263816
{noformat}

To interpret the results for those who don't know R:
- User time is reduced with very high confidence (p < 2.2e-16). With 95% confidence it is reduced by at least 0.465s, which is 12.5% of the trunk mean.
- Sys time is not significantly reduced (p = 0.089 > 0.05). This is consistent with our expectation that we're doing the same number of syscalls, just avoiding buffer copies in user space.
- Wall clock time is reduced with very high confidence (p = 7.0e-10). With 95% confidence it is reduced by at least 0.302s, which is 3.9% of the trunk mean.
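
For anyone who wants to check the percentages, they are just the lower bound of each one-sided confidence interval divided by the trunk mean. A minimal Java sketch of that arithmetic follows; the class and method names are made up for illustration, and the numbers are copied from the t.test output above.

```java
import java.util.Locale;

public class ReductionBounds {
    /**
     * Lower bound of the saving expressed as a fraction of the trunk mean.
     * Both arguments are in seconds.
     */
    public static double relativeReduction(double lowerBoundSeconds, double trunkMeanSeconds) {
        return lowerBoundSeconds / trunkMeanSeconds;
    }

    public static void main(String[] args) {
        // Confidence-interval lower bounds and trunk means from the R output.
        double user = relativeReduction(0.4654936, 3.713000);
        double wall = relativeReduction(0.3020628, 7.667800);
        System.out.printf(Locale.ROOT, "user: at least %.1f%%%n", 100 * user);
        System.out.printf(Locale.ROOT, "wall: at least %.1f%%%n", 100 * wall);
    }
}
```

Note that these are conservative bounds. The point estimate for user time (3.208 / 3.713, as computed in the R session) corresponds to a reduction of about 13.6%.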

I didn't include the R output, but analysis of the "CPU%" column of the "time" results gives very high confidence of a reduction in CPU utilization, and 95% confidence of a reduction of at least 3.34%.

The patch itself can probably be improved - just wanted to get early comments. I did briefly test that HDFS still functions, but have not run through all the unit tests. I also want to rerun the above benchmarks with io.file.buffer.size tuned up to 64K or 128K as most people do in production.

> FSInputChecker and FSOutputSummer should allow better access to user buffer
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-3205
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3205
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>            Reporter: Raghu Angadi
>            Assignee: Raghu Angadi
>         Attachments: hadoop-3205.txt
>
>
> Implementations of FSInputChecker and FSOutputSummer like DFS do not have access to the full user buffer. At any time DFS can access only up to 512 bytes, even though the user usually reads with a much larger buffer (often controlled by io.file.buffer.size). This forces an implementation to double buffer data if it wants to read or write larger chunks of data from the underlying storage.
> We could separate changes for FSInputChecker and FSOutputSummer into two separate jiras.
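
The double buffering described in the issue can be pictured with a small standalone Java sketch. This is not code from the attached patch and does not use any Hadoop classes. It assumes one CRC32 per 512-byte chunk, uses a byte array as a stand-in for the underlying storage, and all class and method names are hypothetical. The first read path stages every chunk in a 512-byte buffer before copying it to the caller; the second fills the caller's buffer in one go and verifies the chunks in place.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

public class ChunkedChecksumSketch {
    // Assumed chunk size, matching the 512-byte limit mentioned in the issue.
    static final int BYTES_PER_CHECKSUM = 512;

    /** One CRC32 value per 512-byte chunk; the last chunk may be shorter. */
    public static long[] checksums(byte[] data) {
        int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int off = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
            crc.reset();
            crc.update(data, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    private static void verify(byte[] buf, int off, int len, long expected, int chunk) {
        CRC32 crc = new CRC32();
        crc.update(buf, off, len);
        if (crc.getValue() != expected) {
            throw new IllegalStateException("checksum mismatch in chunk " + chunk);
        }
    }

    private static void requireCapacity(byte[] source, byte[] userBuf) {
        if (userBuf.length < source.length) {
            throw new IllegalArgumentException("user buffer too small");
        }
    }

    /** Double-buffered path: stage each chunk, verify it, then copy it to the user buffer. */
    public static int readViaChunkBuffer(byte[] source, long[] sums, byte[] userBuf) {
        requireCapacity(source, userBuf);
        byte[] chunkBuf = new byte[BYTES_PER_CHECKSUM];
        for (int i = 0; i < sums.length; i++) {
            int off = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, source.length - off);
            System.arraycopy(source, off, chunkBuf, 0, len);   // small "read" from storage
            verify(chunkBuf, 0, len, sums[i], i);
            System.arraycopy(chunkBuf, 0, userBuf, off, len);  // extra user-space copy
        }
        return source.length;
    }

    /** Direct path: one bulk "read" into the user buffer, then verify each chunk in place. */
    public static int readDirect(byte[] source, long[] sums, byte[] userBuf) {
        requireCapacity(source, userBuf);
        System.arraycopy(source, 0, userBuf, 0, source.length);
        for (int i = 0; i < sums.length; i++) {
            int off = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, source.length - off);
            verify(userBuf, off, len, sums[i], i);
        }
        return source.length;
    }

    public static void main(String[] args) {
        byte[] data = new byte[2000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 31);
        }
        long[] sums = checksums(data);
        byte[] viaChunk = new byte[data.length];
        byte[] direct = new byte[data.length];
        readViaChunkBuffer(data, sums, viaChunk);
        readDirect(data, sums, direct);
        System.out.println("same bytes: " + Arrays.equals(viaChunk, direct));
    }
}
```

Both paths deliver the same bytes and perform the same checksum work. The difference is the per-chunk copy through the staging buffer, which is the kind of user-space copy the benchmark above suggests is worth avoiding.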
