Posted to common-user@hadoop.apache.org by Manhee Jo <jo...@nttdocomo.com> on 2009/08/13 09:04:21 UTC

fuse-dfs then samba mount

Hi all,

I've succeeded in accessing HDFS files from Windows XP through a fuse-dfs 
mount shared over Samba.
When I copied (read and write) a 1GB text file from fuse-dfs over Samba, it 
took around 50 seconds.
Then I used "dfs -get" to copy the same file to a data node's local file 
system and copied it from the data node (without fuse-dfs this time) over 
Samba, which took around 30 seconds.
Since the disk reads are parallel and distributed, shouldn't reading through 
fuse-dfs be faster than reading from one node?
Well, I know it must depend on the file size. So here are my questions.
What actually happens during a fuse-dfs read? And in Samba?
Above what file size might fuse-dfs win?


Thanks,
Manhee








Re: fuse-dfs then samba mount

Posted by Todd Lipcon <to...@cloudera.com>.
On Thu, Aug 13, 2009 at 12:04 AM, Manhee Jo <jo...@nttdocomo.com> wrote:

> Hi all,
>
> I've succeeded in accessing HDFS files from Windows XP through a fuse-dfs
> mount shared over Samba.
> When I copied (read and write) a 1GB text file from fuse-dfs over Samba, it
> took around 50 seconds.
> Then I used "dfs -get" to copy the same file to a data node's local file
> system and copied it from the data node (without fuse-dfs this time) over
> Samba, which took around 30 seconds.
> Since the disk reads are parallel and distributed, shouldn't reading through
> fuse-dfs be faster than reading from one node?


Nope - the file is stored distributed, but a single reader (using dfs -get
or the DFSClient API from Java) won't do a parallel read from multiple
replicas. What you've seen seems about right - there's a measurable overhead
of going through the datanode compared to just using local disk.


>
> Well, I know it must depend on the file size. So here are my questions.
> What actually happens during a fuse-dfs read? And in Samba?


A fuse-dfs read uses a single connection to one datanode at a time. At the
end of each block, it connects to the DN that stores the next block and reads
from that one. At no time does it transfer in parallel from multiple replicas.
Some people have requested this as a feature, but it hasn't been a high
priority so far, for a number of reasons.

-Todd
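
The read pattern described in the reply above can be illustrated with a small
model. This is only a sketch, not the DFSClient or fuse-dfs code: the Block
class, the round-robin replica placement, the datanode names (dn1..dn4), the
64 MB block size, the replication factor of 3, and the reading of "1GB" as
1024 MB are all assumptions made for illustration. It shows that a single
reader visits one datanode per block, one block after another, and it works
out the end-to-end throughput implied by the 50 second and 30 second
measurements from the original post.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    index: int
    size_mb: int
    replicas: List[str]  # hypothetical datanode names holding a copy of this block


def make_file(total_mb: int, block_mb: int, datanodes: List[str], replication: int) -> List[Block]:
    """Split a file into blocks and place replicas round-robin (illustrative placement only)."""
    blocks = []
    offset = 0
    index = 0
    while offset < total_mb:
        size = min(block_mb, total_mb - offset)
        replicas = [datanodes[(index + r) % len(datanodes)] for r in range(replication)]
        blocks.append(Block(index, size, replicas))
        offset += size
        index += 1
    return blocks


def sequential_read_plan(blocks: List[Block]) -> List[Tuple[int, str]]:
    """One connection at a time: read each block in full from a single replica, then move on."""
    return [(b.index, b.replicas[0]) for b in blocks]


def effective_mb_per_s(total_mb: float, seconds: float) -> float:
    """End-to-end throughput implied by a measured copy time."""
    return total_mb / seconds


# Assumed layout: 1024 MB file, 64 MB blocks, 4 datanodes, 3 replicas per block.
blocks = make_file(total_mb=1024, block_mb=64, datanodes=["dn1", "dn2", "dn3", "dn4"], replication=3)
plan = sequential_read_plan(blocks)
print(len(plan), "blocks, read one after another")
print(plan[:3])

# Copy times reported in the original post.
print("via fuse-dfs + samba: %.1f MB/s" % effective_mb_per_s(1024, 50))
print("local disk + samba:   %.1f MB/s" % effective_mb_per_s(1024, 30))
```

Under these assumptions the two measurements work out to roughly 20 MB/s
through fuse-dfs and roughly 34 MB/s from local disk. Because the plan never
has more than one block in flight, adding datanodes or replicas to the model
does not change the time a single reader needs.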