Posted to common-user@hadoop.apache.org by Albert Strasheim <fu...@gmail.com> on 2008/02/11 23:12:30 UTC

Using Hadoop as a shared file system

Hello all

I have a slightly unusual use case that I hope I can use Hadoop for (maybe 
after writing a bit of code).

My setup looks as follows:

I have one machine containing somewhere from 750 GB to 2 TB of data on a 
disk or some kind of RAID-0 array. I am in full control of this machine. I 
have a backup of the part of the data that I didn't generate myself (maybe 
30% of the data; the rest is calculated from this original data).

I have another disparate collection of machines running any number of 
operating systems (Windows, Linux, Solaris, etc.). I might not have 
administrative access on these machines. I can probably run a JVM on them. 
These machines should not be considered reliable in any sense of the word. 
Hosting an HDFS on them would be a bad idea -- on a bad day all of them 
might be down, they might get reformatted, you name it.

I need to process this data using legacy applications (e.g., C++ programs, 
MATLAB scripts, Python scripts) and Java applications.

The legacy applications typically perform operations that would be hard to 
parallelise across the other machines (e.g. a program that can't easily be 
compiled on all the different machines, MATLAB licenses not being available 
for all the machines, etc.), so I would like to run these legacy apps only 
on the machine that has direct access to the data. I would also prefer not 
to have to teach these legacy apps how to get data out of HDFS.

Through the magic of the JVM, I can consider parallelising the processing 
of the data that is done with Java programs.

These Java programs (parallelised with Hadoop's MapReduce, or GridGain, or 
whatever) need an easy way to access the data on this single machine.

Setting up a traditional shared file system (NFS and Samba come to mind) 
would be a pain for various reasons. A shared file system probably wouldn't 
be a bottleneck, since the amount of processing time required typically far 
outweighs the time it takes to access the data over the network (at least 
for the number of nodes I'm dealing with).

What I'm hoping to do is use Hadoop as my shared file system.

I would imagine that I would have to run the equivalent of a 
namenode+datanode+etc. on the machine that has direct access to the data so 
that it appears as an HDFS to the other machines. Making a copy of the data 
into an HDFS instance isn't really an option, due to the size of the data 
I'm dealing with, so I'm thinking along the lines of exposing a 
LocalFileSystem on this machine as a DistributedFileSystem to the other 
machines.

This has the added advantage that if I ever do get a stable HDFS setup 
going, my programs will be ready to deal with it.
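
To make that concrete, here is a rough sketch (untested, and the paths and 
host name are just placeholders for my setup) of the kind of client code I 
have in mind, written against Hadoop's FileSystem abstraction rather than 
java.io.File. The idea is that the same code reads from a LocalFileSystem 
today and from a real DistributedFileSystem later, with only the URI scheme 
changing:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadThroughFileSystem {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Today: file:///data/set1/input.txt on the machine with the disks.
            // Later: hdfs://datahost:9000/data/set1/input.txt, no code change.
            Path path = new Path(args[0]);

            // The concrete FileSystem (LocalFileSystem, DistributedFileSystem,
            // ...) is picked from the URI scheme; the rest of the code doesn't
            // care which one it got.
            FileSystem fs = path.getFileSystem(conf);

            FSDataInputStream in = fs.open(path);
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
            reader.close();
        }
    }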

Is something like this doable already? If not, where would one need to start 
filling in code to make it possible?

Thanks for reading.

Regards,

Albert


Re: Using Hadoop as a shared file system

Posted by Albert Strasheim <fu...@gmail.com>.
Hello,

I'm not exactly sure how to apply WebDAV to the scenario I outlined.
As I said, I would like to avoid having to duplicate my data inside an
HDFS. Once I have that, I guess I could expose it as a WebDAV service
when HADOOP-496 gets done.

If you mean a WebDAV setup outside of Hadoop, that is something I
could possibly consider, but I would like to reuse the Hadoop
FileSystem interface if possible. I'd prefer not to have to integrate
a WebDAV server and client into my Java code if I can just reuse work
from Hadoop.
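
For what it's worth, my (possibly mistaken) understanding is that Hadoop 
picks the FileSystem implementation for a URI scheme from an 
fs.<scheme>.impl configuration property, the way fs.hdfs.impl and 
fs.file.impl are set in hadoop-default.xml. So if a WebDAV-backed 
FileSystem ever came out of HADOOP-496, I'd hope it could be plugged in 
underneath the same FileSystem-based code, along these lines (the scheme 
and class name below are purely hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WebDavSchemeSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Hypothetical: neither this property value nor the class exists
            // today; this only shows where such an implementation would hook in.
            conf.set("fs.webdav.impl", "org.example.WebDavFileSystem");

            Path path = new Path("webdav://datahost/data/set1/input.txt");
            FileSystem fs = path.getFileSystem(conf);
            System.out.println("Resolved to: " + fs.getClass().getName());
        }
    }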

Comments?

Regards,

Albert

On Feb 12, 2008 12:44 AM, Fernando Padilla <fe...@alum.mit.edu> wrote:
> Have you put any thought into Webdav?  Or did you write that off as well?

Re: Using Hadoop as a shared file system

Posted by Fernando Padilla <fe...@alum.mit.edu>.
Have you put any thought into Webdav?  Or did you write that off as well?

