Posted to dev@nutch.apache.org by "Ian C. Blenke" <ic...@nks.net> on 2005/08/30 16:22:26 UTC

Another NDFS question

Egor Chernodarov wrote:

>Hello!
>
>I want to test NDFS on my nutch installation, but I have some problem.
>I have started from wiki, where is quick demo for NDFS:
>http://wiki.apache.org/nutch/NutchDistributedFileSystem
>  
>
Would there be any interest in a FUSE (well FUSE-J) or FiST system level 
filesystem presentation?

I've written CornFS to solve an internal cluster storage problem, but 
NDFS looks like it would address the distributed archival problem with 
an eye toward retrieval. 
(http://ian.blenke.com/blog/projects/cornfs/cornfs.html)

As Lucene/Apache are more multi-platform, something akin to a WebDAV 
backend might be more appropriate.

When NDFS is exposed to userspace for scripts to use, admin types will 
embrace it for managing the cluster.

It might not be a focus now, but it seems to be low-hanging fruit 
that would only help the project.

 - Ian C. Blenke <ic...@nks.net> <ia...@blenke.com> http://ian.blenke.com



Re: Another NDFS question

Posted by Doug Cutting <cu...@nutch.org>.
Ian C. Blenke wrote:
>> The only somewhat complicated thing would be directory listings.  
>> These would be handled with a simple REST interface, where some simple 
>> XML is returned.  Ideally a stylesheet could be specified so that one 
>> can use the directory listing url to view the filesystem from a browser.
> 
> From a bash scripting standpoint, this would be complicated to access 
> without a userspace command to wrap it.

Good point.  WebDAV has cadaver for shell access, so maybe WebDAV 
is the way to go.

> A simple WebDAV interface seems like the closest thing to a standard 
> that you are attempting to approximate with the RESTful interface. The 
> added benefit would be support from DavFS2, Finder, Microsoft 
> Webfolders, etc.
> 
> Perhaps something that plugs into Jakarta Slide? An NDFS backend to Slide 
> would potentially benefit a distributed CMS as well (without a 
> versioning history, as that appears to be beyond the scope of NDFS).
> 
> I would be interested in implementing something like this if there is 
> indeed interest.

That would be great!

NDFS is designed to reliably and efficiently support very large data 
collections.  It is not designed to be a full-featured replacement for 
desktop filesystems, but rather is a lean-and-mean storage system for 
distributed computations.  Its primary users are developers and system 
administrators.  Such folks don't require fancy graphical user 
interfaces, but they are a nice bonus.  Programmatic access from 
non-Java is also a goal.  Easy publishing from, e.g., web authoring 
tools is not a goal.

WebDAV looks to me to meet these needs without too much baggage.  It may 
encourage non-target audiences to use NDFS, but we can deal with that as 
a documentation issue.  For example, sophisticated versioning, security 
and permission systems are outside the scope of NDFS.

Doug

Re: Another NDFS question

Posted by "Ian C. Blenke" <ic...@nks.net>.
Doug Cutting wrote:

> Ian C. Blenke wrote:
>
>> When NDFS is exposed to userspace for scripts to use, admin types 
>> will embrace it for managing the cluster.
>
> Our intent is to add some servlets which run on each datanode 
> providing access to the filesystem for non-Java programs.
>
> Most operations would be quite simple, e.g.:
>
> - to write a file, post its content to a url like:
>   http://datanode:XXXX/write?name=my.file
>
> - to read a file, get file content from urls like:
>   http://datanode:XXXX/read?name=my.file
>   http://datanode:XXXX/read?name=my.file&start=2048&length=1024
>
> - to remove a file:
>   http://datanode:XXXX/remove?name=my.file
> Similarly for rename, copy, etc.

Not very RESTful, but simple.

> The only somewhat complicated thing would be directory listings.  
> These would be handled with a simple REST interface, where some simple 
> XML is returned.  Ideally a stylesheet could be specified so that one 
> can use the directory listing url to view the filesystem from a browser.

From a bash scripting standpoint, this would be complicated to access 
without a userspace command to wrap it.

A RESTish interface works well for perl/python/ruby, though I think they 
would much rather have a native object wrapper (SWIG something together).

> These servlets could easily be implemented in terms of the 
> NutchFileSystem API, and deployed with Jetty.  To my knowledge, no one 
> is currently working on this.  A volunteer would be welcome.

If portability is a key goal, FUSE or FiST probably aren't ideal (no 
Windows or OS X ports, for example).

A simple WebDAV interface seems like the closest thing to a standard 
that you are attempting to approximate with the RESTful interface. The 
added benefit would be support from DavFS2, Finder, Microsoft 
Webfolders, etc.
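
For comparison, a WebDAV directory listing is a PROPFIND request with a
"Depth: 1" header, answered with a multistatus XML document in the DAV:
namespace. The Python below is only a sketch: the host, port and paths are
placeholders, the sample reply is hand-written, and there is no NDFS WebDAV
server to test it against.

```python
import xml.etree.ElementTree as ET
from urllib.request import Request

DAV = "{DAV:}"  # ElementTree's notation for the WebDAV XML namespace


def propfind_request(url):
    """Build a PROPFIND that lists a collection and its immediate children."""
    body = (b'<?xml version="1.0" encoding="utf-8"?>'
            b'<propfind xmlns="DAV:"><prop><getcontentlength/></prop></propfind>')
    headers = {"Depth": "1", "Content-Type": "application/xml"}
    return Request(url, data=body, headers=headers, method="PROPFIND")


def parse_multistatus(xml_text):
    """Map each href in a multistatus reply to its content length, if any."""
    result = {}
    for resp in ET.fromstring(xml_text).findall(DAV + "response"):
        href = resp.findtext(DAV + "href")
        length = resp.findtext(f"{DAV}propstat/{DAV}prop/{DAV}getcontentlength")
        result[href] = int(length) if length else None
    return result


# Hand-written sample reply, not captured from a real server.
SAMPLE = """<multistatus xmlns="DAV:">
  <response><href>/ndfs/</href>
    <propstat><prop/><status>HTTP/1.1 200 OK</status></propstat></response>
  <response><href>/ndfs/my.file</href>
    <propstat><prop><getcontentlength>4096</getcontentlength></prop>
      <status>HTTP/1.1 200 OK</status></propstat></response>
</multistatus>
"""
```

Existing clients such as cadaver or DavFS2 already issue and parse these
requests, which is the main argument for not inventing a new listing format.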

Perhaps something that plugs into Jakarta Slide? An NDFS backend to Slide 
would potentially benefit a distributed CMS as well (without a 
versioning history, as that appears to be beyond the scope of NDFS).

I would be interested in implementing something like this if there is 
indeed interest.

- Ian C. Blenke <ic...@nks.net> <ia...@blenke.com> http://ian.blenke.com/



Re: [Nutch-dev] Re: Another NDFS question

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
What you've just described, Doug, is WebDAV!   There is an  
implementation of it built into Tomcat, but a more full-featured  
version is Slide - http://jakarta.apache.org/slide/ .

There is also a JSR (#170) for a content repository, being implemented  
open-source as Jackrabbit:
http://incubator.apache.org/projects/jackrabbit.html

Apache's mod_dav is also well worth mentioning, as it is extensible  
and surely quite fast.

I'm not sure how well any of these that I've mentioned jibe with the  
goals of NDFS.  I have done a fair bit of homework on WebDAV in the  
past, once even implementing a prototype server before Slide was viable.

     Erik



On Aug 30, 2005, at 12:08 PM, Doug Cutting wrote:

> Ian C. Blenke wrote:
>
>> When NDFS is exposed to userspace for scripts to use, admin types  
>> will embrace it for managing the cluster.
>>
>
> Our intent is to add some servlets which run on each datanode  
> providing access to the filesystem for non-Java programs.
>
> Most operations would be quite simple, e.g.:
>
> - to write a file, post its content to a url like:
>   http://datanode:XXXX/write?name=my.file
>
> - to read a file, get file content from urls like:
>   http://datanode:XXXX/read?name=my.file
>   http://datanode:XXXX/read?name=my.file&start=2048&length=1024
>
> - to remove a file:
>   http://datanode:XXXX/remove?name=my.file
>
> Similarly for rename, copy, etc.
>
> The only somewhat complicated thing would be directory listings.   
> These would be handled with a simple REST interface, where some  
> simple XML is returned.  Ideally a stylesheet could be specified so  
> that one can use the directory listing url to view the filesystem  
> from a browser.
>
> These servlets could easily be implemented in terms of the  
> NutchFileSystem API, and deployed with Jetty.  To my knowledge, no  
> one is currently working on this.  A volunteer would be welcome.
>
> Doug


Re: Another NDFS question

Posted by Doug Cutting <cu...@nutch.org>.
Ian C. Blenke wrote:
> When NDFS is exposed to userspace for scripts to use, admin types will 
> embrace it for managing the cluster.

Our intent is to add some servlets which run on each datanode providing 
access to the filesystem for non-Java programs.

Most operations would be quite simple, e.g.:

- to write a file, post its content to a url like:
   http://datanode:XXXX/write?name=my.file

- to read a file, get file content from urls like:
   http://datanode:XXXX/read?name=my.file
   http://datanode:XXXX/read?name=my.file&start=2048&length=1024

- to remove a file:
   http://datanode:XXXX/remove?name=my.file

Similarly for rename, copy, etc.
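
As a rough sketch of how a non-Java client might use the urls above, the
following Python builds the three requests without sending them. The host
"datanode" and port 8080 are placeholders (the port is left as XXXX above),
and removal is assumed to be a plain GET since no method is named. None of
this is an existing NDFS interface; the servlets are only a proposal.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder base URL: the proposal leaves the datanode port unspecified.
BASE = "http://datanode:8080"


def op_url(op, name, **params):
    """Build a URL such as http://datanode:8080/read?name=my.file&start=2048."""
    query = {"name": name}
    query.update(params)
    return f"{BASE}/{op}?{urlencode(query)}"


def write_request(name, content):
    """POST the file content (bytes) to the write URL."""
    return Request(op_url("write", name), data=content, method="POST")


def read_request(name, start=None, length=None):
    """GET the whole file, or a byte range when start and length are given."""
    params = {}
    if start is not None:
        params["start"] = start
    if length is not None:
        params["length"] = length
    return Request(op_url("read", name, **params), method="GET")


def remove_request(name):
    """Assumption: removal is a plain GET; the proposal does not name a method."""
    return Request(op_url("remove", name), method="GET")
```

Passing any of these to urllib.request.urlopen would send it, once a server
implementing the servlets exists.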

The only somewhat complicated thing would be directory listings.  These 
would be handled with a simple REST interface, where some simple XML is 
returned.  Ideally a stylesheet could be specified so that one can use 
the directory listing url to view the filesystem from a browser.
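
No listing format has been defined, so the Python below is only a sketch
against a made-up XML layout: the "listing", "dir" and "file" element names,
their attributes, and the "listing.xsl" stylesheet are all hypothetical. It
shows how a script could consume such a response, and how an xml-stylesheet
processing instruction would let a browser render the same document.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body; NDFS defines no such format.
SAMPLE = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="listing.xsl"?>
<listing path="/user/doug">
  <dir name="segments"/>
  <file name="my.file" length="4096"/>
  <file name="crawl.log" length="1024"/>
</listing>
"""


def parse_listing(xml_text):
    """Return (directory names, {file name: length in bytes})."""
    root = ET.fromstring(xml_text)
    dirs = [d.get("name") for d in root.findall("dir")]
    files = {f.get("name"): int(f.get("length")) for f in root.findall("file")}
    return dirs, files


dirs, files = parse_listing(SAMPLE)
print(dirs)
print(files)
```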

These servlets could easily be implemented in terms of the 
NutchFileSystem API, and deployed with Jetty.  To my knowledge, no one 
is currently working on this.  A volunteer would be welcome.

Doug