Posted to dev@nutch.apache.org by René Treffer <tr...@in.tum.de> on 2006/08/16 19:25:37 UTC
Nutch, samba and urls...
Hi,
I've just written a protocol-smb plugin; it's really simple (code attached).
It uses the jcifs lib and seems to work - but there is some stuff I'd
like to discuss...
Nutch is glued to java.net.URL, which works if you write a
URLStreamHandler. No problem so far, but you can't install a
URLStreamHandler everywhere - have a look at the jcifs FAQ
( http://jcifs.samba.org/src/docs/faq.html ). Most important: it won't
work in your war - so protocol plugins will be useless in a web context!
That might cause a lot of trouble.
Moreover Nutch will never be able to handle \\192.168.0.1\ correctly
with URL....
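
To illustrate the point, here is a minimal sketch, assuming that parsing alone is what is needed: java.net.URI can represent an smb:// address without any registered stream handler, while java.net.URL rejects an unknown protocol. The class name SmbUris and the helper fromUnc are hypothetical and are not part of Nutch or jcifs.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper, not part of Nutch or jcifs.
final class SmbUris {

    /** Converts a UNC path such as \\host\share\dir into an smb:// URI. */
    public static URI fromUnc(String unc) throws URISyntaxException {
        if (!unc.startsWith("\\\\") || unc.length() < 3) {
            throw new IllegalArgumentException("not a UNC path: " + unc);
        }
        // Drop the leading \\ and switch to forward slashes.
        String rest = unc.substring(2).replace('\\', '/');
        int slash = rest.indexOf('/');
        String host = slash < 0 ? rest : rest.substring(0, slash);
        String path = slash < 0 ? "/" : rest.substring(slash);
        // The multi-argument constructor percent-encodes illegal
        // characters (e.g. spaces) in the path.
        return new URI("smb", host, path, null);
    }

    public static void main(String[] args) throws Exception {
        URI uri = fromUnc("\\\\192.168.0.1\\");
        System.out.println(uri.getScheme() + " / " + uri.getHost() + " / " + uri.getPath());
        try {
            // Fails unless a stream handler for "smb" has been registered.
            new URL(uri.toString());
            System.out.println("an smb handler is installed");
        } catch (MalformedURLException e) {
            System.out.println("java.net.URL rejected it: " + e.getMessage());
        }
    }
}
```

This only covers parsing; actually fetching content would still need jcifs or a similar library behind it.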
Converting directories into HTML lists sucks. And reproducing the code is
even worse. Perhaps a virtual mime-type could be added (e.g.
"nutch/dir"). Almost forgotten: tell me how I should index files with "
and ' in their names (currently I check for ' and change the href
quotes). Same problem for file://
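
One possible way around the quoting problem, sketched under the assumption that the directory listing is emitted as HTML links: percent-encode each file name before it goes into the href, so that neither " nor ' appears literally. The class DirListing and its methods are hypothetical names, not code from the attached plugin.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not code from the attached plugin.
final class DirListing {

    /** Percent-encodes a single path segment (a file name, no slashes). */
    public static String encodeSegment(String name) {
        try {
            // URLEncoder produces form encoding, where a space becomes '+'.
            // A literal '+' is already encoded as %2B at this point, so every
            // remaining '+' stands for a space and can be rewritten as %20.
            return URLEncoder.encode(name, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    /** Escapes the characters that are special in HTML text and attributes. */
    public static String escapeHtml(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    /** Builds one listing entry; the href never contains a raw quote. */
    public static String link(String name) {
        return "<a href=\"" + encodeSegment(name) + "\">" + escapeHtml(name) + "</a>";
    }

    public static void main(String[] args) {
        System.out.println(link("it's \"x\".mp3"));
    }
}
```

With this, the href can always use double quotes, and the parser on the other side only has to percent-decode the segment to get the original name back.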
Most protocols are not mime-type aware (e.g. file:// - indexed my mp3
collection with the text parser, great fun!). I've added a simple
mime-type guess, but this shouldn't be part of the protocol handler.
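
For illustration only, a simple extension-based guess of the kind described above might look like the sketch below. It is not the code from the attachment; the class name ExtensionMimeGuesser and the small table of types are assumptions, and a real implementation would want a much larger table or content sniffing.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not the code from the attachment.
final class ExtensionMimeGuesser {

    private static final String FALLBACK = "application/octet-stream";
    private static final Map<String, String> TYPES = new HashMap<>();

    static {
        // Deliberately tiny table, for illustration only.
        TYPES.put("txt", "text/plain");
        TYPES.put("html", "text/html");
        TYPES.put("pdf", "application/pdf");
        TYPES.put("mp3", "audio/mpeg");
    }

    /** Guesses a mime-type from the file extension, case-insensitively. */
    public static String guess(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return FALLBACK; // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TYPES.getOrDefault(ext, FALLBACK);
    }

    public static void main(String[] args) {
        System.out.println(guess("song.MP3"));
        System.out.println(guess("README"));
    }
}
```

Falling back to a binary type rather than text/plain keeps unknown files away from the text parser.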
Anyway, feel free to use the smb code, it's rather simple/basic.
There is still a multithreading issue left :( but the very basic
crawling process seems to work (-threads 1). I've not yet tested the
generated index (= I've not yet indexed my hd and I've not yet tried to
search).
I've added the Apache header, hope this is ok.
Re: Nutch, samba and urls...
Posted by Sami Siren <ss...@gmail.com>.
Hi,
Could you please submit a JIRA issue and attach this (or perhaps the
diff for the whole plugin, excluding the jcifs .jar because it is LGPL)
to it?
René Treffer wrote:
> Hi,
>
> I've just written a protocol-smb plugin; it's really simple (code attached).
> It uses the jcifs lib and seems to work - but there is some stuff I'd
> like to discuss...
>
> Nutch is glued to java.net.URL, which works if you write a
> URLStreamHandler. No problem so far, but you can't install a
> URLStreamHandler everywhere - have a look at the jcifs FAQ
> ( http://jcifs.samba.org/src/docs/faq.html ). Most important: it won't
> work in your war - so protocol plugins will be useless in a web context!
> That might cause a lot of trouble.
> Moreover Nutch will never be able to handle \\192.168.0.1\ correctly
> with URL....
Perhaps a custom URL parser could do the job here (Nutch currently uses
the URL class only for parsing URLs). I have seen custom implementations,
at least in Tomcat, which we could perhaps borrow and extend if required.
>
> Converting directories into HTML lists sucks. And reproducing the code is
> even worse. Perhaps a virtual mime-type could be added (e.g.
> "nutch/dir"). Almost forgotten: tell me how I should index files with "
> and ' in their names (currently I check for ' and change the href
> quotes). Same problem for file://
There could perhaps be a different crawler implementation to crawl the
local filesystem and these shared Windows resources (and perhaps WebDAV
too) efficiently.
--
Sami Siren