Posted to user@nutch.apache.org by Ian Reardon <ir...@gmail.com> on 2005/05/19 15:27:32 UTC

Idea for script/interface

Would this be useful to anyone?  I am new to Nutch, but from what I've
seen so far it would make my life easier.

I was thinking about having a small web interface that would run the
crawler.  This would be more for intranet crawling rather than whole
web.  It would basically have a database of the URLs you wanted to
crawl along with properties associated with each site (regex, depth,
delays etc.).  It would then monitor the log file and keep track of
the number of sites crawled, # errors, # successes and whatnot, and
display it in a nice layout/status page.
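
A minimal sketch of the log-monitoring part, in Python. The two line
patterns and the sample lines are assumptions made up for illustration,
not verified Nutch log output; they would need adjusting to whatever
the crawler actually writes.

```python
import re

# ASSUMPTION: these line shapes are hypothetical; adapt them to the real log.
FETCH_RE = re.compile(r"\bfetching (\S+)")
ERROR_RE = re.compile(r"\bfetch of (\S+) failed\b")

def tally(lines):
    """Count fetch attempts, failures, and (by difference) successes."""
    counts = {"attempted": 0, "errors": 0}
    for line in lines:
        if ERROR_RE.search(line):
            counts["errors"] += 1
        elif FETCH_RE.search(line):
            counts["attempted"] += 1
    # A fetch that was attempted and never reported as failed counts as a success.
    counts["succeeded"] = counts["attempted"] - counts["errors"]
    return counts

# Illustrative lines only, not real crawler output.
sample = [
    "fetching http://intranet.example/a",
    "fetching http://intranet.example/b",
    "fetch of http://intranet.example/b failed with: timeout",
]
print(tally(sample))
```

A status page could call tally() on the log file periodically and
render the three numbers per site.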

I find that I'll be adding and deleting regexes from the urlfilter
files all the time based on the sites I'm crawling.  It would be nice
to have all that organized in a database so that if I ever want to
recrawl a site, I just hit 2 buttons and all the regexes load up along
with the right starting URL, and it just starts to crawl.
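
As a rough sketch of the per-site database idea, the Python below
stores each site's start URL, depth, and filter rules in SQLite and
renders the rules as urlfilter text (one rule per line, prefixed with
"+" to include or "-" to exclude, ending with a catch-all "-."). The
table layout, function names, and example site are all hypothetical.

```python
import sqlite3

def open_db(path=":memory:"):
    """Open (or create) the hypothetical site-profile database."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS site "
               "(name TEXT PRIMARY KEY, start_url TEXT, depth INTEGER)")
    db.execute("CREATE TABLE IF NOT EXISTS rule "
               "(site TEXT, position INTEGER, rule TEXT)")
    return db

def save_site(db, name, start_url, depth, rules):
    """Store a site profile, replacing any earlier rules for that site."""
    for r in rules:
        if not r.startswith(("+", "-")):
            raise ValueError(f"rule must start with + or -: {r!r}")
    db.execute("INSERT OR REPLACE INTO site VALUES (?, ?, ?)",
               (name, start_url, depth))
    db.execute("DELETE FROM rule WHERE site = ?", (name,))
    db.executemany("INSERT INTO rule VALUES (?, ?, ?)",
                   [(name, i, r) for i, r in enumerate(rules)])
    db.commit()

def render_urlfilter(db, name):
    """Return urlfilter text for one site, rules in their saved order."""
    rows = db.execute("SELECT rule FROM rule WHERE site = ? ORDER BY position",
                      (name,)).fetchall()
    lines = [f"# rules for {name}"] + [row[0] for row in rows] + ["-."]
    return "\n".join(lines) + "\n"

db = open_db()
save_site(db, "intranet", "http://intranet.example/", 3,
          [r"+^http://intranet\.example/"])
print(render_urlfilter(db, "intranet"))
```

Recrawling a site would then amount to writing the rendered text to
the urlfilter file and starting the crawl from the stored start URL.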

As I am new, maybe this wouldn't be useful and there are already
processes that take care of all this?  Or maybe I'm just not using
Nutch correctly?  If this would help anyone, I am thinking about
writing it and I'll make it available to whoever wants it.

Re: Idea for script/interface

Posted by Lucas Rockwell <lu...@tsw.berkeley.edu>.
Hi Ian,

I am very new to Nutch as well, so take this for what it is.

First, I have been putting all my regex filters into the 
crawl-urlfilters file and just leaving them there even if I am only 
crawling a subset of sites, and I don't have any problems. Of course, 
if you have complicated regexes that might clash with each other then 
you would have to add/remove them.

Second, have you looked at archive.org's Heritrix 
(http://crawler.archive.org/)? It does a lot of what you are talking 
about (sans the database). I am not suggesting you use it, as the crawl
output is probably not what you want, but the program does an amazing
number of things that might give you some (more) ideas. It might be
worth your time just to check it out.

-lucas



Re: Idea for script/interface

Posted by Jérôme Charron <je...@gmail.com>.
> As I am new, maybe this wouldn't be useful and there are already
> processes that take care of all this?  Or maybe I'm just not using
> Nutch correctly?  If this would help anyone, I am thinking about
> writing it and I'll make it available to whoever wants it.
I have been planning (once I finish the language identifier
optimization and other little things) to work on such a "mini Nutch
administration interface".
I really think it's a good idea, and the first step could be to
provide a web interface that is just an editor of the
nutch-default.xml configuration file.
A good source of inspiration could be the Google Mini administration
demo (http://www.google.com/enterprise/mini/554_google_mini.html).
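
The read/modify core of such a configuration editor might look like
the Python sketch below. It assumes the file is a list of <property>
elements, each with <name> and <value> children; the sample document,
its root element, and the helper names are illustrative assumptions,
not taken from an actual nutch-default.xml.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: illustrative sample, not copied from a real configuration file.
SAMPLE = """<nutch-conf>
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>
</nutch-conf>"""

def read_properties(root):
    """Return {name: value} for every <property> element under root."""
    return {p.findtext("name"): p.findtext("value", default="")
            for p in root.iter("property")}

def set_property(root, name, value):
    """Set the value of an existing property; return False if not found."""
    for p in root.iter("property"):
        if p.findtext("name") == name:
            node = p.find("value")
            if node is None:
                # Property exists but has no <value> child yet.
                node = ET.SubElement(p, "value")
            node.text = value
            return True
    return False

root = ET.fromstring(SAMPLE)
set_property(root, "fetcher.server.delay", "1.0")
print(read_properties(root))
```

A web form could list read_properties() as editable fields and write
changes back with set_property() before saving the tree to disk.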

Jerome

-- 
http://motrech.free.fr/
http://frutch.free.fr/