Posted to user@nutch.apache.org by "Joe Reger, Jr." <jo...@joereger.com> on 2005/05/12 15:56:59 UTC

Nutch Control via Java with no Command Line?

First of all, thanks to everybody involved in Nutch.  It looks wonderful and
I can't wait to apply what you've done.
 
Is it possible to run and control Nutch completely within Tomcat 5.0.28 and
Java 1.4.2 using no command line? 
 
In other words, I'd like to avoid using the command line and instead call
the java classes directly on a scheduled or user-controlled basis from
Tomcat.  From what I see in bin/nutch I should be able to replace the
command:
 
bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log
 
with something like:
 
net.nutch.tools.CrawlTool crawlTool = new net.nutch.tools.CrawlTool();
String[] args = new String[7];
args[0] = "urls";
args[1] = "-dir";
args[2] = "crawl.test";
args[3] = "-depth";
args[4] = "3";
args[5] = ">&";
args[6] = "crawl.log";
crawlTool.main(args);
 
Is this possible?  Is this smart?  What sort of issues will arise if I try
to run everything from Tomcat/Java?
 
Thanks,
 
Joe Reger

Re: Nutch Control via Java with no Command Line?

Posted by Andrzej Bialecki <ab...@getopt.org>.

Joe Reger, Jr. wrote:

> In other words, I'd like to avoid using the command line and instead call
> the java classes directly on a scheduled or user-controlled basis from
> Tomcat.  From what I see in bin/nutch I should be able to replace the
> command:
>  
> bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log
>  
> with something like:
>  
> net.nutch.tools.CrawlTool crawlTool = new net.nutch.tools.CrawlTool();
> String[] args = new String[7];
> args[0] = "urls";
> args[1] = "-dir";
> args[2] = "crawl.test";
> args[3] = "-depth";
> args[4] = "3";
> args[5] = ">&";
> args[6] = "crawl.log";
> crawlTool.main(args);
>  
> Is this possible?  Is this smart?  What sort of issues will arise if I try
> to run everything from Tomcat/Java?

First of all, it's not only perfectly possible, it's actually how the 
CrawlTool itself is implemented - please take a look at CrawlTool.main ...
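
One detail in the quoted snippet is worth flagging: ">&" and "crawl.log" are 
shell redirection, handled by the shell before the JVM starts, so a main 
method would see them as two ordinary (and probably unwanted) arguments. 
Below is a minimal sketch of an in-process call that drops them and points 
System.out/System.err at a log file instead. The names InProcessCrawl, 
crawlArgs, runMain and EchoTool are hypothetical; EchoTool only stands in 
for a tool class such as net.nutch.tools.CrawlTool so that the sketch runs 
without Nutch on the classpath. It is written for a current JDK rather than 
Java 1.4.2.

```java
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.lang.reflect.Method;

public class InProcessCrawl {

    /** Hypothetical stand-in for a tool class; it only echoes its arguments. */
    public static class EchoTool {
        public static void main(String[] args) {
            System.out.println(String.join(" ", args));
        }
    }

    /** Same arguments as "crawl urls -dir crawl.test -depth 3", minus the shell redirect. */
    static String[] crawlArgs(String urlFile, String dir, int depth) {
        return new String[] { urlFile, "-dir", dir, "-depth", String.valueOf(depth) };
    }

    /** Calls className.main(args) with stdout and stderr appended to logFile. */
    static void runMain(String className, String[] args, String logFile) throws Exception {
        PrintStream oldOut = System.out;
        PrintStream oldErr = System.err;
        PrintStream log = new PrintStream(new FileOutputStream(logFile, true), true);
        try {
            // Note: setOut/setErr affect the whole JVM, not only this call.
            System.setOut(log);
            System.setErr(log);
            Method main = Class.forName(className).getMethod("main", String[].class);
            main.invoke(null, (Object) args); // static method, so no instance is needed
        } finally {
            System.setOut(oldOut);
            System.setErr(oldErr);
            log.close();
        }
    }
}
```

With Nutch on the classpath, the call would presumably look like 
runMain("net.nutch.tools.CrawlTool", crawlArgs("urls", "crawl.test", 3), "crawl.log"). 
Two caveats under these assumptions: the stream swap is visible to every 
webapp in the same Tomcat instance, and a tool that logs through a logging 
framework or calls System.exit will not be contained by it.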

The issues... Well, you need to keep in mind that most Nutch processing 
tasks consume a lot of resources, so if you run a task in the same JVM 
instance as the whole app server, then you can exhaust some resource 
(file handles, heap space, cpu/io, etc) and starve other applications 
that run on the same JVM.
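
If sharing the app server's JVM turns out to be a problem, one option that 
still avoids a hand-typed command line is to have the webapp start the tool 
in a child JVM with its own heap limit. The following is only a sketch 
under assumptions: ChildJvmCrawl, buildCommand and run are hypothetical 
names, the classpath string is a placeholder for wherever the Nutch jars 
and configuration live, and ProcessBuilder output redirection needs Java 7 
or later.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChildJvmCrawl {

    /** Builds "java -Xmx<maxHeap> -cp <classpath> <mainClass> <toolArgs...>". */
    static List<String> buildCommand(String classpath, String maxHeap,
                                     String mainClass, String... toolArgs) {
        // Reuse the java binary of the JVM that is currently running.
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> cmd = new ArrayList<String>();
        cmd.add(javaBin);
        cmd.add("-Xmx" + maxHeap);   // heap cap applies to the child only
        cmd.add("-cp");
        cmd.add(classpath);
        cmd.add(mainClass);
        cmd.addAll(Arrays.asList(toolArgs));
        return cmd;
    }

    /** Runs the command in workDir, appends its output to logFile, returns the exit code. */
    static int run(List<String> command, File workDir, File logFile)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.directory(workDir);
        pb.redirectErrorStream(true);                                 // merge stderr into stdout
        pb.redirectOutput(ProcessBuilder.Redirect.appendTo(logFile)); // plays the role of ">& crawl.log"
        return pb.start().waitFor();
    }
}
```

A call such as buildCommand("<nutch classpath>", "512m", 
"net.nutch.tools.CrawlTool", "urls", "-dir", "crawl.test", "-depth", "3") 
followed by run(...) would keep the crawl's heap and file handles out of 
Tomcat's process, at the cost of JVM startup time and of having to manage 
the child process's lifetime yourself. The 512m figure is an arbitrary 
example, not a recommendation.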

-- 
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com