Posted to user@nutch.apache.org by Max Stricker <st...@gmail.com> on 2012/01/15 17:15:54 UTC

Start crawl from Java without bin/nutch script

Hi mailing list,

I currently need to start the Nutch crawl process from Java, as it should be accessible through a web app.
I figured out that calling Crawl.main() with the right parameters should be the right way, since this is
also what the nutch script does.
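
For reference, my invocation looks roughly like this (a minimal sketch; the CrawlStarter
wrapper class is just illustrative, and the arguments mirror the log output below):

import org.apache.nutch.crawl.Crawl;

public class CrawlStarter {
    public static void main(String[] args) throws Exception {
        String[] crawlArgs = {
            "/cygdrive/c/server/nutch/urls",           // seed URL directory
            "-dir", "/cygdrive/c/server/nutch/crawl",  // crawl output directory
            "-solr", "http://localhost:8983/solr/",    // Solr instance to index into
            "-threads", "1",
            "-depth", "1",
            "-topN", "10"
        };
        Crawl.main(crawlArgs);  // same entry point the nutch script invokes
    }
}
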
However I get an exception I cannot solve:

crawl started in: /cygdrive/c/server/nutch/crawl
rootUrlDir = /cygdrive/c/server/nutch/urls
threads = 1
depth = 1
indexer=solr
solrUrl=http://localhost:8983/solr/
topN = 10
Injector: starting at 2012-01-15 16:51:44
Injector: crawlDb: /cygdrive/c/server/nutch/crawl/crawldb
Injector: urlDir: /cygdrive/c/server/nutch/urls
Injector: Converting injected urls to crawl db entries.
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/cygdrive/c/server/nutch/urls
 at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:232)
 at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:252)
 at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:428)
 at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:420)
 at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:338)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:960)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:534)
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:779)
 at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
 at testapp.MyTest.main(MaxTest.java:33)
 at testapp.Main.main(Main.java:26)

However, /cygdrive/c/server/nutch/urls exists and contains a file holding the URLs to be crawled.
The development environment is Windows 7, where all files are under C:/server/nutch.
From my app I build a jar, put it into nutch/libs, and call it using bin/nutch testapp.Main from within Cygwin.
I call it through Cygwin because executing it directly on Windows throws an exception when the Injector
tries to perform a chmod.

Any ideas what is going wrong here?
Or is there any other way to start a full nutch cycle from within Java?
I could not find a dedicated API for that.

Regards

Re: Start crawl from Java without bin/nutch script

Posted by Cube Agen <ag...@gmail.com>.
I am using a Windows XP environment. I put the urls folder in
$NUTCH_HOME/runtime/local/bin and use the nutch command to run the crawl. That
works fine.

You may follow http://wiki.apache.org/nutch/NutchTutorial to do so.

RE: Start crawl from Java without bin/nutch script

Posted by Ar...@csiro.au.
The path should be C:/server/nutch/urls. I know this is not what you would expect from Cygwin, but it works.
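
For example, with the parameters from your log (a sketch; only the paths change):

String[] crawlArgs = {
    "C:/server/nutch/urls",            // Windows-style path instead of /cygdrive/...
    "-dir", "C:/server/nutch/crawl",
    "-solr", "http://localhost:8983/solr/",
    "-threads", "1",
    "-depth", "1",
    "-topN", "10"
};
org.apache.nutch.crawl.Crawl.main(crawlArgs);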

Regards,

Arkadi

Re: Start crawl from Java without bin/nutch script

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Mmmm, I am not using Nutch on Windows at all, don't know much about
configuring Cygwin, and really hope there is some more help out there.

The main problem here seems to be that the path
/cygdrive/c/server/nutch/urls is not being interpreted correctly.

You mention:
> where all files are in C:/server/nutch

Would this not mean that your rootUrlDir should be something like
/cygdrive/C:/server/nutch/urls?

HTH

-- 
*Lewis*