
Posted to user@nutch.apache.org by robertito <ro...@gmail.com> on 2011/07/12 14:46:28 UTC

Re: Crawl fails - Input path does not exist

Hi,

I'm a beginner using Nutch 1.3 on Windows 7 with Cygwin and followed the
tutorial:

http://wiki.apache.org/nutch/NutchTutorial

I'm trying to crawl wikipedia.org as a start and am running into a similar
problem: the segments/content path does not exist. The path really is
missing (nothing got fetched).

Where do I have to adjust the disk space of my temporary directory?

Just something else: There are two conf directories in Nutch's distribution.
Which one is used? I'm updating the configuration files in both of them.

Thank you!
Regards,
Robert

Crawl Trace:

$ runtime/local/bin/nutch crawl urls -solr http://127.0.0.1:8983/solr -dir crawl -depth 8 -topN 50000 -threads 16
crawl started in: crawl
rootUrlDir = urls
threads = 16
depth = 8
solrUrl=http://127.0.0.1:8983/solr
topN = 50000
Injector: starting at 2011-07-12 14:34:16
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-12 14:34:19, elapsed: 00:00:03
Generator: starting at 2011-07-12 14:34:19
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 50000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20110712143422
Generator: finished at 2011-07-12 14:34:23, elapsed: 00:00:04
Fetcher: starting at 2011-07-12 14:34:23
Fetcher: segment: crawl/segments/20110712143422
Fetcher: threads: 16
QueueFeeder finished: total 1 records + hit by time limit :0
fetching http://www.wikipedia.org/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-07-12 14:34:26, elapsed: 00:00:02
ParseSegment: starting at 2011-07-12 14:34:26
ParseSegment: segment: crawl/segments/20110712143422
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/tools/nutch-1.3/crawl/segments/20110712143422/content
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:137)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
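
The parse step reads the segment's content directory, so it can help to check what the fetcher actually wrote before parsing starts. A minimal Python sketch, assuming the usual Nutch 1.x segment layout in which a fetched segment holds crawl_generate, crawl_fetch and content subdirectories; the helper name missing_parts is made up for illustration and is not part of Nutch:

```python
from pathlib import Path

# Assumption: subdirectories a Nutch 1.x segment normally contains after the
# generate and fetch steps, before parsing.
FETCH_OUTPUTS = ("crawl_generate", "crawl_fetch", "content")


def missing_parts(segment_dir, expected=FETCH_OUTPUTS):
    """Return the expected subdirectories that are absent from a segment."""
    segment = Path(segment_dir)
    return [name for name in expected if not (segment / name).is_dir()]


if __name__ == "__main__":
    # Segment path taken from the crawl trace above.
    print(missing_parts("crawl/segments/20110712143422"))
```

If this reports content as missing while crawl_fetch is present, the fetcher ran but did not store page content, which would be consistent with the exception in the trace.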


--
View this message in context: http://lucene.472066.n3.nabble.com/Crawl-fails-Input-path-does-not-exist-tp996823p3162299.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Crawl fails - Input path does not exist

Posted by robertito <ro...@gmail.com>.

Hi Lewis,

Thanks for your reply!

I followed the tutorial, but then changed a few things because the tutorial
was updated for version 1.3 during the day. Could that be it?

Anyway, I went back over the error after your reply and finally managed to
solve it by changing the value of the fetcher.store.content property from
false to true. I think that is why the content directory was not created.
I had this value set to false because I took a configuration file from
another site (I got a bit confused by the missing crawl-urlfilter.txt file
and searched through many different tutorials).
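
A quick way to spot this kind of leftover setting is to read the property straight out of the configuration file. The following is only a sketch in Python: the XML string is a made-up excerpt in the Hadoop-style property format that Nutch configuration files use, and get_property and content_is_stored are hypothetical helper names, not Nutch APIs.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of a nutch-site.xml with the setting that caused the
# problem described above. A real file would be read from the conf directory.
SAMPLE_SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fetcher.store.content</name>
    <value>false</value>
  </property>
</configuration>
"""


def get_property(xml_text, name):
    """Return the value of a named property, or None if it is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def content_is_stored(xml_text):
    """True/False if the file sets fetcher.store.content, None if it does not."""
    value = get_property(xml_text, "fetcher.store.content")
    return None if value is None else value.lower() == "true"


if __name__ == "__main__":
    print(content_is_stored(SAMPLE_SITE_XML))
```

A result of None only means this particular file does not override the property; whatever value Nutch's other configuration files define would then apply.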

Thanks.
Regards,
Robert


Re: Crawl fails - Input path does not exist

Posted by lewis john mcgibbney <le...@gmail.com>.

Hi Robertito,

Please refer to the following tutorial for the current 1.3 release.

It should all be pretty straightforward; however, if anything is causing
confusion, please post back.




-- 
*Lewis*