Posted to user@nutch.apache.org by Christian Weiske <ch...@netresearch.de> on 2011/08/01 09:32:27 UTC

Error "Input path does not exist" when crawling

Hello,


I set up Nutch 1.3 to crawl our MediaWiki instance.
Somewhere during the crawling process I get an error that stops
everything:

---------
LinkDb: starting at 2011-08-01 09:27:51
LinkDb: linkdb: crawl-301/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801084037
LinkDb: adding segment:
[20 more of that]
file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801090707
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801083518/parse_data
Input path does not exist: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801091638/parse_data
Input path does not exist: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801091806/parse_data
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
	at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
	at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
	at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
	at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
	at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
---------


What can I do to fix this?


-- 
Best regards
Christian Weiske

Re: Error "Input path does not exist" when crawling

Posted by Dinçer Kavraal <dk...@gmail.com>.
Hi Christian,

I have been busy with problems of my own for a couple of days now and
found that the cause was something minor. What I learned from it is
that, in conf/log4j.properties, I set

log4j.logger.org.apache.nutch.crawl.Crawl=DEBUG, cmdst
log4j.logger.org.apache.nutch.crawl.Injector=DEBUG, cmdst
log4j.logger.org.apache.nutch.crawl.Generator=DEBUG, cmdst
log4j.logger.org.apache.nutch.crawl.Fetcher=DEBUG, cmdst

That way I can examine in logs/hadoop.log what is going on while I am
not watching. You may also try to inspect the traffic (headers sent and
received) with an HTTP sniffer for the URLs that run into trouble; that
is where I caught mine.
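For those logger lines to take effect, the `cmdst` appender they reference
has to be defined in the same file. How that appender is defined is not
shown in this thread, so the following is only a hypothetical sketch of
such a definition (the classes are standard log4j 1.2; the layout pattern
is an assumption):

```properties
# Hypothetical definition of the 'cmdst' appender referenced above --
# add something like this to conf/log4j.properties if no appender of
# that name exists yet.
log4j.appender.cmdst=org.apache.log4j.ConsoleAppender
log4j.appender.cmdst.layout=org.apache.log4j.PatternLayout
log4j.appender.cmdst.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n
```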

Best regards...


2011/8/3 Christian Weiske <ch...@netresearch.de>

> Hello Dinçer,
>
>
> > One more thing, will you share the stats as:
> >
> > *$ bin/nutch readdb crawl-dir/crawldb -stats*
> > *$ bin/nutch readseg -list crawl-dir/segments/**
> >
> > When I got that error, the latter list shows that one (or more)
> > segments is not finished well. But now you can see my segments seem
> > ok. What about yours?
>
> I got the problem again, and I have a segment without data:
>
> $ bin/nutch readdb crawl/crawldb -stats
> CrawlDb statistics start: crawl/crawldb
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls:     1915
> retry 0:        1911
> retry 2:        4
> min score:      0.0
> avg score:      0.0013519583
> max score:      1.056
> status 1 (db_unfetched):        4
> status 2 (db_fetched):  1909
> status 3 (db_gone):     1
> status 4 (db_redir_temp):       1
> CrawlDb statistics: done
>
> $ bin/nutch readseg -list crawl/segments/*
> NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
> 20110801134606  1          2011-08-01T13:46:11  2011-08-01T13:46:11  1        1
> 20110801134620  35         2011-08-01T13:46:23  2011-08-01T13:46:52  36       30
> 20110801134706  257        2011-08-01T13:47:08  2011-08-01T13:47:57  257      256
> 20110801134825  720        2011-08-01T13:48:28  2011-08-01T13:50:45  720      720
> 20110801135116  684        2011-08-01T13:51:18  2011-08-01T13:53:29  684      684
> 20110803090956  201        ?                    ?                    ?        ?
> 20110803091137  201        2011-08-03T09:11:41  2011-08-03T09:12:40  201      197
> 20110803091304  21         2011-08-03T09:13:07  2011-08-03T09:13:10  21       21
>
>
>
> --
> Best regards
> Dipl.-Inf. Christian Weiske
>
> Senior Developer
> Netresearch GmbH & Co. KG
>

Re: Error "Input path does not exist" when crawling

Posted by Christian Weiske <ch...@netresearch.de>.
Hello Dinçer,


> One more thing, will you share the stats as:
> 
> *$ bin/nutch readdb crawl-dir/crawldb -stats*
> *$ bin/nutch readseg -list crawl-dir/segments/**
> 
> When I got that error, the latter list shows that one (or more)
> segments is not finished well. But now you can see my segments seem
> ok. What about yours?

I got the problem again, and I have a segment without data:

$ bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:	1915
retry 0:	1911
retry 2:	4
min score:	0.0
avg score:	0.0013519583
max score:	1.056
status 1 (db_unfetched):	4
status 2 (db_fetched):	1909
status 3 (db_gone):	1
status 4 (db_redir_temp):	1
CrawlDb statistics: done

$ bin/nutch readseg -list crawl/segments/*
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20110801134606  1          2011-08-01T13:46:11  2011-08-01T13:46:11  1        1
20110801134620  35         2011-08-01T13:46:23  2011-08-01T13:46:52  36       30
20110801134706  257        2011-08-01T13:47:08  2011-08-01T13:47:57  257      256
20110801134825  720        2011-08-01T13:48:28  2011-08-01T13:50:45  720      720
20110801135116  684        2011-08-01T13:51:18  2011-08-01T13:53:29  684      684
20110803090956  201        ?                    ?                    ?        ?
20110803091137  201        2011-08-03T09:11:41  2011-08-03T09:12:40  201      197
20110803091304  21         2011-08-03T09:13:07  2011-08-03T09:13:10  21       21



-- 
Best regards
Dipl.-Inf. Christian Weiske

Senior Developer
Netresearch GmbH & Co. KG

Re: Error "Input path does not exist" when crawling

Posted by Christian Weiske <ch...@netresearch.de>.
Hello Dinçer,



> > > > Somewhere during the crawling process I get an error that stops
> > > > everything:
> > > > file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801090707
> > > > Exception in thread "main"
> > > > org.apache.hadoop.mapred.InvalidInputException: Input path does
> > > > not exist:
> > > I have had same problem in one of my instances. Let's dig
> > > together, at least. I have tried to re-crawl the url list into
> > > same crawl directory (crawl-301 in your case) and got the same
> > > error, will you confirm for your case?
> URLs does not matter actually. Same URLs may do it. Just try to do the
> crawling operation once more, just as in the first run. The thing is
> I am not out of disk space (for esp. tmp) and I can sometimes get it
> done without problems in this manner (yes I have some other problems
> such redirection).

I actually cannot reproduce the issue now. Very strange.

-- 
Best regards
Dipl.-Inf. Christian Weiske

Senior Developer
Netresearch GmbH & Co. KG

Re: Error "Input path does not exist" when crawling

Posted by Dinçer Kavraal <dk...@gmail.com>.
Hi Christian,

The URLs do not actually matter; the same URLs may trigger it. Just run the
crawl once more, exactly as in the first run. The thing is, I am not out of
disk space (especially for /tmp), and I can sometimes get the crawl done
without problems this way (yes, I have some other problems, such as
redirection).

But once I have hit this error, rerunning the crawl with
# bin/nutch crawl -dir crawlIntoDir urlsDir -depth 2 -threads 25
gives the same error.

One more thing, will you share the stats as:

*$ bin/nutch readdb crawl-dir/crawldb -stats*
CrawlDb statistics start: crawl-dir/crawldb
Statistics for CrawlDb: crawl-dir/crawldb
TOTAL urls: 956
retry 0: 956
min score: 0.0
avg score: 0.009015691
max score: 1.339
status 1 (db_unfetched): 790
status 2 (db_fetched): 126
status 4 (db_redir_temp): 19
status 5 (db_redir_perm): 21
CrawlDb statistics: done


and

*$ bin/nutch readseg -list crawl-dir/segments/**
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20110730005815  3          2011-07-30T00:58:18  2011-07-30T00:58:18  3        3
20110730005828  163        2011-07-30T00:58:30  2011-07-30T01:05:32  201      123


When I got that error, the latter listing showed that one (or more) segments
had not finished properly. But now you can see that my segments look OK.
What about yours?

Dinçer


2011/8/1 Christian Weiske <ch...@netresearch.de>

> Hello Dinçer,
>
>
> > > Somewhere during the crawling process I get an error that stops
> > > everything:
> > >
> file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801090707
> > > Exception in thread "main"
> > > org.apache.hadoop.mapred.InvalidInputException: Input path does not
> > > exist:
>
> > I have had same problem in one of my instances. Let's dig together, at
> > least. I have tried to re-crawl the url list into same crawl directory
> > (crawl-301 in your case) and got the same error, will you confirm for
> > your case?
>
> How do you re-crawl the list? Is there a specific URL list in the
> segment?
>
> --
> Best regards
> Christian Weiske
>

Re: Error "Input path does not exist" when crawling

Posted by Christian Weiske <ch...@netresearch.de>.
Hello Dinçer,


> > Somewhere during the crawling process I get an error that stops
> > everything:
> > file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801090707
> > Exception in thread "main"
> > org.apache.hadoop.mapred.InvalidInputException: Input path does not
> > exist:

> I have had same problem in one of my instances. Let's dig together, at
> least. I have tried to re-crawl the url list into same crawl directory
> (crawl-301 in your case) and got the same error, will you confirm for
> your case?

How do you re-crawl the list? Is there a specific URL list in the
segment?

-- 
Best regards
Christian Weiske

Re: Error "Input path does not exist" when crawling

Posted by Dinçer Kavraal <dk...@gmail.com>.
Hi,

I have had the same problem in one of my instances. Let's dig into it
together, at least. I have tried to re-crawl the URL list into the same
crawl directory (crawl-301 in your case) and got the same error; can you
confirm whether the same happens in your case?

Best,
Dincer

2011/8/1 Christian Weiske <ch...@netresearch.de>

> Hello,
>
>
> I setup nutch 1.3 to crawl our mediawiki instance.
> Somewhere during the crawling process I get an error that stops
> everything:
>
> ---------
> LinkDb: starting at 2011-08-01 09:27:51
> LinkDb: linkdb: crawl-301/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment:
> file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801084037
> LinkDb: adding segment:
> [20 more of that]
> file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801090707
> Exception in thread "main"
> org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist: file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801083518/parse_data
> Input path does not exist:
> file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801091638/parse_data
> Input path does not exist:
> file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20110801091806/parse_data
>        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
>        at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
> ---------
>
>
> What can I do to fix this?
>
>
> --
> Best regards
> Christian Weiske
>

Re: Error "Input path does not exist" when crawling

Posted by Markus Jelsma <ma...@openindex.io>.
You either didn't parse segment 20110801091638, or an error occurred during
parsing of that segment that prevented it from completing. Also make sure you
have no DiskChecker exceptions, which usually indicate a lack of /tmp disk
space.
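Since the missing directory is always <segment>/parse_data, one way to spot
such half-finished segments before the LinkDb step is to check every segment
directory for a parse_data subdirectory; segments without one need to be
re-parsed (e.g. with bin/nutch parse) or removed. A minimal sketch of the
check; the demo-segments layout below is made up for illustration, so point
SEGDIR at your real crawl-301/segments instead:

```shell
#!/bin/sh
# List segments that have no parse_data directory, i.e. segments whose
# parse step never ran or did not complete.
SEGDIR=demo-segments               # stand-in for crawl-301/segments

# Scratch layout for the demo: one unparsed segment, one parsed one.
mkdir -p "$SEGDIR/20110801083518" "$SEGDIR/20110801084037/parse_data"

for seg in "$SEGDIR"/*; do
  if [ ! -d "$seg/parse_data" ]; then
    echo "unparsed segment: $seg"
    # On a real crawl you would now re-parse it: bin/nutch parse "$seg"
    # ...or delete the segment directory so LinkDb does not see it.
  fi
done
# prints: unparsed segment: demo-segments/20110801083518

rm -r "$SEGDIR"                    # clean up the demo layout
```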

On Monday 01 August 2011 09:32:27 Christian Weiske wrote:
> Hello,
> 
> 
> I setup nutch 1.3 to crawl our mediawiki instance.
> Somewhere during the crawling process I get an error that stops
> everything:
> 
> ---------
> LinkDb: starting at 2011-08-01 09:27:51
> LinkDb: linkdb: crawl-301/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment:
> file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20
> 110801084037 LinkDb: adding segment:
> [20 more of that]
> file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20
> 110801090707 Exception in thread "main"
> org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist:
> file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/2
> 0110801083518/parse_data Input path does not exist:
> file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20
> 110801091638/parse_data Input path does not exist:
> file:/home/cweiske/bin/apache-nutch-1.3/runtime/local/crawl-301/segments/20
> 110801091806/parse_data at
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:19
> 0) at
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInp
> utFormat.java:44) at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201
> ) at
> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> 	at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> 	at
> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> 	at
> org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> 	at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> 	at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
> 	at org.apache.nutch.crawl.Crawl.run(Crawl.java:142)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> 	at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
> ---------
> 
> 
> What can I do to fix this?

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350