Posted to user@nutch.apache.org by Ben Ogle <og...@gmail.com> on 2006/09/07 22:16:52 UTC

IOException: not a file with invertlinks/index

Hi all, I am having problems recrawling our intranet. Something in the
recrawl script (is it invertlinks?) creates a
crawldir\linkdb\current\linkdb-merge-<number> folder, which has a part-00000
folder under it. When the indexer is invoked, it looks for
crawldir\linkdb\current\linkdb-merge-<number>\data, but that file doesn't
exist there because it's in the part-00000 directory. How do I get the
indexer to look in the part-00000 dir? Is it a configuration error?

I am running a Python port of the recrawl script on a Windows 2000 machine
without Cygwin, while the crawldir and Nutch 0.8 are on a Windows 2003 server
that I have very limited access to. Here's what hadoop.log says about it:

2006-09-07 13:02:39,696 INFO  indexer.Indexer - Indexer: starting
2006-09-07 13:02:39,696 INFO  indexer.Indexer - Indexer: linkdb: F:/nutch-0.8/intranet-crawl/linkdb
2006-09-07 13:02:40,696 INFO  indexer.Indexer - Indexer: adding segment: F:/nutch-0.8/intranet-crawl/segments/20060907130151
2006-09-07 13:02:50,804 WARN  mapred.LocalJobRunner - job_fn20sr
java.io.IOException: Not a file: F:/nutch-0.8/intranet-crawl/linkdb/current/linkdb-merge-216906667/data
	at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:121)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:80)

If I move the contents of linkdb-merge-216906667/part-00000 to
linkdb-merge-216906667, indexing works OK (well, it won't delete _0.f0, but
that's another issue).
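
For anyone who wants to script that workaround, here is a minimal Python
sketch of the move described above. The function names and the part_name
default are illustrative assumptions, not part of Nutch. It only rearranges
files and does not explain why the part-00000 level is there in the first
place, so it is worth trying on a copy of the linkdb first.

```python
import shutil
from pathlib import Path


def flatten_linkdb_merge(merge_dir, part_name="part-00000"):
    """Move everything in merge_dir/part_name up into merge_dir.

    Returns the names of the entries that were moved, or an empty list
    if the part directory is absent. (Hypothetical helper, not a Nutch API.)
    """
    merge_dir = Path(merge_dir)
    part_dir = merge_dir / part_name
    if not part_dir.is_dir():
        return []
    moved = []
    for entry in sorted(part_dir.iterdir()):
        target = merge_dir / entry.name
        if target.exists():
            # Never clobber something already sitting in the merge directory.
            raise FileExistsError(f"refusing to overwrite {target}")
        shutil.move(str(entry), str(target))
        moved.append(entry.name)
    part_dir.rmdir()  # succeeds only because the directory is now empty
    return moved


def flatten_all(linkdb_current):
    """Apply the workaround to every linkdb-merge-* directory found."""
    results = {}
    for merge_dir in sorted(Path(linkdb_current).glob("linkdb-merge-*")):
        if merge_dir.is_dir():
            results[merge_dir.name] = flatten_linkdb_merge(merge_dir)
    return results
```

With the paths from the log above, the call would be
flatten_all("F:/nutch-0.8/intranet-crawl/linkdb/current"), run after
invertlinks and before the indexer.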

The same thing happens when this linkdb-merge-* directory exists already and
I run invertlinks. 

What am I doing wrong? I haven't been able to find anyone else with these
issues, so I must be doing something wrong.

Ben
-- 
View this message in context: http://www.nabble.com/IOException%3A-not-a-file-with-invertlinks-index-tf2235304.html#a6197542
Sent from the Nutch - User forum at Nabble.com.


Re: IOException: not a file with invertlinks/index

Posted by maximus1 <iw...@gmail.com>.
Hey Ben, 

Did you find a solution? I'm having the same problem with Cygwin and
nutch-0.9.

Thanks mate
Cornelius




-- 
View this message in context: http://www.nabble.com/IOException%3A-not-a-file-with-invertlinks-index-tp6197542p14309409.html
Sent from the Nutch - User mailing list archive at Nabble.com.