Posted to user@nutch.apache.org by Ben Ogle <og...@gmail.com> on 2006/09/07 22:16:52 UTC
IOException: not a file with invertlinks/index
Hi all, I am having problems recrawling our intranet. Something in the
recrawl script (is it invertlinks?) creates a
crawldir\linkdb\current\linkdb-merge-<number> folder, which has a part-00000
folder under it. When the indexer is invoked, it looks for
crawldir\linkdb\current\linkdb-merge-<number>\data, but that file doesn't
exist because it is in the part-00000 directory. How do I get the indexer to
look in the part-00000 dir? Is it a configuration error?
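
To confirm which layout a given crawl directory actually has, a small check along these lines might help. It is only a sketch: the function name describe_linkdb_current is made up for illustration, and it assumes nothing beyond the folder and file names mentioned above (linkdb-merge-<number>, part-00000, data).

```python
import os


def describe_linkdb_current(current_dir):
    """Report where the 'data' file sits for each linkdb-merge-* directory.

    Hypothetical helper: it only inspects the directory layout described in
    the post and does not call into Nutch or Hadoop.
    """
    report = {}
    for name in sorted(os.listdir(current_dir)):
        merge_dir = os.path.join(current_dir, name)
        # Skip anything that is not a linkdb-merge-<number> directory.
        if not (name.startswith("linkdb-merge-") and os.path.isdir(merge_dir)):
            continue
        if os.path.isfile(os.path.join(merge_dir, "data")):
            report[name] = "data directly inside"
        elif os.path.isfile(os.path.join(merge_dir, "part-00000", "data")):
            report[name] = "data under part-00000"
        else:
            report[name] = "no data file found"
    return report
```

Calling describe_linkdb_current on the linkdb\current folder returns a dictionary keyed by directory name, so a nested part-00000 layout shows up before the indexer is run.
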
I am running a Python port of the recrawl script on a Windows 2000 machine
without Cygwin, while the crawldir and Nutch 0.8 are on a Windows 2003 server
that I have very limited access to. Here's what hadoop.log says about it:
2006-09-07 13:02:39,696 INFO indexer.Indexer - Indexer: starting
2006-09-07 13:02:39,696 INFO indexer.Indexer - Indexer: linkdb: F:/nutch-0.8/intranet-crawl/linkdb
2006-09-07 13:02:40,696 INFO indexer.Indexer - Indexer: adding segment: F:/nutch-0.8/intranet-crawl/segments/20060907130151
2006-09-07 13:02:50,804 WARN mapred.LocalJobRunner - job_fn20sr
java.io.IOException: Not a file: F:/nutch-0.8/intranet-crawl/linkdb/current/linkdb-merge-216906667/data
    at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:121)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:80)
If I move the contents of linkdb-merge-216906667/part-00000 to
linkdb-merge-216906667, indexing works OK (well, it won't delete _0.f0, but
that's another issue).
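
For reference, the manual move could be scripted roughly as follows. This is a sketch of the workaround only, not a fix for whatever creates the nested folder; the function name flatten_part_dir and its part_name parameter are made up for illustration.

```python
import os
import shutil


def flatten_part_dir(merge_dir, part_name="part-00000"):
    """Move everything inside merge_dir/part_name up into merge_dir.

    Hypothetical helper mirroring the manual workaround. Returns the names
    of the entries that were moved (empty if part_name does not exist).
    """
    part_dir = os.path.join(merge_dir, part_name)
    if not os.path.isdir(part_dir):
        return []
    moved = []
    for entry in sorted(os.listdir(part_dir)):
        target = os.path.join(merge_dir, entry)
        # Refuse to overwrite anything already present in merge_dir.
        if os.path.exists(target):
            raise FileExistsError(target)
        shutil.move(os.path.join(part_dir, entry), target)
        moved.append(entry)
    # The now-empty part directory is left in place, since only moving its
    # contents is described; removing it would be a further assumption.
    return moved
```

It would be called with the full path of the linkdb-merge-<number> folder after invertlinks and before the indexer runs.
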
The same thing happens when this linkdb-merge-* directory exists already and
I run invertlinks.
What am I doing wrong? I haven't been able to find anyone else with these
issues, so I must be doing something wrong.
Ben
--
View this message in context: http://www.nabble.com/IOException%3A-not-a-file-with-invertlinks-index-tf2235304.html#a6197542
Sent from the Nutch - User forum at Nabble.com.
Re: IOException: not a file with invertlinks/index
Posted by maximus1 <iw...@gmail.com>.
Hey Ben,
Did you find a solution? I'm having the same problem with Cygwin and
nutch-0.9.
Thanks mate
Cornelius