Posted to dev@nutch.apache.org by Nic M <ni...@gmail.com> on 2009/06/02 18:10:12 UTC
IOException in dedup
Hello,
I am new to Nutch and have set up Nutch 0.9 on Easy Eclipse for
Mac OS X. When I try to start crawling I get the following exception:
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
Does anyone know how to solve this problem?
Nic M
Re: IOException in dedup
Posted by MyD <my...@googlemail.com>.
I had the same problem when I forgot to add the URL field in the
index. Maybe you have the same problem.
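For what it's worth, here is a minimal sketch (Lucene 2.1-era API; the class name and field values are invented for illustration) of the kind of document the dedup step expects to find in each index: a stored, untokenized url field, plus a digest field if memory serves. If an indexing filter never adds these, DeleteDuplicates has nothing to read back and the job can fail in odd ways.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Illustrative only: roughly what each indexed page should carry
    // so that the dedup job can read the url (and digest) back out.
    public class DedupFieldsSketch {
        public static Document makeDoc(String url, String digest) {
            Document doc = new Document();
            // Stored, untokenized url -- looked up per document during dedup.
            doc.add(new Field("url", url, Field.Store.YES, Field.Index.UN_TOKENIZED));
            // Content hash used to spot duplicate pages (example: an MD5 hex string).
            doc.add(new Field("digest", digest, Field.Store.YES, Field.Index.UN_TOKENIZED));
            return doc;
        }
    }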
Regards,
MyD
On Jun 3, 2009, at 1:13 AM, Nic M wrote:
>
> On Jun 2, 2009, at 12:41 PM, Ken Krugler wrote:
>
>>> Hello,
>>>
>>> I am new to Nutch and have set up Nutch 0.9 on Easy Eclipse
>>> for Mac OS X. When I try to start crawling I get the following
>>> exception:
>>>
>>> Dedup: starting
>>> Dedup: adding indexes in: crawl/indexes
>>> Exception in thread "main" java.io.IOException: Job failed!
>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>>> at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>>> at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>>>
>>>
>>> Does anyone know how to solve this problem?
>>
>> You can get an IOException reported by Hadoop when the root cause
>> is that you've run out of memory. Normally the hadoop.log file
>> would have the OOM exception.
>>
>> If you're running from inside of Eclipse, see http://wiki.apache.org/nutch/RunNutchInEclipse0.9
>> for more details.
>>
>> -- Ken
>> --
>> Ken Krugler
>> +1 530-210-6378
>
> Thank you for the pointers, Ken. I changed the VM memory parameters
> as shown at http://wiki.apache.org/nutch/RunNutchInEclipse0.9.
> However, I still get the exception, and in the Hadoop log I see the
> following:
>
> 2009-06-02 13:08:18,790 INFO indexer.DeleteDuplicates - Dedup: starting
> 2009-06-02 13:08:18,817 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/indexes
> 2009-06-02 13:08:19,064 WARN mapred.LocalJobRunner - job_7izmuc
> java.lang.ArrayIndexOutOfBoundsException: -1
> at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
> at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
> at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
>
> I am running Lucene 2.1.0. Any idea why I am getting the
> ArrayIndexOutOfBoundsException?
>
> Nic
>
>
>
Re: IOException in dedup
Posted by Nic M <ni...@gmail.com>.
I used the patch and everything seems to be working fine at the
moment. Thanks, Doğacan.
Nic M
On Jun 3, 2009, at 12:07 PM, Doğacan Güney wrote:
> On Tue, Jun 2, 2009 at 20:13, Nic M <ni...@gmail.com> wrote:
>
> On Jun 2, 2009, at 12:41 PM, Ken Krugler wrote:
>
>>> Hello,
>>>
>>> I am new to Nutch and have set up Nutch 0.9 on Easy Eclipse
>>> for Mac OS X. When I try to start crawling I get the following
>>> exception:
>>>
>>> Dedup: starting
>>> Dedup: adding indexes in: crawl/indexes
>>> Exception in thread "main" java.io.IOException: Job failed!
>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>>> at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>>> at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>>>
>>>
>>> Does anyone know how to solve this problem?
>>
>
>
> You may be running into this problem:
>
> https://issues.apache.org/jira/browse/NUTCH-525
>
> I suggest updating to 1.0 or applying the patch there.
>
>>
>> You can get an IOException reported by Hadoop when the root cause
>> is that you've run out of memory. Normally the hadoop.log file
>> would have the OOM exception.
>>
>> If you're running from inside of Eclipse, see http://wiki.apache.org/nutch/RunNutchInEclipse0.9
>> for more details.
>>
>> -- Ken
>> --
>> Ken Krugler
>> +1 530-210-6378
>
> Thank you for the pointers, Ken. I changed the VM memory parameters
> as shown at http://wiki.apache.org/nutch/RunNutchInEclipse0.9.
> However, I still get the exception, and in the Hadoop log I see the
> following:
>
> 2009-06-02 13:08:18,790 INFO indexer.DeleteDuplicates - Dedup: starting
> 2009-06-02 13:08:18,817 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/indexes
> 2009-06-02 13:08:19,064 WARN mapred.LocalJobRunner - job_7izmuc
> java.lang.ArrayIndexOutOfBoundsException: -1
> at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
> at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
> at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
>
> I am running Lucene 2.1.0. Any idea why I am getting the
> ArrayIndexOutOfBoundsException?
>
> Nic
>
>
>
>
>
>
> --
> Doğacan Güney
Re: IOException in dedup
Posted by Doğacan Güney <do...@gmail.com>.
On Tue, Jun 2, 2009 at 20:13, Nic M <ni...@gmail.com> wrote:
>
> On Jun 2, 2009, at 12:41 PM, Ken Krugler wrote:
>
> Hello,
>
>
> I am new to Nutch and have set up Nutch 0.9 on Easy Eclipse for Mac OS
> X. When I try to start crawling I get the following exception:
>
>
> Dedup: starting
>
> Dedup: adding indexes in: crawl/indexes
>
> Exception in thread "main" java.io.IOException: Job failed!
>
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>
> at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>
>
>
> Does anyone know how to solve this problem?
>
>
You may be running into this problem:
https://issues.apache.org/jira/browse/NUTCH-525
I suggest updating to 1.0 or applying the patch there.
>
> You can get an IOException reported by Hadoop when the root cause is that
> you've run out of memory. Normally the hadoop.log file would have the OOM
> exception.
>
> If you're running from inside of Eclipse, see
> http://wiki.apache.org/nutch/RunNutchInEclipse0.9 for more details.
>
> -- Ken
>
> --
>
> Ken Krugler
> +1 530-210-6378
>
>
> Thank you for the pointers, Ken. I changed the VM memory parameters as shown
> at http://wiki.apache.org/nutch/RunNutchInEclipse0.9. However, I still get
> the exception, and in the Hadoop log I see the following:
>
> 2009-06-02 13:08:18,790 INFO indexer.DeleteDuplicates - Dedup: starting
> 2009-06-02 13:08:18,817 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/indexes
> 2009-06-02 13:08:19,064 WARN mapred.LocalJobRunner - job_7izmuc
> java.lang.ArrayIndexOutOfBoundsException: -1
> at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
> at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
> at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
>
> I am running Lucene 2.1.0. Any idea why I am getting the
> ArrayIndexOutOfBoundsException?
>
> Nic
>
>
>
>
--
Doğacan Güney
Re: IOException in dedup
Posted by Ken Krugler <kk...@transpac.com>.
>On Jun 2, 2009, at 12:41 PM, Ken Krugler wrote:
>
>>>Hello,
>>>
>>>
>>>I am new to Nutch and have set up Nutch 0.9 on Easy Eclipse
>>>for Mac OS X. When I try to start crawling I get the following
>>>exception:
>>>
>>>
>>>Dedup: starting
>>>
>>>Dedup: adding indexes in: crawl/indexes
>>>
>>>Exception in thread "main" java.io.IOException: Job failed!
>>>
>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>>>
>>> at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>>>
>>> at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>>>
>>>
>>>
>>>Does anyone know how to solve this problem?
>>>
>>
>>You can get an IOException reported by Hadoop when the root cause
>>is that you've run out of memory. Normally the hadoop.log file
>>would have the OOM exception.
>>
>>If you're running from inside of Eclipse,
>>see http://wiki.apache.org/nutch/RunNutchInEclipse0.9 for
>>more details.
>>
>>-- Ken
>>--
>>Ken Krugler
>>+1 530-210-6378
>>
>
>Thank you for the pointers, Ken. I changed the VM memory parameters
>as shown at http://wiki.apache.org/nutch/RunNutchInEclipse0.9.
>However, I still get the exception, and in the Hadoop log I see the
>following:
>
>2009-06-02 13:08:18,790 INFO indexer.DeleteDuplicates - Dedup: starting
>2009-06-02 13:08:18,817 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/indexes
>2009-06-02 13:08:19,064 WARN mapred.LocalJobRunner - job_7izmuc
>java.lang.ArrayIndexOutOfBoundsException: -1
> at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
> at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
> at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
>
>I am running Lucene 2.1.0. Any idea why I am getting the
>ArrayIndexOutOfBoundsException?
Most likely the index has been corrupted. If you can, try
opening it with Luke.
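If Luke isn't convenient, a quick programmatic sanity check works too. This is only a rough sketch; the part-00000 path is a guess at where one of the part indexes lives, so adjust it to your crawl directory:

    import org.apache.lucene.index.IndexReader;

    // Illustrative check of a single part index; a corrupted index usually
    // fails while opening, iterating, or loading stored documents.
    public class IndexSanityCheck {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("crawl/indexes/part-00000");
            System.out.println("maxDoc=" + reader.maxDoc() + " numDocs=" + reader.numDocs());
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (!reader.isDeleted(i)) {
                    reader.document(i); // throws if the stored fields are damaged
                }
            }
            reader.close();
        }
    }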
-- Ken
--
Ken Krugler
+1 530-210-6378
Re: IOException in dedup
Posted by Nic M <ni...@gmail.com>.
On Jun 2, 2009, at 12:41 PM, Ken Krugler wrote:
>> Hello,
>>
>> I am new to Nutch and have set up Nutch 0.9 on Easy Eclipse for
>> Mac OS X. When I try to start crawling I get the following exception:
>>
>> Dedup: starting
>> Dedup: adding indexes in: crawl/indexes
>> Exception in thread "main" java.io.IOException: Job failed!
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>> at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>> at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>>
>>
>> Does anyone know how to solve this problem?
>
> You can get an IOException reported by Hadoop when the root cause is
> that you've run out of memory. Normally the hadoop.log file would
> have the OOM exception.
>
> If you're running from inside of Eclipse, see http://wiki.apache.org/nutch/RunNutchInEclipse0.9
> for more details.
>
> -- Ken
> --
> Ken Krugler
> +1 530-210-6378
Thank you for the pointers, Ken. I changed the VM memory parameters as
shown at http://wiki.apache.org/nutch/RunNutchInEclipse0.9. However, I
still get the exception, and in the Hadoop log I see the following:
2009-06-02 13:08:18,790 INFO indexer.DeleteDuplicates - Dedup: starting
2009-06-02 13:08:18,817 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/indexes
2009-06-02 13:08:19,064 WARN mapred.LocalJobRunner - job_7izmuc
java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
I am running Lucene 2.1.0. Any idea why I am getting the
ArrayIndexOutOfBoundsException?
Nic
Re: IOException in dedup
Posted by Ken Krugler <kk...@transpac.com>.
>Hello,
>
>I am new to Nutch and have set up Nutch 0.9 on Easy Eclipse for
>Mac OS X. When I try to start crawling I get the following exception:
>
>Dedup: starting
>Dedup: adding indexes in: crawl/indexes
>Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
> at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>
>
>Does anyone know how to solve this problem?
You can get an IOException reported by Hadoop when the root cause is
that you've run out of memory. Normally the hadoop.log file would
have the OOM exception.
If you're running from inside of Eclipse, see
http://wiki.apache.org/nutch/RunNutchInEclipse0.9 for more details.
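For reference, the usual remedy is simply a larger heap. The numbers below are only an example (the wiki page above has the recommended values); in Eclipse they go into the Run Configuration's VM arguments, and from the shell I believe bin/nutch also honors a NUTCH_HEAPSIZE environment variable (a value in MB):

    -Xms256m -Xmx1024m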
-- Ken
--
Ken Krugler
+1 530-210-6378