Posted to dev@nutch.apache.org by "minhthucpham (JIRA)" <ji...@apache.org> on 2009/03/19 14:25:50 UTC
[jira] Commented: (NUTCH-525) DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment
[ https://issues.apache.org/jira/browse/NUTCH-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683463#action_12683463 ]
minhthucpham commented on NUTCH-525:
------------------------------------
Can anyone guide me on how to apply the deleteDups.patch? I have downloaded it but don't know how to install it.
I am using Cygwin on Windows, and my JDK is jdk1.6.0_07.
Thanks very much
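In case it helps, patches attached to JIRA issues are usually applied with the `patch` utility (available in Cygwin) from the top of the source tree, e.g. `patch -p0 < deleteDups.patch`, followed by an `ant` rebuild. A minimal self-contained sketch of the same mechanism, using a dummy file and diff rather than the real Nutch tree:

```shell
# Demonstrate how "patch -p0" applies a unified diff (dummy tree, not Nutch).
mkdir -p demo/src && cd demo
printf 'foo\n' > src/File.txt
# a unified diff with paths relative to the tree root, as "svn diff" produces
cat > fix.patch <<'EOF'
--- src/File.txt
+++ src/File.txt
@@ -1 +1 @@
-foo
+bar
EOF
# -p0 strips zero leading path components, so paths resolve from here
patch -p0 < fix.patch
cat src/File.txt
```

For the real patch you would run the same `patch -p0 < deleteDups.patch` command from the Nutch source root and then rebuild with `ant` so the fix is compiled in.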
> DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment
> -------------------------------------------------------------------------------------------------
>
> Key: NUTCH-525
> URL: https://issues.apache.org/jira/browse/NUTCH-525
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.9.0
> Environment: Fedora OS, JDK 1.6, Hadoop FS
> Reporter: Vishal Shah
> Fix For: 1.0.0
>
> Attachments: deleteDups.patch, RededupUnitTest.patch
>
>
> When trying to rerun dedup on a segment, we get the following Exception:
> java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 261883
> at org.apache.lucene.util.BitVector.get(BitVector.java:72)
> at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
> at org.apache.nutch.indexer.DeleteDuplicates1$InputFormat$DDRecordReader.next(DeleteDuplicates1.java:167)
> at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
> at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
> To reproduce the error, try creating two segments with identical urls - fetch, parse, index and dedup the 2 segments. Then rerun dedup.
> The error comes from the DDRecordReader.next() method:
> //skip past deleted documents
> while (indexReader.isDeleted(doc) && doc < maxDoc) doc++;
> If the last document in the index is deleted, then this loop will skip past the last document and call indexReader.isDeleted(doc) again.
> The order of the conditions should be swapped, so that the bounds check runs before isDeleted() is called, in order to fix the problem.
> I've attached a patch here.
> On a related note, why should we skip past deleted documents at all? The only time this will happen is when we are rerunning dedup on a segment. If documents were deleted for no reason other than dedup, then they should be given a chance to compete again, shouldn't they? We could fix this by putting an indexReader.undeleteAll() in the constructor for DDRecordReader. Any thoughts on this?
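For reference, a minimal sketch of the corrected skip loop, with the bounds check evaluated first so isDeleted() is never called with doc == maxDoc. A java.util.BitSet stands in here for the Lucene reader's deleted-docs bit vector; the names are illustrative, not the actual patch:

```java
import java.util.BitSet;

public class SkipDeletedDemo {
    public static void main(String[] args) {
        int maxDoc = 3;
        BitSet deleted = new BitSet(maxDoc);
        deleted.set(2);                  // the last document is deleted
        int doc = 2;
        // short-circuit &&: "doc < maxDoc" is checked before the bit lookup,
        // so we never index past the end of the bit vector
        while (doc < maxDoc && deleted.get(doc)) doc++;
        System.out.println(doc);         // loop exits at doc == maxDoc
    }
}
```

With the original operand order, deleted.get(doc) would be evaluated once with doc == maxDoc, which is exactly the out-of-range access reported in the stack trace above.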
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.