Posted to java-user@lucene.apache.org by Rafael Rossini <ra...@gmail.com> on 2007/07/25 20:38:09 UTC

Delete corrupted doc

Hi guys,

    Is there a way of deleting a document that, because of some corruption,
got a docID larger than maxDoc()? I'm trying to do this but I get this
exception:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 106577
   at org.apache.lucene.util.BitVector.set(BitVector.java:53)
   at org.apache.lucene.index.SegmentReader.doDelete(SegmentReader.java:301)
   at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:674)
   at org.apache.lucene.index.MultiReader.doDelete(MultiReader.java:125)
   at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:674)
   at teste.DeleteError.main(DeleteError.java:9)

Thanks

Re: Delete corrupted doc

Posted by Mark Miller <ma...@gmail.com>.

This may not be very elegant, but if you are really in a jam, here is what I
would try:

Check out a copy of Lucene. Modify the isDeleted method on both MultiReader
and SegmentReader so that it returns true if the doc id passed in is the id
in question (for any other id, have the method do what it normally would).
This will keep Lucene from looking in the deleted-docs bit vector and
throwing an ArrayIndexOutOfBoundsException; instead it will just report that
the bad id has been deleted. Then run a full optimize. When the new segments
are created, your bad doc should be skipped and not make it into the new
generation.
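
As a rough illustration of that idea (not a patch against Lucene itself), the
sketch below uses a hypothetical stand-in class, PatchedDeletesSketch, with a
plain boolean array in place of Lucene's deleted-docs bit vector. The maxDoc
of 100000 is an assumption made up for the example; 106577 is the id from the
stack trace.

```java
// Hypothetical stand-in for a segment reader; this is NOT Lucene's SegmentReader.
public class PatchedDeletesSketch {
    private final boolean[] deleted; // one flag per doc id, sized to maxDoc
    private final int badDocId;      // the out-of-range id to mask

    public PatchedDeletesSketch(int maxDoc, int badDocId) {
        this.deleted = new boolean[maxDoc];
        this.badDocId = badDocId;
    }

    public int maxDoc() {
        return deleted.length;
    }

    // Unpatched path: any docId >= maxDoc throws ArrayIndexOutOfBoundsException.
    public void delete(int docId) {
        deleted[docId] = true;
    }

    // Patched check: answer for the bad id before touching the array.
    public boolean isDeleted(int docId) {
        if (docId == badDocId) {
            return true;
        }
        return deleted[docId];
    }

    public static void main(String[] args) {
        PatchedDeletesSketch reader = new PatchedDeletesSketch(100000, 106577);
        System.out.println("bad id reported deleted: " + reader.isDeleted(106577));
        try {
            reader.delete(106577);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("delete on the bad id still fails: " + e.getMessage());
        }
    }
}
```

Note that only the isDeleted check is short-circuited here; the delete path is
left alone and still fails for the out-of-range id.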

Kind of a pain, but if you really cannot reindex...

- Mark

Re: Delete corrupted doc

Posted by Yonik Seeley <yo...@apache.org>.

On 7/26/07, Rafael Rossini <ra...@gmail.com> wrote:
> Well... thanks for the help, this was really my last resort (rebuild) but
> I think I have no other choice... I really can't tell whether this
> corruption was caused by bad hardware or not, but do you guys have any
> idea what might have happened here?

Unless it's reproducible, I'd chalk it up as a hardware blip.

> Could I have generated this corruption somehow?

No, anything you can do through Lucene's API should result in a valid index.

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Delete corrupted doc

Posted by Rafael Rossini <ra...@gmail.com>.

Well... thanks for the help, this was really my last resort (rebuild) but
I think I have no other choice... I really can't tell whether this
corruption was caused by bad hardware or not, but do you guys have any
idea what might have happened here? Could I have generated this
corruption somehow?

Re: Delete corrupted doc

Posted by Yonik Seeley <yo...@apache.org>.

On 7/26/07, Mark Miller <ma...@gmail.com> wrote:
> Anyway, what this says to me (and I should have realized this before) is
> that there is no document with your corrupt id, rather there is a term that
> thinks it is in that invalid doc id. The corruption must be in the
> term:docids inverted index.

Correct.  And the ids that *are* less than maxDoc may be incorrect
(pointing to the wrong docs), and there could be other terms with the
same issue.
The index should be rebuilt.

-Yonik

Re: Delete corrupted doc

Posted by Mark Miller <ma...@gmail.com>.

From what I can tell, you shouldn't need to even try my first suggestion
(what happened to the experts on this question by the way?).

Returning true from isDeleted for the corrupt id should not matter.

It appears to me that deletes are handled by keeping a simple list of the
ids that are deleted. When a merge is done and a new segment is created,
the deleted ids are just not brought along for the ride. Instead, only docs
0 through maxDoc()-1 with isDeleted=false are put into the new segment. Your
corrupt id, which is greater than maxDoc(), should not make it into the new
segment as it will never be retrieved.
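
A minimal sketch of that merge behaviour, as described above and not taken
from Lucene's source: the class MergeSketch, its method name, and the use of
strings as documents are all assumptions for illustration. The loop never
looks at an id at or beyond maxDoc, so an out-of-range id cannot be copied.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a segment merge; not Lucene's actual merge code.
public class MergeSketch {

    // Copies every live (non-deleted) doc with id in [0, maxDoc) into a new list.
    public static List<String> mergeLiveDocs(String[] docs, boolean[] deleted) {
        int maxDoc = docs.length;
        List<String> merged = new ArrayList<>();
        for (int docId = 0; docId < maxDoc; docId++) {
            if (!deleted[docId]) {
                merged.add(docs[docId]);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        String[] docs = {"doc0", "doc1", "doc2", "doc3"};
        boolean[] deleted = {false, true, false, false}; // doc1 is marked deleted
        System.out.println(mergeLiveDocs(docs, deleted));
    }
}
```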

Anyway, what this says to me (and I should have realized this before) is
that there is no document with your corrupt id, rather there is a term that
thinks it is in that invalid doc id. The corruption must be in the
term:docids inverted index.
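
To make that concrete, here is a small sketch of how such a bad posting could
be detected. It models the inverted index as a plain Map from term to doc ids,
which is an assumption for illustration; PostingsCheckSketch and its method
are hypothetical names, and against a real index the same walk would go
through the reader's term and postings enumeration instead of a Map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a term -> doc ids index; not Lucene's on-disk format.
public class PostingsCheckSketch {

    // Returns the terms that have at least one posting outside [0, maxDoc).
    public static List<String> termsWithOutOfRangeDocs(Map<String, int[]> postings, int maxDoc) {
        List<String> suspect = new ArrayList<>();
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int docId : entry.getValue()) {
                if (docId < 0 || docId >= maxDoc) {
                    suspect.add(entry.getKey());
                    break; // one bad posting is enough to flag the term
                }
            }
        }
        return suspect;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new LinkedHashMap<>();
        postings.put("lucene", new int[] {0, 2, 3});
        postings.put("corrupt", new int[] {1, 106577}); // id from the stack trace
        System.out.println(termsWithOutOfRangeDocs(postings, 100000)); // maxDoc is made up
    }
}
```

A clean result from a check like this would not prove the index is sound: a
damaged posting can just as well carry an id that is in range but points to
the wrong document.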

Getting that invalid number out of that file might be rather difficult.
There are some brilliant guys on the list that might have an idea how to do
it though. Certainly my approach in that first e-mail will not do it.

I will try to think of something if no one chimes in. Obviously, re-index
will be the easiest solution <g>

- Mark

Re: Delete corrupted doc

Posted by Rafael Rossini <ra...@gmail.com>.

I see, thanks.

On 7/26/07, Mike Klaas <mi...@gmail.com> wrote:
>
> You are probably not using the compound file format.  Try setting:
>     <useCompoundFile>true</useCompoundFile>
>
> in solrconfig.xml
>
> -Mike

Re: Delete corrupted doc

Posted by Mike Klaas <mi...@gmail.com>.

On 26-Jul-07, at 10:18 AM, Rafael Rossini wrote:

> Yes, I optimized, but with Solr. I don't know why, but when I optimize
> an index with Solr, it leaves you with about 15 files instead of the 3...

You are probably not using the compound file format.  Try setting:
     <useCompoundFile>true</useCompoundFile>

in solrconfig.xml

-Mike

Re: Delete corrupted doc

Posted by Rafael Rossini <ra...@gmail.com>.

Yes, I optimized, but with Solr. I don't know why, but when I optimize
an index with Solr, it leaves you with about 15 files instead of the 3...
I'll try to optimize directly in Lucene and see what happens; if that
doesn't help I'll try your suggestion. Thanks a lot Mark!!

Re: Delete corrupted doc

Posted by Mark Miller <ma...@gmail.com>.

You know, on second thought, a merge shouldn't even try to access a doc >
maxDoc (I think). Have you just tried an optimize?
