Posted to java-user@lucene.apache.org by Michael van Rooyen <mv...@bigfoot.com> on 2006/02/19 23:06:42 UTC

Index missing documents

While building a large index, we had a power outage.  Over 2 million 
documents had been added, each document with up to about 20 fields.  The 
size of the index on disk is ~500MB.  When I started the process up again, I 
noticed that documents that should have been in the index were missing.  In 
retrospect, I think that Lucene was seeing the index as being completely 
empty (it now says there are 385 docs in the index, but all of those have 
been added since the power outage).  The size on disk is still ~500MB.  Does 
anyone have an idea what might cause the documents to disappear, and what 
can be done to get them back?  Rebuilding takes a while at 100ms per 
document, but it's a bit more concerning if such an outage or crash could 
cause documents to mysteriously disappear from the index...


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Index missing documents

Posted by Michael van Rooyen <mv...@bigfoot.com>.
I'm using Lucene 1.4.3, and maxBufferedDocs only appears to be in the new 
(unreleased?) version of IndexWriter in CVS.  Looking at the code though, 
setMaxBufferedDocs(n) just translates to minMergeDocs = n.  My index was 
constructed using the default minMergeDocs = 10, so buffering doesn't seem 
to be the culprit that caused all 2 million+ documents to be missing from 
the crashed index.  It seems more likely that none of the index files were 
"registered" in Lucene's segments file.  Is there perhaps some other trigger 
that causes Lucene to "register" the index files in the segments file, or is 
there some way of flushing the segments file every so often to ensure that 
its list is up to date?  Thanks again for your assistance.

Michael.

----- Original Message ----- 
From: "Otis Gospodnetic" <ot...@yahoo.com>
To: <ja...@lucene.apache.org>
Sent: Monday, February 20, 2006 8:39 PM
Subject: Re: Index missing documents


> No, using the same IndexWriter is the way to go.  If you want things to be 
> written to disk more frequently, lower the maxBufferedDocs setting.  Go 
> down to 1, if you want.  You'll use less memory (RAM) and Documents will 
> be written to disk without getting buffered in RAM, but the indexing 
> process will be slower.
>
> Otis
>




Re: Index missing documents

Posted by Otis Gospodnetic <ot...@yahoo.com>.
No, using the same IndexWriter is the way to go.  If you want things to be written to disk more frequently, lower the maxBufferedDocs setting.  Go down to 1, if you want.  You'll use less memory (RAM) and Documents will be written to disk without getting buffered in RAM, but the indexing process will be slower.

Otis

----- Original Message ----
From: Michael van Rooyen <mv...@bigfoot.com>
To: java-user@lucene.apache.org; Otis Gospodnetic <ot...@yahoo.com>
Sent: Monday, February 20, 2006 3:20:22 AM
Subject: Re: Index missing documents

Thanks Otis.  All the documents were written using the same IndexWriter, 
without ever closing it.  Could this be responsible for the documents not 
being in the segments file, or is it bad practice?  Maybe I should use a 
writer for a batch of documents (1000 or so maybe?), and then discard it and 
start with a fresh one.  Would this help?







Re: Index missing documents

Posted by Michael van Rooyen <mv...@bigfoot.com>.
Thanks Otis.  All the documents were written using the same IndexWriter, 
without ever closing it.  Could this be responsible for the documents not 
being in the segments file, or is it bad practice?  Maybe I should use a 
writer for a batch of documents (1000 or so maybe?), and then discard it and 
start with a fresh one.  Would this help?
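
For illustration, the batching idea in this message could be sketched roughly as below. This is only a sketch under stated assumptions: DocWriter, BatchingIndexer and committedAfterCrash are hypothetical names, not Lucene classes. In a real program the DocWriter implementation would presumably wrap an IndexWriter opened on the existing index, on the assumption that closing it writes out buffered documents and brings the segments file up to date. The in-memory writer here only counts documents, so the example shows the bookkeeping, not Lucene's behaviour.

```java
import java.util.function.Supplier;

class BatchingIndexerSketch {

    /** Hypothetical stand-in for an index writer; not a Lucene type. */
    interface DocWriter {
        void add(String doc);
        void close(); // assumed to make everything added so far durable
    }

    /** Sends documents to a writer that is closed and replaced every batchSize documents. */
    static final class BatchingIndexer {
        private final Supplier<DocWriter> opener;
        private final int batchSize;
        private DocWriter current;
        private int inBatch;

        BatchingIndexer(Supplier<DocWriter> opener, int batchSize) {
            if (batchSize < 1) throw new IllegalArgumentException("batchSize must be >= 1");
            this.opener = opener;
            this.batchSize = batchSize;
        }

        void add(String doc) {
            if (current == null) current = opener.get(); // open lazily, once per batch
            current.add(doc);
            if (++inBatch == batchSize) closeCurrent();  // checkpoint at the batch boundary
        }

        /** Call once at the end so a partial final batch is not left unclosed. */
        void finish() {
            closeCurrent();
        }

        private void closeCurrent() {
            if (current != null) {
                current.close();
                current = null;
                inBatch = 0;
            }
        }
    }

    /** Counts how many documents were closed out if the process dies after totalDocs adds. */
    static int committedAfterCrash(int totalDocs, int batchSize) {
        int[] committed = {0};
        BatchingIndexer indexer = new BatchingIndexer(() -> new DocWriter() {
            private int pending;
            public void add(String doc) { pending++; }
            public void close() { committed[0] += pending; pending = 0; }
        }, batchSize);
        for (int i = 0; i < totalDocs; i++) indexer.add("doc-" + i);
        // finish() is deliberately not called: the crash is assumed to happen here.
        return committed[0];
    }

    public static void main(String[] args) {
        System.out.println(committedAfterCrash(2500, 1000)); // prints 2000
    }
}
```

With a batch of 1000, at most the documents of the unfinished batch are at risk in this model. The cost is reopening a writer for every batch, which the sketch does not measure.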




Re: Index missing documents

Posted by Otis Gospodnetic <ot...@yahoo.com>.
It is possible that your Documents were added to various index files, but those were not yet "registered" in the "segments" file.  Lucene knows only about index segments that are listed in the segments file.  Any other files in the index directory are ignored.

Also, some Documents are kept in memory while indexing (see maxBufferedDocs in IndexWriter), so if a power outage happened before they were written to disk, they would be lost, too.

Otis
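
The two failure modes described in this reply can be illustrated with a small toy model. It is not Lucene code, and every name in it (SegmentsFileModel, flushToDisk, registerSegments, crash) is made up for the example. The model deliberately separates "written to disk" from "listed in the segments file" so the difference is visible; it makes no claim about when Lucene itself performs either step, and registerSegments() is not meant to suggest that such a repair call exists.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only; none of these names are Lucene classes or methods. */
class SegmentsFileModel {
    private final int maxBuffered;                                          // assumed flush threshold
    private final List<String> ramBuffer = new ArrayList<>();               // lost on a crash
    private final Map<String, Integer> filesOnDisk = new LinkedHashMap<>(); // segment name -> doc count
    private final List<String> segmentsFile = new ArrayList<>();            // the only names a reader opens
    private int nextSegment = 0;

    SegmentsFileModel(int maxBuffered) {
        if (maxBuffered < 1) throw new IllegalArgumentException("maxBuffered must be >= 1");
        this.maxBuffered = maxBuffered;
    }

    void addDocument(String doc) {
        ramBuffer.add(doc);
        if (ramBuffer.size() >= maxBuffered) flushToDisk();
    }

    /** Writes the buffered documents as a new segment, without registering it. */
    void flushToDisk() {
        if (ramBuffer.isEmpty()) return;
        filesOnDisk.put("_" + nextSegment++, ramBuffer.size());
        ramBuffer.clear();
    }

    /** Lists every segment currently on disk in the segments file. */
    void registerSegments() {
        segmentsFile.clear();
        segmentsFile.addAll(filesOnDisk.keySet());
    }

    /** A crash discards whatever is still in memory; files on disk stay as they are. */
    void crash() {
        ramBuffer.clear();
    }

    /** Documents physically present on disk, registered or not. */
    int docsOnDisk() {
        int total = 0;
        for (int count : filesOnDisk.values()) total += count;
        return total;
    }

    /** Documents a reader would see: only those in registered segments. */
    int visibleDocs() {
        int total = 0;
        for (String name : segmentsFile) total += filesOnDisk.get(name);
        return total;
    }

    public static void main(String[] args) {
        SegmentsFileModel index = new SegmentsFileModel(10);
        for (int i = 0; i < 25; i++) index.addDocument("doc" + i);
        index.crash(); // 5 buffered documents are gone, 20 sit in unregistered segments
        System.out.println("visible=" + index.visibleDocs() + " onDisk=" + index.docsOnDisk());
    }
}
```

In this model the crash leaves 20 documents on disk but none visible, which resembles the reported symptom of a large index directory that reads as nearly empty. The buffered documents account for only a handful of the losses; the unregistered segments account for the rest.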
