Posted to dev@lucene.apache.org by "Trejkaz (JIRA)" <ji...@apache.org> on 2005/10/26 02:13:55 UTC

[jira] Created: (LUCENE-458) Merging may create duplicates if the JVM crashes half way through

Merging may create duplicates if the JVM crashes half way through
-----------------------------------------------------------------

Key: LUCENE-458
URL: http://issues.apache.org/jira/browse/LUCENE-458
Project: Lucene - Java
Type: Bug
Versions: 1.4
Environment: Windows XP SP2, JDK 1.5.0_04 (crash occurred in this version. We've updated to 1.5.0_05 since, but discovered this issue with an older text index since.)

Reporter: Trejkaz

In the past, our indexing process crashed due to a Hotspot compiler bug on SMP systems (although it could happen with any bad native code.) Everything picked up and appeared to work, but now that it's a month later I've discovered an oddity in the text index.

We have two documents which are identical in the text index. I know we only stored the document once, for two reasons. First, we store the MD5 of every document into the hash and the MD5s were the same. Second, we store a GUID into each document which is generated uniquely for each document. The GUID and the MD5 hash on these two documents, as well as all other fields, are exactly the same.

My conclusion is that a merge was occurring at the point the JVM crashed, which is consistent with the time the process crashed. Is it possible that Lucene did the copy of this document to the new location, and didn't get to delete the original?

If so, I guess this issue should be prevented somehow.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-458) Merging may create duplicates if the JVM crashes half way through

Posted by "paul.elschot (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/LUCENE-458?page=comments#action_12355924 ] 

paul.elschot commented on LUCENE-458:
-------------------------------------

Even if you use this order:
- open a reader
- delete the original document on the reader
- close the reader
- open a writer
- add the new copy of the doc on the writer
- close the writer
things may still go wrong if the crash happens around the
time of closing the reader and opening the writer.

The problem is that if the deletion information does not make
it to the operating system, there is no way to recover it.
Also, Java does not guarantee that things make it to disk.
They normally do eventually, but this is outside the control of the JVM.

If you need more certainty about the deletion, you need
to use an operating system command that forces all buffers to disk
(for example the Unix sync command) after closing
the reader on which the document was deleted.
But even then your disk may crash...
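
A narrower alternative to a system-wide sync is to request a flush for
a single file from Java itself. The following is a minimal sketch,
assuming a plain file rather than a Lucene index: FileChannel.force(true)
asks the operating system to write the file's content and metadata to
the storage device. The names DurableWrite and writeAndSync are made up
for illustration, the Files/Path helpers are newer than the JDK 1.5
discussed in this issue, and nothing here is what Lucene 1.4 does
internally. A write cache in the drive can still defeat it.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Lucene.
class DurableWrite {

    /** Writes data to path and requests a flush to the storage device before returning. */
    static void writeAndSync(Path path, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);           // may write fewer bytes than requested, so loop
            }
            ch.force(true);              // true = flush file metadata as well as content
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("deletions", ".bin");
        writeAndSync(tmp, "doc 42 deleted".getBytes(StandardCharsets.UTF_8));
        System.out.println("bytes on file: " + Files.size(tmp));
        Files.delete(tmp);
    }
}
```

Forcing each file individually is cheaper than syncing every buffer on
the machine, but it only helps if the application controls every file
that carries the deletion information.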

Regards,
Paul Elschot




[jira] Resolved: (LUCENE-458) Merging may create duplicates if the JVM crashes half way through

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch resolved LUCENE-458.
----------------------------------

    Resolution: Duplicate

The problem here apparently is that when the JVM crashed, not all files were properly synced with the file system.
This seems to be a similar problem to LUCENE-1044. 



[jira] Commented: (LUCENE-458) Merging may create duplicates if the JVM crashes half way through

Posted by "paul.elschot (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/LUCENE-458?page=comments#action_12356051 ] 

paul.elschot commented on LUCENE-458:
-------------------------------------

I don't think that Lucene supports marking a document as in step 3.

If the concern is that the deleted document would be temporarily
invisible to searches, it is possible to keep another reader open for searching
while the updates are going on, and this reader will see no changes at all.

Lucene does this by never changing an index segment. It only adds
deletion bits, and these are taken into account when segments are
merged to add docs or to optimize the index. Also, the deletion bits added
by reader B are ignored by reader A if they were not present
when A was opened.
I don't know what happens when a segment has deletion bits when reader A
is opened, and reader B that was opened later deletes more documents.
This situation can be avoided by only deleting documents on optimized indexes.
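
The snapshot behaviour described above can be illustrated with a toy
model. This is only a sketch under stated assumptions: SegmentSketch and
its Reader are hypothetical names, the document list stands in for an
immutable segment, and a java.util.BitSet stands in for the deletion
bits. It is not Lucene's implementation, and it does not try to model
the case of two readers deleting at the same time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;

// Hypothetical toy model, not a Lucene class.
class SegmentSketch {
    private final List<String> docs;        // segment content, never modified
    private final BitSet deleted = new BitSet(); // published deletion bits

    SegmentSketch(List<String> docs) {
        this.docs = Collections.unmodifiableList(new ArrayList<>(docs));
    }

    /** A reader copies the deletion bits that exist at the moment it is opened. */
    Reader openReader() {
        return new Reader((BitSet) deleted.clone());
    }

    class Reader {
        private final BitSet snapshot;

        Reader(BitSet snapshot) {
            this.snapshot = snapshot;
        }

        boolean isVisible(int docId) {
            return docId >= 0 && docId < docs.size() && !snapshot.get(docId);
        }

        /** Marks a document deleted in this reader only. */
        void delete(int docId) {
            snapshot.set(docId);
        }

        /** Publishes this reader's deletion bits so that readers opened later see them. */
        void close() {
            deleted.or(snapshot);
        }
    }

    public static void main(String[] args) {
        SegmentSketch segment = new SegmentSketch(Arrays.asList("doc0", "doc1"));
        Reader a = segment.openReader();    // searching reader
        Reader b = segment.openReader();    // deleting reader
        b.delete(0);
        b.close();
        Reader c = segment.openReader();    // opened after the deletion was published
        System.out.println("A sees doc0: " + a.isVisible(0));
        System.out.println("C sees doc0: " + c.isVisible(0));
    }
}
```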

Anyway, to see the changes after all updates: close the searching reader and
reopen another one to search on the updated index.
Before opening the new reader for searching, but after closing the
last reader/writer that changed the index, there is an opportunity
to sync the disk(s).

Regards,
Paul Elschot




[jira] Commented: (LUCENE-458) Merging may create duplicates if the JVM crashes half way through

Posted by "Trejkaz (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/LUCENE-458?page=comments#action_12356029 ] 

Trejkaz commented on LUCENE-458:
--------------------------------

I was thinking more along the lines of...

1. open a reader, writer
2. read the document
3. write a marker marking that this document is the result of a move of another one
4. write the document
5. delete the original document
6. delete the marker
7. close the reader, writer

Then later on, when a reader opens an index and finds a marker, it checks the location the marker points at, and if the original document is still there, it resumes from step 5.
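
The steps above can be sketched as a small in-memory simulation. This is only a sketch under stated assumptions: MarkerMove, slots, markers and the stopAfterStep parameter are made-up names, a map stands in for the index, and every step is assumed to reach disk atomically and in order, which is the very thing a real crash may not honour. Dropping a marker whose copy was never written (a crash between steps 3 and 4) is an addition of this sketch, not part of the proposal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of the marker proposal, not a Lucene API.
class MarkerMove {
    /** Slot number -> document content; stands in for the index. */
    final Map<Integer, String> slots = new HashMap<>();
    /** Destination slot -> source slot of a move that has not finished. */
    final Map<Integer, Integer> markers = new HashMap<>();

    /** Moves a document; stopAfterStep simulates a crash right after that step (7 = no crash). */
    void move(int from, int to, int stopAfterStep) {
        String doc = slots.get(from);       // step 2: read the document
        markers.put(to, from);              // step 3: write the marker
        if (stopAfterStep == 3) return;
        slots.put(to, doc);                 // step 4: write the document
        if (stopAfterStep == 4) return;
        slots.remove(from);                 // step 5: delete the original
        if (stopAfterStep == 5) return;
        markers.remove(to);                 // step 6: delete the marker
    }

    /** Run when the index is opened: finishes or discards every unfinished move. */
    void recover() {
        for (Map.Entry<Integer, Integer> m : new HashMap<>(markers).entrySet()) {
            int to = m.getKey();
            int from = m.getValue();
            if (slots.containsKey(to) && slots.containsKey(from)) {
                slots.remove(from);         // resume at step 5: the duplicate is removed
            }
            // If the copy was never written, the original is the only copy and is kept.
            markers.remove(to);             // step 6
        }
    }

    public static void main(String[] args) {
        MarkerMove index = new MarkerMove();
        index.slots.put(1, "doc");
        index.move(1, 2, 4);                // crash after the copy, before the delete
        System.out.println("copies before recovery: " + index.slots.size());
        index.recover();
        System.out.println("copies after recovery: " + index.slots.size());
    }
}
```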

