Posted to dev@lucene.apache.org by "Karl Wettin (JIRA)" <ji...@apache.org> on 2007/05/11 19:20:15 UTC

[jira] Created: (LUCENE-879) Document number integrity merge policy

Document number integrity merge policy
--------------------------------------

                 Key: LUCENE-879
                 URL: https://issues.apache.org/jira/browse/LUCENE-879
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Store
    Affects Versions: 2.1
            Reporter: Karl Wettin
            Priority: Minor
         Attachments: LUNCENE-879.diff

This patch allows document numbers to stay the same even after a merge of segments with deletions.

The consumer needs to do this:
indexWriter.setSkipMergingDeletedDocuments(false);

The effect is that deleted documents are replaced by a new Document() in the merged segment, but not marked as deleted. This should probably be some policy thingy that allows for different solutions, such as keeping the old document, etc.

Also see http://www.nabble.com/optimization-behaviour-tf3723327.html#a10418880


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-879) Document number integrity merge policy

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495257 ] 

Karl Wettin commented on LUCENE-879:
------------------------------------

Nicolas Lalevée [12/May/07 01:16 AM]
> Karl, in your application, you store nothing in Lucene, do you?
> Does it cost so much to just store an ID field in Lucene?

I have no clue how many CPU ticks or bits of RAM this might save me; I'll have to bench that later on. This is just me fooling around with technology solutions for fun, a proof of concept. There is no real project.

But it is not the cost that concerns me. It is having the data spread around different layers. I want to use BDB as object storage, not Lucene.


[jira] Commented: (LUCENE-879) Document number integrity merge policy

Posted by "Nicolas Lalevée (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495310 ] 

Nicolas Lalevée commented on LUCENE-879:
----------------------------------------

That is what I was talking about: storing in Lucene just an ID referencing the data in another storage. This Lucene-stored ID then becomes the document ID you are trying to fix.

I have also done some experimentation with making the storage external, but I realized that what I was coding was exactly the same as storing an ID in Lucene. But I didn't try to "fix" the document ID.


[jira] Commented: (LUCENE-879) Document number integrity merge policy

Posted by "Nicolas Lalevée (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495239 ] 

Nicolas Lalevée commented on LUCENE-879:
----------------------------------------

Karl, in your application, you store nothing in Lucene, do you?
Does it cost so much to just store an ID field in Lucene?



[jira] Updated: (LUCENE-879) Document number integrity merge policy

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-879:
-------------------------------

    Attachment: LUNCENE-879.diff


[jira] Updated: (LUCENE-879) Document number integrity merge policy

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-879:
-------------------------------

    Attachment: LUNCENE-879.diff

This new patch allows the consumer, based on a primary key, to delete a document and add a new document with the same document number as the deleted one. These events occur during merging.


[jira] Commented: (LUCENE-879) Document number integrity merge policy

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495145 ] 

Karl Wettin commented on LUCENE-879:
------------------------------------

Doron, thanks for the input. 

I have not yet had time to read and think through everything you wrote, but I will tell you what I'm doing and what I'm aiming at.

I use this patch in conjunction with an Oracle (Sleepycat) BDB object storage. The Lucene document number (LDN) is used as a secondary key. I do no unmarshalling to objects from data stored in Lucene fields; I only use Lucene as an index. I never have to read the document from Lucene. I have no clue how many CPU ticks or bits of RAM this might save me; I'll have to bench that later on. This is just me fooling around with technology solutions for fun, a proof of concept. There is no real project.

When I update an instance in the object storage, I create a new document in Lucene, then update the LDN in the instance to be updated in the object storage, then delete the old document in Lucene.
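
The update sequence described above can be sketched without any Lucene or BDB dependency. This is only an illustration under stated assumptions: the class LdnUpdateSketch, its list-backed "index" and its map-backed "object storage" are hypothetical stand-ins, not code from the patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in classes; nothing here is Lucene or BDB API.
class LdnUpdateSketch {
    // Stand-in for the index: list position = document number, null = deleted.
    final List<String> index = new ArrayList<>();
    // Stand-in for the object storage: primary key -> LDN kept in the stored instance.
    final Map<String, Integer> ldnByKey = new HashMap<>();

    void insert(String key, String text) {
        index.add(text);
        ldnByKey.put(key, index.size() - 1);
    }

    void update(String key, String newText) {
        int oldLdn = ldnByKey.get(key);
        index.add(newText);                   // 1. create a new document in the index
        ldnByKey.put(key, index.size() - 1);  // 2. point the stored instance at the new LDN
        index.set(oldLdn, null);              // 3. delete the old document
    }

    public static void main(String[] args) {
        LdnUpdateSketch s = new LdnUpdateSketch();
        s.insert("a", "first");
        s.insert("b", "second");
        s.update("a", "first v2");
        System.out.println(s.ldnByKey.get("a")); // 2
        System.out.println(s.index);             // [null, second, first v2]
    }
}
```

The sketch shows why the secondary key has to be rewritten on every update: the updated instance always ends up under a new document number.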

Even though it works, I do not like this solution. I want to fully retain the document number integrity for updated documents. I believe this can be solved if I limit the guarantee to an index in an optimized state.

An instance of DocumentIdentityFactory, capable of identifying documents and creating queries that uniquely identify them, will be passed to the SegmentMerger. It might look at the fields "_type" and "_pk", or so.

When SegmentMerger.mergeFields reaches a deleted document, it will use the factory to find replacements for the deleted document in the index. The one with the highest document number is the latest one and thus the winner. This document will be added at the current position, and its own document number will be added to a list of document numbers to treat as deleted.

Ta-da, and there we have safe(tm) document numbers.
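
A minimal sketch of the replacement scheme described above, in plain Java with no Lucene dependency. The names ReplacingMergeSketch and Doc are hypothetical, and a plain string stands in for the "_pk" field. The thread does not say what happens to the slots vacated by the winners; this sketch assumes they become empty slots and that trailing empty slots are dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation of the proposed merge; not the patch itself.
class ReplacingMergeSketch {
    /** One pre-merge slot; pk plays the role of the "_pk" field. */
    static final class Doc {
        final String pk;
        final String body;
        final boolean deleted;
        Doc(String pk, String body, boolean deleted) {
            this.pk = pk;
            this.body = body;
            this.deleted = deleted;
        }
    }

    /** Returns merged bodies by document number; null marks an empty slot. */
    static List<String> merge(List<Doc> docs) {
        Set<Integer> treatAsDeleted = new HashSet<>();
        List<String> merged = new ArrayList<>();
        for (int n = 0; n < docs.size(); n++) {
            Doc doc = docs.get(n);
            if (treatAsDeleted.contains(n)) { merged.add(null); continue; } // already moved
            if (!doc.deleted) { merged.add(doc.body); continue; }           // keeps its number
            // Deleted: the live document with the same key and the highest number wins.
            String replacement = null;
            for (int m = docs.size() - 1; m > n; m--) {
                Doc c = docs.get(m);
                if (!c.deleted && !treatAsDeleted.contains(m) && c.pk.equals(doc.pk)) {
                    replacement = c.body;
                    treatAsDeleted.add(m);
                    break;
                }
            }
            merged.add(replacement); // stays null if no replacement exists
        }
        // Assumption: trailing empty slots can be dropped without renumbering anything.
        while (!merged.isEmpty() && merged.get(merged.size() - 1) == null) {
            merged.remove(merged.size() - 1);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("a", "A", false), new Doc("b", "B", true), new Doc("c", "C", false),
            new Doc("d", "D", true), new Doc("e", "E", false),
            new Doc("b", "B`", false), new Doc("d", "D`", false));
        System.out.println(merge(docs)); // [A, B`, C, D`, E]
    }
}
```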



[jira] Commented: (LUCENE-879) Document number integrity merge policy

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495103 ] 

Karl Wettin commented on LUCENE-879:
------------------------------------

I forgot to mention all the effects:

1. Replaces deleted documents with a new Document()
2. Stores a null term frequency vector
3. Sets norm to Similarity.encodeNorm(0f)




[jira] Commented: (LUCENE-879) Document number integrity merge policy

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495127 ] 

Doron Cohen commented on LUCENE-879:
------------------------------------

I skimmed through the patch and I understand that all terms and postings 
of deleted docs are discarded, and, instead, an empty doc is added.

I would like to comment on the idea behind this.

I think that this satisfies part of (some) applications' needs, 
assuming it is mainly document updates that cause deletions.

For example, assume 5 initial documents {A,B,C,D,E}, whose internal ids 
are {0,1,2,3,4} and are used as keys to the consumer's secondary storage.

Now, docs B and D are updated - so the internal ids would change.
As of now, they become:  {A:0, C:1, E:2, B`:3, D`:4}.
With this patch, I believe they would become:  {A:0, _:1, C:2, _:3, E:4, B`:5, D`:6}.

So, accessing the secondary storage now works nicely for the unchanged 
docs A, C, E, but the keys in the secondary storage have to be modified for the 
updated documents B and D.

This is probably not too bad, because the application updates the secondary 
storage anyhow, so why not update the access key at the same
time - especially if the application keeps track of the number of added documents.
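
The numbering in the example above can be reproduced with a small simulation. This is a sketch only: MergeNumberingSketch and its merge method are hypothetical, and the skipDeleted parameter merely mirrors the setSkipMergingDeletedDocuments flag from the issue description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of document numbering across a merge; not Lucene code.
class MergeNumberingSketch {
    /**
     * skipDeleted = true mimics the current behavior: deleted docs are dropped
     * and later docs are renumbered. skipDeleted = false mimics the patch: an
     * empty placeholder "_" keeps the slot, so the numbers stay the same.
     */
    static List<String> merge(List<String> docs, boolean[] deleted, boolean skipDeleted) {
        List<String> merged = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted[i]) {
                merged.add(docs.get(i));
            } else if (!skipDeleted) {
                merged.add("_");
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // B and D were updated: the old versions are deleted, B` and D` were appended.
        List<String> docs = Arrays.asList("A", "B", "C", "D", "E", "B`", "D`");
        boolean[] deleted = {false, true, false, true, false, false, false};
        System.out.println(merge(docs, deleted, true));  // [A, C, E, B`, D`]
        System.out.println(merge(docs, deleted, false)); // [A, _, C, _, E, B`, D`]
    }
}
```

The position in each printed list is the internal id after the merge, matching the two outcomes listed in the example.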

I like this idea, but can see a few issues:

1) Statistics are somewhat distorted - the docCount used in search-time 
    computations (idf) now (always) includes docs that were deleted. 

2) In the long run, the size of the norms grows, so more memory is used.
     Eventually a merge-and-clean/squeeze might be required, but I guess the 
     application can do that in a controlled and efficient manner, updating the 
     secondary storage ids at the same time.

How about a different - more external - approach, not changing the internal-ids 
behavior, but rather using payloads for storing external IDs, and, when opening a 
new reader, reading (once) these IDs into an int array that maps from
internal IDs to application IDs. This information is then readily available 
at search time for referencing the secondary repository. Having these IDs as 
payloads should allow loading them relatively fast, so hopefully warming a new 
reader would not be too slow as a result of this. That was part 1 of the price of this 
approach. Part 2 is the memory taken for the IDs - 4 bytes per doc per reader.
Part 3 is the complexity of using this, but I haven't thought about the API yet.
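
A minimal sketch of the int-array mapping described above, assuming each document carries its external ID as a 4-byte big-endian payload. The class ExternalIdMapSketch and the byte[][] input are hypothetical stand-ins for whatever a reader would expose; no Lucene payload API is used or implied.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch: payloads are modeled as one byte[4] per internal doc number.
class ExternalIdMapSketch {
    /** Encodes an application ID as a 4-byte big-endian payload. */
    static byte[] encode(int externalId) {
        return ByteBuffer.allocate(4).putInt(externalId).array();
    }

    /** Read once when a reader is opened: index = internal ID, value = application ID. */
    static int[] loadMapping(byte[][] payloadPerDoc) {
        int[] externalIds = new int[payloadPerDoc.length];
        for (int doc = 0; doc < payloadPerDoc.length; doc++) {
            externalIds[doc] = ByteBuffer.wrap(payloadPerDoc[doc]).getInt();
        }
        return externalIds;
    }

    public static void main(String[] args) {
        // Internal order after an ordinary merge: {A:0, C:1, E:2, B`:3, D`:4};
        // the application IDs 100..104 are made up for the illustration.
        byte[][] payloads = {encode(100), encode(102), encode(104), encode(101), encode(103)};
        int[] map = loadMapping(payloads);
        System.out.println(Arrays.toString(map)); // [100, 102, 104, 101, 103]
        System.out.println(map.length * 4);       // 20 bytes for this reader
    }
}
```

With such a mapping the internal IDs are free to change on every merge; only the array has to be rebuilt when a new reader is opened.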

Doron
