Posted to java-user@lucene.apache.org by Brandon Mintern <mi...@easyesi.com> on 2011/12/07 21:46:20 UTC

Split mutable logical document into two Lucene documents

We have a document tagging system where documents are composed of two
types of data:

Rarely changed (hereafter: "immutable") data - document text and
metadata that we upload and almost never change. The text can be
hundreds of pages.

User-created (hereafter: "mutable") data - document properties that
are set by users of our system. In total, a document's properties are
generally several dozen bytes at most. Even viewing a document changes
the data (e.g. the document's "viewed" property).


At present, all data is part of a single Lucene document. The problem
is that when any piece of mutable data is updated (this happens
relatively frequently), we have to reindex the entire document. We'd
like to have two separate indexed Lucene documents per logical
document, one containing the immutable data and the other containing
the much smaller and more transient mutable data. When the mutable
data changes, we can delete that document's mutable Lucene document
and index a new one very quickly.

There are two major difficulties when actually performing a search, though:

1. We are providing complex queries to retrieve logical documents
based on information in either of their Lucene documents. It seems
non-trivial to fetch a logical document with a BooleanQuery whose
Occur.MUST clauses refer to fields in both of its Lucene documents.

2. We need to sort results (logical document IDs) based on fields in
either of their Lucene documents.
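
To make difficulty 2 concrete, here is a minimal sketch in plain Java
(no Lucene calls). It assumes the sort values for a field have already
been loaded into a map keyed by logical document ID, whichever of the
two Lucene documents the field lives in. The class name LogicalDocSort
and the method sortByField are hypothetical, made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: orders logical document IDs by a per-ID field value. */
final class LogicalDocSort {
    private LogicalDocSort() {}

    /**
     * Returns a sorted copy of ids. fieldValues maps a logical ID to the
     * sort value of one field (assumed to be loaded from either the
     * immutable or the mutable Lucene document). IDs with no value sort
     * last; ties are broken by the ID itself so the order is stable.
     */
    static List<Long> sortByField(List<Long> ids, Map<Long, String> fieldValues) {
        Comparator<Long> byField = Comparator.comparing(
                (Long id) -> fieldValues.get(id),
                Comparator.nullsLast(Comparator.<String>naturalOrder()));
        List<Long> sorted = new ArrayList<>(ids);
        sorted.sort(byField.thenComparing(Comparator.<Long>naturalOrder()));
        return sorted;
    }
}
```

The same call works for a field from either side, which is the point:
the caller only has to know which map to pass in, not which Lucene
document the field was indexed in.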

Has anyone done anything like this before? Is there functionality I'm
overlooking that could make this easier?

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Split mutable logical document into two Lucene documents

Posted by Brandon Mintern <mi...@easyesi.com>.
Thank you for the pointer. I looked into nested documents, but it
appears that the implementation relies on each parent document being
indexed immediately before all of its children. Unfortunately, this
presents two problems:

1. Any optimize operation will break nesting
2. Deleting and reindexing a child would break the parent-child
hierarchy unless the parent was reindexed as well. Since this is the
problem we're trying to solve in the first place, this doesn't seem to
get us where we need to be.

We also looked at ParallelReader, but that requires that the
immutable/mutable pair be added at exactly the same position in
separate indexes. This is very brittle for our use, and it would
require either rebuilding the entire mutable index just to change a
single value or reindexing both the mutable and immutable information.
Neither option is better than just keeping the mutable and immutable
data together.

We could probably do some of this with filters, but I think it will be
easier and more flexible for us to have simple Lucene queries return a
sorted list of document IDs (our full document identifier) and then
perform set union, set intersection, and set inversion ourselves.
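
A rough sketch of those set operations in plain Java, assuming each
simple query has already produced an ascending, duplicate-free list of
our logical document IDs. The class and method names are hypothetical,
and "inversion" is taken here to mean "every known ID that is not in
the given list", which needs the full ID list as a second input.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: merge-style set operations over ascending ID lists. */
final class SortedIdSets {
    private SortedIdSets() {}

    /** IDs present in both lists, e.g. two MUST clauses answered separately. */
    static List<Long> intersect(List<Long> a, List<Long> b) {
        List<Long> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Long.compare(a.get(i), b.get(j));
            if (cmp == 0) { out.add(a.get(i)); i++; j++; }
            else if (cmp < 0) i++;
            else j++;
        }
        return out;
    }

    /** IDs present in either list, e.g. two SHOULD clauses. */
    static List<Long> union(List<Long> a, List<Long> b) {
        List<Long> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            if (j == b.size()) out.add(a.get(i++));
            else if (i == a.size()) out.add(b.get(j++));
            else {
                int cmp = Long.compare(a.get(i), b.get(j));
                if (cmp == 0) { out.add(a.get(i)); i++; j++; }
                else if (cmp < 0) out.add(a.get(i++));
                else out.add(b.get(j++));
            }
        }
        return out;
    }

    /** IDs in allIds that are not in excluded, e.g. a MUST_NOT clause. */
    static List<Long> invert(List<Long> allIds, List<Long> excluded) {
        List<Long> out = new ArrayList<>();
        int j = 0;
        for (Long id : allIds) {
            // Skip excluded IDs smaller than the current one.
            while (j < excluded.size() && excluded.get(j) < id) j++;
            if (j == excluded.size() || !excluded.get(j).equals(id)) out.add(id);
        }
        return out;
    }
}
```

Each operation is a single pass over its inputs, so the cost is linear
in the number of IDs returned by the underlying queries. The output of
every method is again ascending and duplicate-free, so the calls can be
nested to mirror a boolean query tree.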

Thanks for your time,
Brandon

On Thu, Dec 8, 2011 at 9:57 AM, Ian Lea <ia...@gmail.com> wrote:
> It is conceivable that nested documents might help.
> https://issues.apache.org/jira/browse/LUCENE-2454.  I don't know
> anything about that so might be way off target.
>
>
> --
> Ian.


Re: Split mutable logical document into two Lucene documents

Posted by Ian Lea <ia...@gmail.com>.
It is conceivable that nested documents might help.
https://issues.apache.org/jira/browse/LUCENE-2454.  I don't know
anything about that so might be way off target.


--
Ian.

