Posted to java-user@lucene.apache.org by Jong Kim <jo...@gmail.com> on 2012/04/23 17:31:58 UTC

Re-indexing a particular field only without re-indexing the entire enclosing document in the index

Hi,

I'm sure that this is a very common use case and that hundreds of people
have probably asked the same question in the past, but I haven't been able
to find an exact answer to my question.

I have a system where each document in the Lucene index comprises at
least one field containing a very large number of terms (for example, the
entire text of potentially very large text files) and another metadata
field that is much smaller. The first field is rarely modified and hence
remains mostly static, while the second field is modified very frequently.

Currently, I'm re-indexing the entire Lucene document whenever the value of
the second field changes on the source side. Needless to say, this yields a
very inefficient system, because a significant amount of system resources
is wasted re-indexing what has not changed.

Is there any good way to solve this design problem? Obviously, an
alternative design would be to split the index into two, and maintain the
static (and large) data in one index and the dynamic part in the other
index. However, this approach is not acceptable due to our data pattern,
where a match on the first index yields a very large result set, and
filtering it against the second index is very inefficient due to the high
ratio of disjoint data. In other words, while the alternative approach
significantly reduces the indexing-time overhead, the resulting search is
unacceptably expensive.

Any design help would be highly appreciated.

Thanks
/Jong

Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index

Posted by Andrzej Bialecki <ab...@getopt.org>.

On 25/04/2012 13:58, Erick Erickson wrote:
> There's no update-in-place, currently you _have_ to re-index the
> entire document.
>
> But to the original question:
>
> There is a "limited join" capability you might investigate that would
> allow you to split up the textual data and metadata into two different
> documents and join them. I don't know how well it scales, but it may
> fit your needs.
>
> It turns out that update-in-place is more than a bit difficult given the
> nature of the inverted index. There are some proposals for addressing
> this, but nothing has gotten beyond the design stage as far as I know.

LUCENE-3837, to be specific. But as you said, it's still early and there 
is no code yet to speak of...

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index

Posted by Erick Erickson <er...@gmail.com>.

There's no update-in-place, currently you _have_ to re-index the
entire document.

But to the original question:

There is a "limited join" capability you might investigate that would
allow you to split up the textual data and metadata into two different
documents and join them. I don't know how well it scales, but it may
fit your needs.

It turns out that update-in-place is more than a bit difficult given the
nature of the inverted index. There are some proposals for addressing
this, but nothing has gotten beyond the design stage as far as I know.

Best
Erick

On Wed, Apr 25, 2012 at 3:07 AM, Torsten Krah
<tk...@fachschaft.imn.htwk-leipzig.de> wrote:
> On Tuesday, 24.04.2012, at 21:57 +0530, KARTHIK SHIVAKUMAR wrote:
>> A simple technique is to use "Update Index" for the dynamic data column
>> rather than re-indexing the whole document.
>
> Just for interest, how do you do that?

Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index

Posted by Torsten Krah <tk...@fachschaft.imn.htwk-leipzig.de>.

On Thursday, 26.04.2012, at 09:46 +0530, KARTHIK SHIVAKUMAR wrote:
> Then delete it and insert the fresh document alone.

But that is not an "update" in the sense of the question - that is a
complete reindex of the original document. The original question was
whether a field of a doc can be updated without reindexing the complete
doc. AFAIK this can't be done, and Erick confirmed this too - so I wonder
how you did that; that's why I was asking ;-).

Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index

Posted by KARTHIK SHIVAKUMAR <ns...@gmail.com>.

Hi

>>"Update Index"  for the dynamic data

I have done this in the past. It worked for me a long time ago.

All you need is a piece of code that searches for and finds the specific
doc within the index (probably using the unique name of the document),
then deletes it and inserts the fresh document alone.

All of this needs to be done iteratively for a large set of docs.
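
The delete-then-insert approach described above can be sketched with a toy
in-memory inverted index. This is plain Java with no Lucene dependency; the
class ToyIndex and all of its members are hypothetical names made up for
illustration, not Lucene's data structures. The sketch is only meant to show
why this kind of "update" is really a full re-index: replacing a document
re-tokenizes every field, including the large one that did not change. In
Lucene itself, IndexWriter.updateDocument(Term, Document) performs the same
delete-then-add in one call.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: "field:term" -> set of unique document keys.
// Hypothetical illustration only; this is not Lucene's structure or API.
class ToyIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();
    // Running count of analyzed tokens, to make the re-indexing cost visible.
    long tokensIndexed = 0;

    // Tokenize every field of the document and add it to the postings lists.
    void add(String key, Map<String, String> fields) {
        for (Map.Entry<String, String> field : fields.entrySet()) {
            for (String token : field.getValue().toLowerCase().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                postings.computeIfAbsent(field.getKey() + ":" + token,
                        k -> new TreeSet<>()).add(key);
                tokensIndexed++;
            }
        }
    }

    // Remove the document from every postings list it appears in.
    void delete(String key) {
        postings.values().forEach(docs -> docs.remove(key));
        postings.values().removeIf(Set::isEmpty);
    }

    // "Update" = delete by unique key, then add the whole fresh document.
    void update(String key, Map<String, String> fields) {
        delete(key);
        add(key, fields);
    }

    // Return the keys of all documents containing the term in the field.
    Set<String> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term.toLowerCase(),
                new TreeSet<>());
    }
}
```

Changing only a small "status" field through update() still increases
tokensIndexed by the token count of the whole document, which is the cost
the original question is trying to avoid.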

with regards
karthik

On Wed, Apr 25, 2012 at 12:37 PM, Torsten Krah <
tkrah@fachschaft.imn.htwk-leipzig.de> wrote:

> On Tuesday, 24.04.2012, at 21:57 +0530, KARTHIK SHIVAKUMAR wrote:
> > A simple technique is to use "Update Index" for the dynamic data column
> > rather than re-indexing the whole document.
>
> Just for interest, how do you do that?
>

-- 
*N.S.KARTHIK
R.M.S.COLONY
BEHIND BANK OF INDIA
R.M.V 2ND STAGE
BANGALORE
560094*

Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index

Posted by Torsten Krah <tk...@fachschaft.imn.htwk-leipzig.de>.

On Tuesday, 24.04.2012, at 21:57 +0530, KARTHIK SHIVAKUMAR wrote:
> A simple technique is to use "Update Index" for the dynamic data column
> rather than re-indexing the whole document.

Just for interest, how do you do that?

Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index

Posted by KARTHIK SHIVAKUMAR <ns...@gmail.com>.

Hi

A simple technique is to use "Update Index" for the dynamic data column
rather than re-indexing the whole document.

with regards
karthik

Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index

Posted by Brandon Mintern <mi...@easyesi.com>.

On Mon, Apr 23, 2012 at 1:25 PM, Jong Kim <jo...@gmail.com> wrote:
> Thanks for the reply.
>
> Our metadata is not stored in a single field, but is rather a collection of
> fields. So, it requires a boolean search that spans multiple fields. My
> understanding is that it is not possible to iterate over the matching
> documents efficiently using termDocs() when the search involves multiple
> terms and/or multiple fields, right?
>
> /Jong

You can do this by defining your own hits Collector which simply pulls
the matching ID out of each result. Since searching the metadata index
returns fewer results, you could do something like this:

Two indexes:
LightWeight - stores metadata fields and document ID
HeavyWeight - stores static data and document ID

Search query:
1. Metadata portion: query LightWeight and retrieve all matching IDs
(NOT Lucene IDs, but your own stored document ID) in a gnu.trove
TIntSet

Now some queries won't even hit the HeavyWeight index, and you already
have your full match. If you need to match against the HeavyWeight index
as well:

2. Pass in the TIntSet as an argument to another Collector.
3. For each match in the HeavyWeight index, if it is also in the
TIntSet, add it to the final TIntSet result set. Otherwise ignore it.
4. After the collector has been visited by each match, the final
result set is your hits.

You now have the set of document IDs for the complete match. Using
primitives and lightweight objects, this isn't much worse than letting
Lucene do the collection.

Of course, this approach only works if the intersection between
metadata and big data is an AND relationship. If you need other logic,
step 3 above obviously changes.
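
The steps above can be sketched in plain Java. This is a minimal sketch
under stated assumptions, not Lucene code: java.util.BitSet stands in for
the gnu.trove TIntSet, the IdCollector interface is a hypothetical stand-in
for a Lucene Collector, and the application-level document IDs are assumed
to be small non-negative ints.

```java
import java.util.BitSet;

// Hypothetical stand-in for a Lucene Collector: it receives one
// application-level document ID (not a Lucene doc ID) per match.
interface IdCollector {
    void collect(int docId);
}

// Step 1: gather every ID matched by the LightWeight (metadata) query.
class AllIdsCollector implements IdCollector {
    final BitSet ids = new BitSet();

    @Override
    public void collect(int docId) {
        ids.set(docId);
    }
}

// Steps 2-4: keep a HeavyWeight match only if the metadata query
// matched the same ID (AND relationship); ignore it otherwise.
class IntersectingCollector implements IdCollector {
    private final BitSet allowed;
    final BitSet result = new BitSet();

    IntersectingCollector(BitSet allowed) {
        this.allowed = allowed;
    }

    @Override
    public void collect(int docId) {
        if (allowed.get(docId)) {
            result.set(docId);
        }
    }
}
```

After the HeavyWeight search has fed every match to the
IntersectingCollector, its result field holds the final set of IDs. For
logic other than AND, only the condition inside collect() would change.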

Another caveat is that if you are relying on Lucene to store and
return the full document for each query, this approach isn't the best
for fetching information out of Lucene. We use a standard relational
database for storing our data, we use Lucene to query for sets of
document IDs, and then we fetch the remaining document fields from our
DB (or in some cases, some information lives on S3, etc.).

Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index

Posted by Jong Kim <jo...@gmail.com>.

Thanks for the reply.

Our metadata is not stored in a single field, but is rather a collection of
fields. So, it requires a boolean search that spans multiple fields. My
understanding is that it is not possible to iterate over the matching
documents efficiently using termDocs() when the search involves multiple
terms and/or multiple fields, right?

/Jong

On Mon, Apr 23, 2012 at 11:58 AM, Earl Hood <ea...@earlhood.com> wrote:

> On Mon, Apr 23, 2012 at 10:31 AM, Jong Kim wrote:
>
> > Is there any good way to solve this design problem? Obviously, an
> > alternative design would be to split the index into two, and maintain the
> > static (and large) data in one index and the dynamic part in the other
> > index. However, this approach is not acceptable due to our data pattern,
> > where a match on the first index yields a very large result set, and
> > filtering it against the second index is very inefficient due to the high
> > ratio of disjoint data. In other words, while the alternative approach
> > significantly reduces the indexing-time overhead, the resulting search is
> > unacceptably expensive.
>
> Have you tested to verify it is expensive?  If the meta document is
> identified with a unique ID (that can be stored with the main document
> so you know which meta document to retrieve), accessing the meta
> document should be fairly efficient.
>
> In the project I'm on (we are using Lucene 3.0.3), we just use
> IndexReader.termDocs() to retrieve a document based on a unique ID we
> store in one of the document's fields.
>
> --ewh

Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index

Posted by Earl Hood <ea...@earlhood.com>.

On Mon, Apr 23, 2012 at 10:31 AM, Jong Kim wrote:

> Is there any good way to solve this design problem? Obviously, an
> alternative design would be to split the index into two, and maintain the
> static (and large) data in one index and the dynamic part in the other
> index. However, this approach is not acceptable due to our data pattern,
> where a match on the first index yields a very large result set, and
> filtering it against the second index is very inefficient due to the high
> ratio of disjoint data. In other words, while the alternative approach
> significantly reduces the indexing-time overhead, the resulting search is
> unacceptably expensive.

Have you tested to verify it is expensive?  If the meta document is
identified with a unique ID (that can be stored with the main document
so you know which meta document to retrieve), accessing the meta
document should be fairly efficient.

In the project I'm on (we are using Lucene 3.0.3), we just use
IndexReader.termDocs() to retrieve a document based on a unique ID we
store in one of the document's fields.

--ewh
