Posted to java-user@lucene.apache.org by Chris Lu <ch...@gmail.com> on 2007/03/26 17:58:26 UTC

Virtually merge two indexes?

Hi, Gurus,

One thing I want to do is: one index has fields like [primary-key,
not-so-frequently-updated-fields, large-content-fields,...], and
another index has [primary-key, frequently-updated-fields]. The
purpose is to make the indexing process faster by keeping large/stale
fields in one index and small/frequently updated fields in another,
linked via primary-key field.

If I do so, is it possible to keep the index search the same? Parallel
index reader may not cut it because it works only for different
Documents in different indexes. What I want is the same Document
spread across different indexes.

-- 
Chris Lu
-------------------------
Instant Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Virtually merge two indexes?

Posted by Chris Hostetter <ho...@fucit.org>.

: 1. When I use setBoost for document in index C, will that be counted in?

i don't know what "counted in" means .. are you asking how document
boosts affect ParallelReader? ... because i have no idea.

: 2. Does index A allow any deletion at all? If index A has some
: deletions, I suppose index C should also delete those after
: optimizing? But which deletion takes precedence?

you can do whatever you want in whatever order you want, as long as you
can guarantee the ParallelReader conditions -- if it were me, i would play
it safe and only add/delete from A just before optimizing and rebuilding
C (during that window, nothing about your queries will be very safe, so
you'll probably be doing it offline anyway)

: 3. If index A uses the compound file format, I suppose index C should also
: use it. When the compound file is created during optimizing, will the
: ordering stay unchanged?

nothing ever changes the relative order of documents, but adding or
optimizing docs can change the specific docIds by collapsing to fill in
the gaps from deletion during a segment merge ... i have no idea about
using Compound files.
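
A minimal sketch of that renumbering, in plain Java rather than against the Lucene API. The class name DocIdCompaction and the use of null for a deleted slot are assumptions made only for illustration; the point is that survivors keep their relative order while their positions (docIds) shift down.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model: a list position stands in for a docId,
// and a null entry stands in for a deleted document.
final class DocIdCompaction {

    /** Drops deleted slots and keeps the survivors in their original order. */
    static List<String> compact(List<String> slots) {
        List<String> merged = new ArrayList<>();
        for (String key : slots) {
            if (key != null) {
                merged.add(key);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // docId 1 has been deleted, leaving a gap.
        List<String> before = Arrays.asList("a", null, "b", "c");
        List<String> after = compact(before);
        // prints "b: 2 -> 1"
        System.out.println("b: " + before.indexOf("b") + " -> " + after.indexOf("b"));
    }
}
```

If two parallel indexes are compacted at different times, the same position can end up pointing at different documents, which is why both have to be kept in step.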

: I am also interested to know any real production usage of ParallelReader.

i can't help you there ... my knowledge on this topic is purely
theoretical, based on list discussions.



-Hoss


Re: Virtually merge two indexes?

Posted by Chris Lu <ch...@gmail.com>.

Thanks Chris Hostetter!

More questions:

1. When I use setBoost for document in index C, will that be counted in?

2. Does index A allow any deletion at all? If index A has some
deletions, I suppose index C should also delete those after
optimizing? But which deletion takes precedence?

3. If index A uses the compound file format, I suppose index C should also
use it. When the compound file is created during optimizing, will the
ordering stay unchanged?

I am also interested to know any real production usage of ParallelReader.

On 3/26/07, Chris Hostetter <ho...@fucit.org> wrote:
>
> : I think the better question could be, given a large/stale index A, a
> : small/updated index B, and the B does not satisfy the requirement of
> : ParallelReader. How can I create an index C that "add the same
> : documents in the same order of index A"?
>
> 1) optimize A so it has a single segment with no gaps in doc ids.
> 2) iterate over the docs in A, looking at their "unique key" field
>    -- FieldCache should be handy for this.
>    2.1) For each uniqueKey in A pull the corresponding data out of B and
>         add the doc to C
>
> ...the key here being that B need not be a lucene index, just something
> that provides fast lookup by your unique Key (like a database perhaps)
>
>
>
>
> -Hoss

Re: Virtually merge two indexes?

Posted by Daniel Noll <da...@nuix.com>.

Chris Hostetter wrote:
> : I think the better question could be, given a large/stale index A, a
> : small/updated index B, and the B does not satisfy the requirement of
> : ParallelReader. How can I create an index C that "add the same
> : documents in the same order of index A"?
> 
> 1) optimize A so it has a single segment with no gaps in doc ids.
> 2) iterate over the docs in A, looking at their "unique key" field
>    -- FieldCache should be handy for this.
>    2.1) For each uniqueKey in A pull the corresponding data out of B and
>         add the doc to C
> 
> ...the key here being that B need not be a lucene index, just something
> that provides fast lookup by your unique Key (like a database perhaps)

Now that you mention it...

Couldn't one make an IndexReader subclass which actually constructs fake 
TermDocs and TermPositions from the data in a database?  Then you could 
slot that implementation in anywhere which normally expects an 
IndexReader, and it would work as expected, plus you'd never have to 
worry about document IDs reordering themselves.

Daniel

-- 
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://nuix.com/                               Fax: +61 2 9212 6902

Re: Virtually merge two indexes?

Posted by Chris Hostetter <ho...@fucit.org>.

: I think the better question could be, given a large/stale index A, a
: small/updated index B, and the B does not satisfy the requirement of
: ParallelReader. How can I create an index C that "add the same
: documents in the same order of index A"?

1) optimize A so it has a single segment with no gaps in doc ids.
2) iterate over the docs in A, looking at their "unique key" field
   -- FieldCache should be handy for this.
   2.1) For each uniqueKey in A pull the corresponding data out of B and
        add the doc to C

...the key here being that B need not be a lucene index, just something
that provides fast lookup by your unique Key (like a database perhaps)
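
The steps above can be sketched roughly as follows, in plain Java rather than against the Lucene API. Everything named here is hypothetical: ParallelIndexBuilder, buildC, and the "primary-key" field name are illustrations, index A is reduced to its unique keys listed in docId order, and B is just a Map standing in for any store with fast lookup by key. Emitting a placeholder document for a key that B lacks is an assumption of this sketch, chosen so that A and C keep the same document count.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: A is its unique keys in docId order, B is a keyed
// store, and C is the list of documents to index next, in A's order.
final class ParallelIndexBuilder {

    /** Returns B's records arranged in A's document order. */
    static List<Map<String, String>> buildC(List<String> keysInAOrder,
                                            Map<String, Map<String, String>> b) {
        List<Map<String, String>> c = new ArrayList<>();
        for (String key : keysInAOrder) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("primary-key", key);
            // A key missing from B still gets a placeholder document,
            // so the document counts of A and C stay equal.
            Map<String, String> fields = b.get(key);
            if (fields != null) {
                doc.putAll(fields);
            }
            c.add(doc);
        }
        return c;
    }

    public static void main(String[] args) {
        List<String> a = List.of("k3", "k1", "k2");
        Map<String, Map<String, String>> b = Map.of(
                "k1", Map.of("votes", "10"),
                "k3", Map.of("votes", "7"));
        System.out.println(buildC(a, b));
    }
}
```

In a real setup each map in the returned list would be added to index C in list order, so that position i in C lines up with docId i in the optimized A.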




-Hoss


Re: Virtually merge two indexes?

Posted by Chris Lu <ch...@gmail.com>.

Hi, Steven,

> Although it's true that you would need to re-index your content for the
> frequently updated fields, you would *not* need to re-index the
> large/stale content index, as long as you keep constant the number of
> documents and the order in which you index them.
>

This seems good, but it is too strict and not practical for me. I also
doubt it's useful in any real practice...

I think the better question could be: given a large/stale index A and a
small/updated index B, where B does not satisfy the requirements of
ParallelReader, how can I create an index C that "adds the same
documents in the same order" as index A?


On 3/26/07, Steven Rowe <sa...@syr.edu> wrote:
> Hi Chris,
>
> Chris Lu wrote:
> > Hi, Steven,
> >
> > Thanks for the instant reply! But let's see the warning in the
> > ParallelReader javadoc:
> > "It is up to you to make sure all indexes are created and modified
> > the same way. For example, if you add documents to one index, you need
> > to add the same documents in the same order to the other indexes.
> > Failure to do so will result in undefined behavior."
> >
> > To follow the warning, I need to index all the content again. So
> > basically it defeats my original purpose to keep two indexes: to save
> > the indexing for the large/stale content.
>
> Although it's true that you would need to re-index your content for the
> frequently updated fields, you would *not* need to re-index the
> large/stale content index, as long as you keep constant the number of
> documents and the order in which you index them.
>
> Steve

Re: Virtually merge two indexes?

Posted by Steven Rowe <sa...@syr.edu>.

Hi Chris,

Chris Lu wrote:
> Hi, Steven,
> 
> Thanks for the instant reply! But let's see the warning in the
> ParallelReader javadoc:
> "It is up to you to make sure all indexes are created and modified
> the same way. For example, if you add documents to one index, you need
> to add the same documents in the same order to the other indexes.
> Failure to do so will result in undefined behavior."
> 
> To follow the warning, I need to index all the content again. So
> basically it defeats my original purpose to keep two indexes: to save
> the indexing for the large/stale content.

Although it's true that you would need to re-index your content for the
frequently updated fields, you would *not* need to re-index the
large/stale content index, as long as you keep constant the number of
documents and the order in which you index them.

Steve

Re: Virtually merge two indexes?

Posted by Chris Lu <ch...@gmail.com>.

Hi, Steven,

Thanks for the instant reply! But let's see the warning in the
ParallelReader javadoc:
 "It is up to you to make sure all indexes are created and modified
the same way. For example, if you add documents to one index, you need
to add the same documents in the same order to the other indexes.
Failure to do so will result in undefined behavior."

To follow the warning, I need to index all the content again. So
basically it defeats my original purpose to keep two indexes: to save
the indexing for the large/stale content.


On 3/26/07, Steven Rowe <sa...@syr.edu> wrote:
> I think ParallelReader, first released in Lucene-Java 1.9, should meet
> your needs:
>
> <http://lucene.apache.org/java/docs/api/org/apache/lucene/index/ParallelReader.html>
>
> -----
> An IndexReader which reads multiple, parallel indexes. Each index added
> must have the same number of documents, but typically each contains
> different fields. Each document contains the union of the fields of all
> documents with the same document number. When searching, matches for a
> query term are from the first index added that has the field.
>
> This is useful, e.g., with collections that have large fields which
> change rarely and small fields that change more frequently. The smaller
> fields may be re-indexed in a new index and both indexes may be searched
> together.
>
> Warning: It is up to you to make sure all indexes are created and
> modified the same way. For example, if you add documents to one index,
> you need to add the same documents in the same order to the other
> indexes. Failure to do so will result in undefined behavior.
> -----
>
> Steve
>
> Chris Lu wrote:
> > Hi, Gurus,
> >
> > One thing I want to do is: one index has fields like [primary-key,
> > not-so-frequently-updated-fields, large-content-fields,...], and
> > another index has [primary-key, frequently-updated-fields]. The
> > purpose is to make the indexing process faster by keeping large/stale
> > fields in one index and small/frequently updated fields in another,
> > linked via primary-key field.
> >
> > If I do so, is it possible to keep the index search the same? Parallel
> > index reader may not cut it because it works only for different
> > Documents into different indexes. What I want is the same Document
> > spread on different indexes.

Re: Virtually merge two indexes?

Posted by Steven Rowe <sa...@syr.edu>.

I think ParallelReader, first released in Lucene-Java 1.9, should meet
your needs:

<http://lucene.apache.org/java/docs/api/org/apache/lucene/index/ParallelReader.html>

-----
An IndexReader which reads multiple, parallel indexes. Each index added
must have the same number of documents, but typically each contains
different fields. Each document contains the union of the fields of all
documents with the same document number. When searching, matches for a
query term are from the first index added that has the field.

This is useful, e.g., with collections that have large fields which
change rarely and small fields that change more frequently. The smaller
fields may be re-indexed in a new index and both indexes may be searched
together.

Warning: It is up to you to make sure all indexes are created and
modified the same way. For example, if you add documents to one index,
you need to add the same documents in the same order to the other
indexes. Failure to do so will result in undefined behavior.
-----
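
To make the quoted description concrete, here is a small model of the "union of the fields of all documents with the same document number" idea. It is plain Java and deliberately not the Lucene API: ParallelView and its methods are hypothetical names, and each index is just a list of field maps. Letting the first index added win when two indexes share a field name is a simplification borrowed from the rule the javadoc states for query terms.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the javadoc's description; this is not the Lucene API.
final class ParallelView {
    private final List<List<Map<String, String>>> indexes = new ArrayList<>();

    /** Adds an index; every index must hold the same number of documents. */
    void add(List<Map<String, String>> index) {
        if (!indexes.isEmpty() && indexes.get(0).size() != index.size()) {
            throw new IllegalArgumentException(
                    "all indexes must have the same number of documents");
        }
        indexes.add(index);
    }

    /** Returns the union of the fields stored under one document number. */
    Map<String, String> document(int docId) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (List<Map<String, String>> index : indexes) {
            // putIfAbsent: the first index added that has a field wins.
            index.get(docId).forEach(merged::putIfAbsent);
        }
        return merged;
    }

    public static void main(String[] args) {
        ParallelView view = new ParallelView();
        view.add(List.of(Map.of("primary-key", "1", "content", "large, stale text")));
        view.add(List.of(Map.of("primary-key", "1", "votes", "42")));
        System.out.println(view.document(0));
    }
}
```

The model only lines documents up by position, which is exactly why the warning about adding the same documents in the same order matters.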

Steve

Chris Lu wrote:
> Hi, Gurus,
> 
> One thing I want to do is: one index has fields like [primary-key,
> not-so-frequently-updated-fields, large-content-fields,...], and
> another index has [primary-key, frequently-updated-fields]. The
> purpose is to make the indexing process faster by keeping large/stale
> fields in one index and small/frequently updated fields in another,
> linked via primary-key field.
> 
> If I do so, is it possible to keep the index search the same? Parallel
> index reader may not cut it because it works only for different
> Documents into different indexes. What I want is the same Document
> spread on different indexes.


Re: Virtually merge two indexes?

Posted by Chris Lu <ch...@gmail.com>.

Thanks!

I need to use fields from both indexes to calculate a final ranking.

For example, one index is for the major content, the other index has
the frequently-updated vote/score/popularity information.

Like you said, the ParallelReader seems like too much hassle to maintain. It
could be simpler if I store that extra information somewhere else and
use it to calculate the ranking directly.
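
A rough sketch of that alternative, in plain Java with no Lucene calls: hits come back from the content index with a text score, and the frequently updated votes live in a separate map keyed by primary key. The names ExternalRanking, Hit, finalScore and rerank are hypothetical, and the formula (text score scaled by 1 + ln(1 + votes)) is only an assumed example of combining the two signals.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: re-rank search hits with data kept outside the index.
final class ExternalRanking {

    /** One search hit: primary key plus the text-relevance score. */
    static final class Hit {
        final String key;
        final double textScore;

        Hit(String key, double textScore) {
            this.key = key;
            this.textScore = textScore;
        }
    }

    /** Assumed formula: text score scaled by a log-damped vote count. */
    static double finalScore(Hit hit, Map<String, Integer> votes) {
        int v = votes.getOrDefault(hit.key, 0);
        return hit.textScore * (1.0 + Math.log1p(v));
    }

    /** Returns the hit keys ordered by descending final score. */
    static List<String> rerank(List<Hit> hits, Map<String, Integer> votes) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> finalScore(h, votes)).reversed());
        List<String> keys = new ArrayList<>();
        for (Hit h : sorted) {
            keys.add(h.key);
        }
        return keys;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(new Hit("a", 1.0), new Hit("b", 0.8));
        System.out.println(rerank(hits, Map.of("b", 100)));
    }
}
```

The trade-off is that only the hits being re-ranked see the vote data, so a popular document that scores poorly on text alone may never reach the re-ranking step.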

On 3/26/07, Xiaocheng Luan <je...@yahoo.com> wrote:
> How will the indexes be searched? Do you need to search fields in both indexes? If the ParallelReader is not an attractive solution for you, finding a general solution may be difficult. Would it be possible to explore solutions that may work for your specific case?
>
> Just a thought.
> Xiaocheng
>
> Chris Lu <ch...@gmail.com> wrote: Hi, Gurus,
>
> One thing I want to do is: one index has fields like [primary-key,
> not-so-frequently-updated-fields, large-content-fields,...], and
> another index has [primary-key, frequently-updated-fields]. The
> purpose is to make the indexing process faster by keeping large/stale
> fields in one index and small/frequently updated fields in another,
> linked via primary-key field.
>
> If I do so, is it possible to keep the index search the same? Parallel
> index reader may not cut it because it works only for different
> Documents into different indexes. What I want is the same Document
> spread on different indexes.

Re: Virtually merge two indexes?

Posted by Xiaocheng Luan <je...@yahoo.com>.

How will the indexes be searched? Do you need to search fields in both indexes? If the ParallelReader is not an attractive solution for you, finding a general solution may be difficult. Would it be possible to explore solutions that may work for your specific case?

Just a thought.
Xiaocheng

Chris Lu <ch...@gmail.com> wrote: Hi, Gurus,

One thing I want to do is: one index has fields like [primary-key,
not-so-frequently-updated-fields, large-content-fields,...], and
another index has [primary-key, frequently-updated-fields]. The
purpose is to make the indexing process faster by keeping large/stale
fields in one index and small/frequently updated fields in another,
linked via primary-key field.

If I do so, is it possible to keep the index search the same? Parallel
index reader may not cut it because it works only for different
Documents into different indexes. What I want is the same Document
spread on different indexes.
