Posted to java-user@lucene.apache.org by Mike Cawson <mi...@yahoo.co.uk> on 2010/12/16 02:49:34 UTC

Where does Lucene recognise it has encountered a new term for the first time?

I’m using Lucene to index database records and text documents.

I want to provide efficient fuzzy queries over the data so I’m using a secondary 
Lucene index for all of the distinct terms encountered in the primary index.

Each ‘document’ in the secondary index is a term from the primary index with 
fields for its q-grams, phonetic key(s) and synonyms.
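
For readers unfamiliar with q-grams, here is a minimal sketch of how the q-gram
field values for one term could be generated. The class name QGrams, the choice
of unpadded grams, and the handling of terms shorter than q are assumptions of
this sketch, not details given in the post.

```java
import java.util.ArrayList;
import java.util.List;

class QGrams {
    // Returns the overlapping substrings of length q, in order of appearance.
    // Assumption of this sketch: a non-empty term no longer than q is returned
    // as a single gram, so short terms still get a field value.
    static List<String> of(String term, int q) {
        if (q <= 0) {
            throw new IllegalArgumentException("q must be positive");
        }
        List<String> grams = new ArrayList<>();
        if (term.length() <= q) {
            if (!term.isEmpty()) {
                grams.add(term);
            }
            return grams;
        }
        for (int i = 0; i + q <= term.length(); i++) {
            grams.add(term.substring(i, i + q));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Each gram would become one value of the q-gram field for this term.
        System.out.println(QGrams.of("lucene", 3));
    }
}
```

Each returned gram would be indexed as one value of the q-gram field in the
secondary document for that term.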

It’s easy to populate the secondary index after indexing all of the records and 
text documents using an IndexReader. However, to keep the secondary index up to 
date I need to recognise when new terms are encountered for the first time, but 
even looking deep into Lucene code and stepping through the indexing process 
hasn’t revealed where this occurs – I presume because it doesn’t happen in a 
single place but rather once in the in-memory term cache, once when the cache is 
flushed into a segment, and again when segments are optimised.

Is this correct? Can anyone suggest how to maintain a secondary index of terms? 
Perhaps only when the main index is optimised?

Thanks, Mike


      

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Where does Lucene recognise it has encountered a new term for the first time?

Posted by Li Li <fa...@gmail.com>.
I don't fully understand your problem, but knowing when a new term
occurs is hard, because when a new document is added it goes into a
new segment. I think you can only do this in the last merge of the
optimization stage. You can read the code in
SegmentMerger.mergeTermInfos(). It merges all the terms of the
segments being merged. Because terms are ordered by field name and
then by term text, it can merge them using very little memory.
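
To illustrate why sorted term lists can be merged with little memory, here is a
minimal sketch of a k-way merge that keeps only one cursor per segment. It is
not Lucene's SegmentMerger code: the class SortedTermMerge is hypothetical, and
plain strings stand in for the (field name, term text) pairs.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class SortedTermMerge {
    // One cursor per segment: its current term and the rest of its sorted terms.
    private static final class Cursor implements Comparable<Cursor> {
        String term;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.term = it.next();
        }

        @Override
        public int compareTo(Cursor other) {
            return term.compareTo(other.term);
        }
    }

    // Merges already-sorted per-segment term lists into one sorted list of
    // distinct terms. Apart from the output, only one cursor per segment is held.
    // The output is collected in a list for simplicity; a real merger would
    // write each term out as it is produced.
    static List<String> mergeDistinct(List<List<String>> segments) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<String> segment : segments) {
            Iterator<String> it = segment.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it));
            }
        }
        List<String> merged = new ArrayList<>();
        String last = null;
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            if (!c.term.equals(last)) {   // skip terms shared by several segments
                merged.add(c.term);
                last = c.term;
            }
            if (c.rest.hasNext()) {       // advance this segment and re-queue it
                c.term = c.rest.next();
                heap.add(c);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(mergeDistinct(List.of(
                List.of("apple", "lucene"),
                List.of("index", "lucene", "term"))));
    }
}
```

Because every input is already sorted, duplicates across segments always come
out of the heap next to each other, so comparing with the previous term is
enough to keep the output distinct.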
Or, if you need to know the new terms in the current segment while
building the index, FreqProxTermsWriterPerField.newTerm will be called
when a term occurs for the first time.
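
Note that a term that is new to the current segment may already exist in older
segments, so a per-segment hook still needs a check against the terms already
recorded. A minimal sketch of that check follows. NewTermFilter and admit are
hypothetical names, not Lucene API, and the in-memory set is an assumption
standing in for a lookup against the secondary index.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NewTermFilter {
    // Terms already recorded; stands in for a lookup in the secondary index.
    private final Set<String> known = new HashSet<>();

    // Returns the terms never seen before, in the order given, and remembers
    // them so that later segments do not report them again.
    List<String> admit(Collection<String> segmentTerms) {
        List<String> fresh = new ArrayList<>();
        for (String term : segmentTerms) {
            if (known.add(term)) {   // add() returns false for a known term
                fresh.add(term);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        NewTermFilter filter = new NewTermFilter();
        System.out.println(filter.admit(List.of("index", "lucene")));
        System.out.println(filter.admit(List.of("lucene", "term")));
    }
}
```

Only the terms returned by admit would need new documents in the secondary
index.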

