Posted to java-user@lucene.apache.org by tierecke <ni...@gmail.com> on 2007/08/06 01:40:37 UTC

docFreq takes long time to execute in a multiple index environment

Hi there,

I have 25 indexes of 1.8GB each, read through a MultiReader.
I am trying to get the document frequency of every term in specific documents,
and it takes quite a long time: a document with 1000 terms takes around
4:30 minutes to calculate the document frequencies of all its terms, and
there are longer documents than that.

Since I have quite a lot of documents to process (around 12000), it will take
forever.
My function for getting the document frequency is listed below (it handles one
single term, but it is called for every term in the document's term vector).

    public int getdocumentfrequency (String termstr) throws Exception
    {
        Term term=new Term("contents", termstr);
        TermEnum termenum=multireader.terms(term);
        int freq=termenum.docFreq();
        return freq;
    }

Is there a better (i.e. faster) way to get the document frequencies of all
the terms in a specific document?

thanks a lot,
Nir.

-- 
View this message in context: http://www.nabble.com/docFreq-takes-long-time-to-execute-in-a-multiple-index-environment-tf4221604.html#a12009334
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

indexing and searching in the same time

Posted by tierecke <ni...@gmail.com>.

Does Lucene allow searching and indexing simultaneously?

Yes. However, an IndexReader only searches the index as of the "point in
time" that it was opened. Any updates to the index, either added or deleted
documents, will not be visible until the IndexReader is re-opened. So your
application must periodically re-open its IndexReaders to see the latest
updates. The IndexReader.isCurrent() method allows you to test whether
any updates have occurred to the index since your IndexReader was opened.

[from Lucene FAQ]

I still need to speed this process up, though.



Re: You are right but it doesn't make it faster.

Posted by testn <te...@doramail.com>.

Does that mean you are already reusing the IndexReader without reopening it?
If you haven't done so, please try it out. docFreq() should be really quick.



Re: You are right but it doesn't make it faster.

Posted by Paul Elschot <pa...@xs4all.nl>.

Nir,

You can speed this up (maybe a lot) by moving the disk head(s)
as little as possible.

Have a look at the file formats of Lucene to get the idea.

In your outer loop iterate over the readers of the multireader.
For each reader iterate over the terms in sorted order.
And don't access the index in any other way while doing this,
that is, do no query searches and no updates.

With a bit of bookkeeping per term, it is then straightforward
to compute the total document frequencies.
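
The loop order described above can be sketched roughly as follows. This is
only an illustration, not code from the thread: SubIndex is a hypothetical
stand-in for one of the readers behind the MultiReader (in Lucene that role
would be played by an IndexReader and its docFreq() method), and Java's
natural String order is assumed to match the term order of a single field.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class TotalDocFreq {
    /** Hypothetical stand-in for one sub-index (one reader of the multireader). */
    interface SubIndex {
        int docFreq(String term);
    }

    /** Sums per-index document frequencies, visiting one sub-index at a time. */
    static Map<String, Integer> totals(List<SubIndex> subIndexes, Iterable<String> terms) {
        // Sort (and de-duplicate) the terms once, so every sub-index is
        // probed in ascending term order.
        SortedSet<String> sorted = new TreeSet<>();
        for (String term : terms) {
            sorted.add(term);
        }

        // The "bookkeeping per term": a running total for each term.
        Map<String, Integer> totals = new TreeMap<>();
        for (SubIndex index : subIndexes) {      // outer loop: the readers
            for (String term : sorted) {         // inner loop: sorted terms
                totals.merge(term, index.docFreq(term), Integer::sum);
            }
        }
        return totals;
    }
}
```

Collecting the terms of many documents into one sorted set before the loops
would also mean each distinct term is looked up only once per sub-index.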

Regards,
Paul Elschot



On Monday 06 August 2007 13:12, tierecke wrote:
> 
> Thanks Daniel, you are completely right.
> I changed the code - but it doesn't make it [noticeably faster] - probably
> behind the scene it does run on the enum.
> I added some kind of hash table that keeps the docfreq already read so if I
> meet it again in another document I can retrieve it quickly - is there
> another solution? Maybe have a separate Lucene index for this? (In this case
> - can I read and write to the same index without closing it and reopening
> it? I want to read from it and if I don't find the docfreq there, calculate
> it and put it in the index).
> 
> 10x Nir.
> 
> 
> On Monday 06 August 2007 01:40, tierecke wrote:
> 
> >         Term term=new Term("contents", termstr);
> >         TermEnum termenum=multireader.terms(term);
> >         int freq=termenum.docFreq();
> 
> IndexReader has a docFreq() method, no need to get a Term enumeration.
> 
> regards
>  Daniel
> 

You are right but it doesn't make it faster.

Posted by tierecke <ni...@gmail.com>.

Thanks Daniel, you are completely right.
I changed the code, but it doesn't make it noticeably faster; probably
behind the scenes it still runs over the enum.
I added a hash table that keeps the docFreq values already read, so if I
meet the same term again in another document I can retrieve it quickly. Is there
another solution? Maybe have a separate Lucene index for this? (In this case,
can I read from and write to the same index without closing and reopening
it? I want to read from it and, if I don't find the docFreq there, calculate
it and put it in the index.)
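
For readers following along, the hash table idea might look roughly like the
sketch below. It is an illustration in modern Java, not the code actually
used: DocFreqCache and the lookup function are hypothetical names, and the
lookup is assumed to wrap whatever slow index call is in use (for example
the docFreq() method Daniel mentions).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Memoizes document-frequency lookups so each distinct term hits the index once. */
class DocFreqCache {
    private final Map<String, Integer> cache = new HashMap<>();
    // Hypothetical: wraps the slow call into the index.
    private final ToIntFunction<String> lookup;
    private int misses = 0;

    DocFreqCache(ToIntFunction<String> lookup) {
        this.lookup = lookup;
    }

    /** Returns the cached value if present; otherwise asks the index and remembers the answer. */
    int docFreq(String term) {
        Integer cached = cache.get(term);
        if (cached != null) {
            return cached;
        }
        misses++;
        int freq = lookup.applyAsInt(term);
        cache.put(term, freq);
        return freq;
    }

    /** Number of lookups that had to go to the index. */
    int misses() {
        return misses;
    }
}
```

A cache like this only stays correct while the index is not being updated,
since the stored frequencies are never refreshed.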

10x Nir.


Daniel Naber-10 wrote:
> 
> On Monday 06 August 2007 01:40, tierecke wrote:
> 
>>         Term term=new Term("contents", termstr);
>>         TermEnum termenum=multireader.terms(term);
>>         int freq=termenum.docFreq();
> 
> IndexReader has a docFreq() method, no need to get a Term enumeration.
> 
> regards
>  Daniel
> 


Re: docFreq takes long time to execute in a multiple index environment

Posted by Daniel Naber <lu...@danielnaber.de>.

On Monday 06 August 2007 01:40, tierecke wrote:

>         Term term=new Term("contents", termstr);
>         TermEnum termenum=multireader.terms(term);
>         int freq=termenum.docFreq();

IndexReader has a docFreq() method, no need to get a Term enumeration.

regards
 Daniel

-- 
http://www.danielnaber.de
