
Posted to java-user@lucene.apache.org by Amit Kumar <am...@uiuc.edu> on 2006/08/09 21:13:21 UTC

IndexReader.getTermFreqVector penalty

Hi Lucene Users,

I am using Lucene indices to get term frequencies, and I wanted to
check with you about the time it takes to retrieve them. Please
suggest whether I can improve the code or the index, or whether this
is expected. It takes 8 to 9 seconds to retrieve the term freq values
of all 1030 documents, with an index size of ~530MB.

Another question: do I need Field.Store.YES to get the term freq
vector?

Index Details:
-------------------
Size: 532 MB
1032 documents, with the number of terms per document varying from
600 to 100,000
The field is indexed with Field.Store.YES, Field.Index.TOKENIZED,
Field.TermVector.WITH_POSITIONS_OFFSETS


Term Freq Retrieval Time Values:
-------------------------------------

The time ranges from 8 to 9 seconds.

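  // Load the term vector of every non-deleted document and time the loop.
  long s = System.currentTimeMillis();
  TermFreqVector termFreqVector;
  for (int i = 0; i < 1030; i++) {
    if (!reader.isDeleted(i)) {
      termFreqVector = reader.getTermFreqVector(i, field);
    }
  }
  long l = System.currentTimeMillis();
  // Elapsed time in milliseconds: l - s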
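
Since the documents vary from 600 to 100,000 terms, a single total for the loop may hide whether a few large documents dominate the 8 to 9 seconds. The following is only a sketch of a per-document timer, not something from Lucene: the class DocLoopTimer and its Result type are hypothetical names, and it uses Java 8 functional interfaces, which postdate Lucene 2.0.

```java
import java.util.function.IntConsumer;
import java.util.function.IntPredicate;

/** Hypothetical helper, not part of Lucene: times an action per document. */
final class DocLoopTimer {

    /** Summary of one timed pass; all times are in nanoseconds. */
    static final class Result {
        final int docsVisited;    // number of non-deleted documents processed
        final long totalNanos;    // sum of the per-document times
        final long slowestNanos;  // largest single per-document time, -1 if none
        final int slowestDoc;     // document number of the slowest call, -1 if none

        Result(int docsVisited, long totalNanos, long slowestNanos, int slowestDoc) {
            this.docsVisited = docsVisited;
            this.totalNanos = totalNanos;
            this.slowestNanos = slowestNanos;
            this.slowestDoc = slowestDoc;
        }
    }

    /**
     * Runs {@code action} for every document number in [0, maxDoc) that
     * {@code isDeleted} does not reject, timing each call separately.
     */
    static Result time(int maxDoc, IntPredicate isDeleted, IntConsumer action) {
        int visited = 0;
        long total = 0;
        long slowest = -1;
        int slowestDoc = -1;
        for (int doc = 0; doc < maxDoc; doc++) {
            if (isDeleted.test(doc)) {
                continue; // skip deleted documents, as the original loop does
            }
            long t0 = System.nanoTime();
            action.accept(doc);
            long dt = System.nanoTime() - t0;
            visited++;
            total += dt;
            if (dt > slowest) {
                slowest = dt;
                slowestDoc = doc;
            }
        }
        return new Result(visited, total, slowest, slowestDoc);
    }
}
```

With the reader from the loop above, the predicate could be reader::isDeleted and the action a lambda that calls reader.getTermFreqVector(doc, field). That method declares IOException, so the lambda would have to catch it. Running the pass twice would also help separate first-read disk cost from the warm-cache cost.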


Hardware and Memory Settings:
-------------------------------------------
-Xmx2048m -XX:PermSize=16m -XX:MaxPermSize=128m

Dual 1800 MHz Opteron on 32-bit Linux 2.6.15.2; Lucene 2.0.0.




How can I get better results? Can I?



Many thanks for your help.
-Amit





---------------------------------------------------------
Amit Kumar
Research Programmer
The Graduate School of Library and Information Science
University of Illinois, Urbana Champaign IL, 61820
phone: 217-333-4118 fax: 217-244-3302
---------------------------------------------------------

Re: IndexReader.getTermFreqVector penalty

Posted by Amit Kumar <am...@uiuc.edu>.

Yes, thanks Grant. I realize that if I needed the term freqs for all
the documents I could use TermEnum, but I have a use case where I may
need term frequencies for only selected documents. In the worst case
that could mean term freqs for n-1 documents, where n is the total
number of documents in the index.
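
To make the selected-documents case concrete, here is a plain-Java model of per-document term vectors. It is a sketch under assumptions, not Lucene code: TermVectorModel and DocVector are hypothetical names, and the two parallel arrays only loosely mirror what TermFreqVector exposes through getTerms() and getTermFrequencies().

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of per-document term vectors; hypothetical, not Lucene's API. */
final class TermVectorModel {

    /** Sorted distinct terms of one document and their frequencies, in parallel arrays. */
    static final class DocVector {
        final String[] terms;
        final int[] freqs;

        DocVector(String[] terms, int[] freqs) {
            this.terms = terms;
            this.freqs = freqs;
        }

        /** Binary search over the sorted terms; returns 0 if the term is absent. */
        int freq(String term) {
            int i = Arrays.binarySearch(terms, term);
            return i >= 0 ? freqs[i] : 0;
        }
    }

    // document number -> its vector
    private final Map<Integer, DocVector> vectors = new LinkedHashMap<>();

    /** Counts the tokens of one document and stores them as a sorted vector. */
    void addDocument(int doc, String[] tokens) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        String[] terms = counts.keySet().toArray(new String[0]);
        int[] freqs = new int[terms.length];
        for (int i = 0; i < terms.length; i++) {
            freqs[i] = counts.get(terms[i]);
        }
        vectors.put(doc, new DocVector(terms, freqs));
    }

    /** Returns vectors for the selected documents only; unknown ids are skipped. */
    Map<Integer, DocVector> load(int[] selectedDocs) {
        Map<Integer, DocVector> out = new LinkedHashMap<>();
        for (int doc : selectedDocs) {
            DocVector v = vectors.get(doc);
            if (v != null) {
                out.put(doc, v);
            }
        }
        return out;
    }
}
```

In this model the work grows with the combined vocabulary of the selected documents, so selecting n-1 of n documents costs about as much as selecting all of them.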

-Amit


On Aug 9, 2006, at 2:34 PM, Grant Ingersoll wrote:

> Hi Amit,
>
> If you want all the freqs of all the terms (or even just some of
> the terms) in all documents, you don't need to use Term Vectors.
> Take a look at TermEnum and TermDocs.
>
> If you want them for specific documents, then you do need Term
> Vectors. You may save some CPU ticks by only keeping Positions (and
> not offsets, assuming you don't need them), but I haven't confirmed
> this. I just figure it is slightly less data to read and write, so
> it should be faster.
>
> No, you do not need to Store your docs to have a term vector. They
> are written to different parts of the Index.
>
> -Grant
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


Re: IndexReader.getTermFreqVector penalty

Posted by Grant Ingersoll <gr...@gmail.com>.

Hi Amit,

If you want all the freqs of all the terms (or even just some of the
terms) in all documents, you don't need to use Term Vectors. Take a
look at TermEnum and TermDocs.
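
As a rough illustration of that access pattern, here is a plain-Java model of an inverted index that is walked term by term. It is a sketch only: PostingsModel and Posting are hypothetical names, not Lucene classes. In Lucene 2.0 the corresponding walk would go through reader.terms() and reader.termDocs(term), reading TermDocs.doc() and TermDocs.freq() for each posting.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of an inverted index; hypothetical, not Lucene's API. */
final class PostingsModel {

    /** One posting: a document number and the term's frequency in that document. */
    static final class Posting {
        final int doc;
        final int freq;

        Posting(int doc, int freq) {
            this.doc = doc;
            this.freq = freq;
        }
    }

    // term -> postings; the TreeMap keeps terms sorted, the order a term
    // enumeration would visit them in.
    private final TreeMap<String, List<Posting>> postings = new TreeMap<>();

    /** Counts the tokens of one document and appends one posting per distinct term. */
    void addDocument(int doc, String[] tokens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            postings.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                    .add(new Posting(doc, e.getValue()));
        }
    }

    /** Frequency of a term in one document, found by scanning that term's postings. */
    int freq(String term, int doc) {
        List<Posting> list = postings.getOrDefault(term, Collections.<Posting>emptyList());
        for (Posting p : list) {
            if (p.doc == doc) {
                return p.freq;
            }
        }
        return 0;
    }

    /** Visits every term and every posting exactly once, summing frequencies per term. */
    Map<String, Integer> totalFreqPerTerm() {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Posting>> e : postings.entrySet()) {
            int sum = 0;
            for (Posting p : e.getValue()) {
                sum += p.freq;
            }
            totals.put(e.getKey(), sum);
        }
        return totals;
    }
}
```

The point of the model is that one pass over the postings yields the frequencies for all documents, with no per-document structure to load.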

If you want them for specific documents, then you do need Term
Vectors. You may save some CPU ticks by only keeping Positions (and
not offsets, assuming you don't need them), but I haven't confirmed
this. I just figure it is slightly less data to read and write, so it
should be faster.

No, you do not need to Store your docs to have a term vector. They
are written to different parts of the Index.

-Grant

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/


