Posted to java-user@lucene.apache.org by Ganesh <em...@yahoo.co.in> on 2009/06/30 09:37:06 UTC

Term Frequency vector consumes memory

At the end of each day, I build statistics on the top indexed terms. I enabled term frequency for a single field, and it works fine: I am able to get the top terms and their frequencies. However, it consumes a huge amount of RAM. My index is 5 GB and has 8 million records. Without term vectors enabled, I could index up to 17 GB with 40 million records.

When an IndexReader/Searcher is opened, does it load all the term vector frequencies?

Suppose I have enabled this option and indexed, say, 5 GB. Now I don't want the Reader/Searcher to load term vectors. Can I switch this feature off without re-indexing?

Regards
Ganesh

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Term Frequency vector consumes memory

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 1, 2009, at 1:39 AM, Ganesh wrote:

> Thanks for your reply.
>
> My requirement is to fetch the list of the most frequent terms indexed
> in a day. I used the logic described in the article linked below:
> http://stackoverflow.com/questions/195434/how-can-i-get-top-terms-for-a-subset-of-documents-in-a-lucene-index
>
> I enabled term vectors for a field and indexed the content, and I am
> able to retrieve the list of top indexed terms for a day or date range.
>
> When an IndexReader/Searcher is opened, does it load all the term
> vector frequencies?

No, it won't. Term vectors are stored on disk, much like stored fields.

>
> Suppose I have enabled this option and indexed, say, 5 GB. Now I
> don't want the Reader/Searcher to load term vectors. Can I switch
> this feature off without re-indexing?

I suppose so, although the approach you are using seems to rely on a
custom Collector, which means you would have to stop using that one.

Storing term vectors will indeed make your index much bigger, but it
shouldn't affect memory much unless you are caching, which probably
isn't a bad idea anyway.
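
For reference, the aggregation step behind a "top terms for a subset of
documents" collector can be sketched in plain Java. This is only an
illustrative sketch, not Lucene API: the class TopTerms and its methods
accumulate and topN are hypothetical names, and each document's term
vector is modeled as a simple Map<String, Integer>. The totals map grows
with the number of distinct terms seen, which is one plausible place for
memory to go in such a collector, though nothing in this thread confirms
that it is the cause here.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Lucene.
final class TopTerms {

    /** Adds one document's term -> frequency pairs into the running totals. */
    static void accumulate(Map<String, Integer> totals, Map<String, Integer> docVector) {
        for (Map.Entry<String, Integer> e : docVector.entrySet()) {
            totals.merge(e.getKey(), e.getValue(), Integer::sum);
        }
    }

    /** Returns up to n terms with the highest totals, highest first. */
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> totals, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Orders entries from weakest to strongest; on equal counts the
        // alphabetically later term is treated as weaker.
        Comparator<Map.Entry<String, Integer>> weakestFirst = (a, b) -> {
            int byCount = Integer.compare(a.getValue(), b.getValue());
            return byCount != 0 ? byCount : b.getKey().compareTo(a.getKey());
        };
        // Min-heap that never holds more than n entries.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(weakestFirst);
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            heap.offer(new AbstractMap.SimpleImmutableEntry<>(e));
            if (heap.size() > n) {
                heap.poll(); // drop the current weakest entry
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort(weakestFirst.reversed());
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> totals = new HashMap<>();
        Map<String, Integer> doc = new HashMap<>();
        doc.put("lucene", 3);
        doc.put("index", 1);
        accumulate(totals, doc);
        System.out.println(topN(totals, 1));
    }
}
```

The bounded heap keeps the selection step at n entries regardless of
vocabulary size; the totals map is the part that scales with the number
of distinct terms.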




--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search




Re: Term Frequency vector consumes memory

Posted by Ganesh <em...@yahoo.co.in>.
Thanks for your reply.

My requirement is to fetch the list of the most frequent terms indexed in a day. I used the logic described in the article linked below:
http://stackoverflow.com/questions/195434/how-can-i-get-top-terms-for-a-subset-of-documents-in-a-lucene-index

I enabled term vectors for a field and indexed the content, and I am able to retrieve the list of top indexed terms for a day or date range.

When an IndexReader/Searcher is opened, does it load all the term vector frequencies?

Suppose I have enabled this option and indexed, say, 5 GB. Now I don't want the Reader/Searcher to load term vectors. Can I switch this feature off without re-indexing?

Regards
Ganesh



Re: Term Frequency vector consumes memory

Posted by Grant Ingersoll <gs...@apache.org>.
In Lucene, a Term Vector is a specific thing that is stored on disk  
when creating a Document and Field.  It is optional and off by  
default.  It is separate from being able to get the term frequencies  
for all the docs in a specific field.  The former is decided at  
indexing time and there is no way to remove it w/o reindexing.   
Furthermore, it is not loaded into memory by the IndexReader.  Term  
Frequencies are accessed via the TermDocs.
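
The distinction drawn above can be illustrated with a toy in-memory
model in plain Java. This is a sketch under stated assumptions, not
Lucene's API or on-disk format: TinyIndex, its fields, and its add
method are hypothetical names. The postings map (term -> document ->
frequency) is always built and roughly corresponds to what is walked
through TermDocs; the per-document term vector (document -> term ->
frequency) is kept only if it was requested when the index was built.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical toy model, not part of Lucene.
final class TinyIndex {
    // Always built: term -> (docId -> frequency of the term in that doc).
    final Map<String, SortedMap<Integer, Integer>> postings = new TreeMap<>();
    // Optional: docId -> (term -> frequency), the "term vector" of a doc.
    final Map<Integer, Map<String, Integer>> termVectors = new HashMap<>();
    // Fixed when the index is created, mirroring an index-time decision.
    final boolean storeTermVectors;

    TinyIndex(boolean storeTermVectors) {
        this.storeTermVectors = storeTermVectors;
    }

    void add(int docId, String text) {
        // Count how often each token occurs in this document.
        Map<String, Integer> freqs = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        // Term frequencies are recorded in the postings in every case.
        for (Map.Entry<String, Integer> e : freqs.entrySet()) {
            postings.computeIfAbsent(e.getKey(), t -> new TreeMap<>())
                    .put(docId, e.getValue());
        }
        // The per-document vector is extra data, kept only on request.
        if (storeTermVectors) {
            termVectors.put(docId, freqs);
        }
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex(true);
        index.add(0, "lucene index lucene");
        System.out.println(index.postings);
        System.out.println(index.termVectors);
    }
}
```

In this model, term frequencies remain available through the postings
even when no term vectors are stored; the vectors only add a second,
document-centric copy of the same counts.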

Can you clarify a bit more what you are looking to do?  Perhaps some  
sample code will help demonstrate what you'd like to turn off, as I am  
not clear on your question.

Cheers,
Grant
