Posted to java-user@lucene.apache.org by Marc Weeber <ma...@weeber.net> on 2007/12/19 23:33:37 UTC

Sorting, Caching, and garbage collection

Dear all,

I have been using PyLucene for some time now (really loving it,
actually), and I have now encountered an intriguing situation. I have
a large index file, 17 GB, 50M documents. I want to look for
co-occurrences of terms in a certain field (a boolean query) and rank-
order the results on another field (date). In some situations this
works like a charm. Sure, the sort needs lots of memory (around 500M),
but once that's up and running, sorted queries are really fast. In
other situations, however, memory use explodes. This occurs across a
range of Java and Lucene versions.

Let me show the Python code (it should be readable by all of you Java
people):

===================
stable memory use code
===================

# the import PyLucene needs for this snippet
import lucene

# initialize things: directory, searcher, and sort object are created once
storeDir = 'LuceneData/PubmedSentenceIndex/'
store = lucene.FSDirectory.getDirectory(storeDir, False)
searcher = lucene.IndexSearcher(store)
sortObject = lucene.Sort('date', False)

# six example term pairs for co-occurrence queries
concepts = [
    ["umls/C0086418", "umls/C0003062"],
    ["umls/C0086418", "umls/C0870071"],
    ["umls/C0870071", "umls/C0003062"],
    ["umls/C0870071", "umls/C0449445"],
    ["umls/C0086418", "umls/C0449445"],
    ["umls/C0003062", "umls/C0449445"],
]

for cui1, cui2 in concepts:
    # both terms must occur in the 'profile' field
    q = lucene.BooleanQuery()
    q.add(lucene.TermQuery(lucene.Term('profile', cui1)),
          lucene.BooleanClause.Occur.MUST)
    q.add(lucene.TermQuery(lucene.Term('profile', cui2)),
          lucene.BooleanClause.Occur.MUST)

    # search with the sort object; results come back ordered by date
    hits = searcher.search(q, sortObject)

=======

In the code above, the major objects are created before entering the
loop. In the loop, the query is generated and the searcher executes it,
with the addition of the sort object. After the first search, memory
use is about 500M and remains stable during all subsequent loops.

However, if I create the searcher object EACH time in the loop (for
instance, just before the actual search is done), each search adds
about 300M to the memory usage of the process. It seems that garbage
collection does not really work here. I tried to invoke both the Python
(gc.collect()) and the Java (lucene.System.gc()) garbage collectors,
but no such luck. Adding sleep times, for instance, does not work
either.
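
For reference, here is the variant I mean. It is identical to the code
above, except that the searcher is now constructed inside the loop,
just before the search, and the garbage-collection calls I tried are
included:

===================
exploding memory use code
===================

import gc
import lucene

for cui1, cui2 in concepts:
    q = lucene.BooleanQuery()
    q.add(lucene.TermQuery(lucene.Term('profile', cui1)),
          lucene.BooleanClause.Occur.MUST)
    q.add(lucene.TermQuery(lucene.Term('profile', cui2)),
          lucene.BooleanClause.Occur.MUST)

    # a new searcher per iteration: each search now adds ~300M
    searcher = lucene.IndexSearcher(store)
    hits = searcher.search(q, sortObject)

    # garbage collection attempts; neither reclaims the memory
    gc.collect()          # Python garbage collector
    lucene.System.gc()    # Java garbage collector

=======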

Do any of you have an inkling of what is going on here?

Thanks in advance for any information,

best,

Marc






