Posted to solr-user@lucene.apache.org by mdz-munich <se...@bsb-muenchen.de> on 2011/04/26 15:28:45 UTC

TermsCompoment + Dist. Search + Large Index + HEAP SPACE

Hi!

We've got one index split into 4 shards of ~70,000 records each, containing large
full-text documents from (very dirty) OCR. As a result we have a lot of "unique" terms.
Now we are trying to obtain the 400 most common words for the "CommonGramsFilter"
via the TermsComponent, but the request always runs out of memory. The VM is
equipped with 32 GB of RAM, with 16-26 GB allocated to the Java VM.

Any ideas how to get the most common terms without increasing the VM's memory?
 
Thanks & best regards,

Sebastian 

--
View this message in context: http://lucene.472066.n3.nabble.com/TermsCompoment-Dist-Search-Large-Index-HEAP-SPACE-tp2865609p2865609.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: TermsCompoment + Dist. Search + Large Index + HEAP SPACE

Posted by mdz-munich <se...@bsb-muenchen.de>.
Thanks for your suggestion. The problem seems to be the combination of shards and
the TermsComponent. Now we simply query each shard individually, without the
"shards" and "shards.qt" params, and merge the results via XSLT.
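The merge step described above can be sketched as follows (here in Python rather than XSLT, purely for illustration; the input dicts stand in for each shard's already-parsed standalone TermsComponent response, and all names are assumptions, not part of the original post):

```python
from collections import Counter

def merge_shard_terms(shard_term_counts, top_n=400):
    """Merge per-shard term->count maps and return the top_n most common terms.

    shard_term_counts: a list of dicts, one per shard, each mapping a term to
    its document/term count as reported by that shard's /terms handler.
    """
    totals = Counter()
    for counts in shard_term_counts:
        # Counter.update() adds counts for terms that appear in several shards.
        totals.update(counts)
    return totals.most_common(top_n)

# Toy example with two shards:
shard1 = {"the": 100, "ocr": 40, "noise": 5}
shard2 = {"the": 80, "ocr": 10, "und": 60}
print(merge_shard_terms([shard1, shard2], top_n=3))
# -> [('the', 180), ('und', 60), ('ocr', 50)]
```

Summing raw per-shard counts like this is exact for total frequencies, which is all the CommonGramsFilter word list needs; no distributed request is involved, so the memory cost stays bounded by one shard's term list at a time.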

Sebastian 

--
View this message in context: http://lucene.472066.n3.nabble.com/TermsCompoment-Dist-Search-Large-Index-HEAP-SPACE-tp2865609p2866499.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: TermsCompoment + Dist. Search + Large Index + HEAP SPACE

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Don't know your use case, but if you just want a list of the 400 most common words, you can use the Lucene contrib class HighFreqTerms.java with the -t flag. You have to point it at your Lucene index. You also probably want Solr not to be running, and you should give the JVM running HighFreqTerms a lot of memory.

http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java?view=log
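An invocation might look like the following (a sketch only: the jar names, paths, and heap size are assumptions for a Lucene 3.x contrib build, not taken from the thread; only the class name and the -t flag, which sorts by total term frequency, come from the message above):

```shell
# Run against the raw Lucene index directory while Solr is stopped.
# Adjust classpath, heap, index path, and field name to your setup.
java -Xmx8g \
  -cp lucene-core-3.1.0.jar:lucene-misc-3.1.0.jar \
  org.apache.lucene.misc.HighFreqTerms \
  /path/to/solr/data/index -t 400 text
```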

Tom
http://www.hathitrust.org/blogs/large-scale-search

-----Original Message-----
From: mdz-munich [mailto:sebastian.lutze@bsb-muenchen.de] 
Sent: Tuesday, April 26, 2011 9:29 AM
To: solr-user@lucene.apache.org
Subject: TermsCompoment + Dist. Search + Large Index + HEAP SPACE

[quoted text of the original message trimmed]