Posted to solr-user@lucene.apache.org by Charlie Gildawie <Ch...@skyscanner.net> on 2010/11/11 19:29:07 UTC

Memory used by facet queries

Hello All.

This is my first post, so be kind. We're developing a document store with lots and lots of very small documents (200 million at the moment; the final size will probably be double that, at around 400 million documents). This is proof-of-concept development, so we are seeing what a single node can do for us before we consider sharding. We'd rather not shard if we don't have to.

I'm using Solr 4.0 (for the simple facet pivots and groups, which work well).

We're into week 4 of our development and have the production servers etc. set up. Everything was working very well until we started testing queries with production volumes of data.

I'm running into Java heap space exceptions during simple faceting on string fields. The fields we are currently faceting on are names - country / continent / city names - all stored as untokenized string fields (solr.StrField). There are other, tokenized fields that provide the initial search, but we want to use the plain string fields for faceted navigation. In total there are 10 fields we'd ever want to facet on: 8 name fields that are strings, and 2 date-part fields (year and yearMonth) that are also strings.
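
For reference, the facet fields are defined along these lines in our schema (a simplified sketch; the field names are illustrative):

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

<field name="country"   type="string" indexed="true" stored="true"/>
<field name="city"      type="string" indexed="true" stored="true"/>
<field name="yearMonth" type="string" indexed="true" stored="true"/>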

This is our first time using Solr, and I didn't realise that we'd need so much heap for facets!

Solr is running in a Tomcat container, and I've currently set Tomcat to use a max of:

JAVA_OPTS="$JAVA_OPTS -server -Xms512m -Xmx30000m"

I've been reading all I can find online and have seen advice to warm the facet caches as soon as the Solr service starts. However, I'd really like to know if there are ways to reduce the memory footprint itself. We currently have 32 GB of physical RAM. Adding more RAM is an option, but I'm being asked the (completely reasonable) question: "Why do you need so much?"
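
One concrete thing I plan to try, based on that reading, is forcing facet.method=enum on these fields, since as I understand it that enumerates the indexed terms via the filterCache instead of un-inverting the whole field onto the heap. A minimal sketch of the request I have in mind, via SolrJ (the field name is illustrative):

import org.apache.solr.client.solrj.SolrQuery;

public class FacetMethodSketch {

    // Sketch only: "country" stands in for one of our name fields.
    // facet.method=enum walks the indexed terms and intersects via the
    // filterCache, rather than building the un-inverted (fc) structure
    // on the heap for every facet field.
    public static SolrQuery countryFacet() {
        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetField("country");
        q.set("f.country.facet.method", "enum"); // per-field override
        q.setFacetMinCount(1);
        return q;
    }
}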

Please help!

Charlie.


-----Original Message-----
From: Robert Gründler [mailto:robert@dubture.com]
Sent: 11 November 2010 18:14
To: solr-user@lucene.apache.org
Subject: Re: Concatenate multiple tokens into one

I've posted a ConcatFilter in my previous mail which does concatenate tokens. This works fine, but I realized that what I wanted to achieve is easier to implement in another way (by using 2 separate field types).

Have a look at a previous mail I wrote to the list and the reply from Ahmet Arslan (topic: "EdgeNGram relevancy").


best


-robert




On Nov 11, 2010, at 5:27 PM, Nick Martin wrote:

> Hi Robert, All,
>
> I have a similar problem, here is my fieldType,
> http://paste.pocoo.org/show/289910/
> I want to include stopword removal and lowercase the incoming terms. The idea is to take "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the EdgeNGram filter factory.
> If anyone can tell me a simple way to concatenate tokens into one token again, similar to the KeywordTokenizer, that would be super helpful.
>
> Many thanks
>
> Nick
>
> On 11 Nov 2010, at 00:23, Robert Gründler wrote:
>
>>
>> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
>>
>>> Are you sure you really want to throw out stopwords for your use case?  I don't think autocompletion will work how you want if you do.
>>
>> In our case I think it makes sense. The content is targeting the electronic music / DJ scene, so we have a lot of words like "DJ" or "featuring" which make sense to throw out of the query. Also, searches for "the beastie boys" and "beastie boys" should both return a match in the autocompletion.
>>
>>>
>>> And if you don't... then why use the WhitespaceTokenizer and then try to jam the tokens back together? Why not just NOT tokenize in the first place? Use the KeywordTokenizer, which really should be called the NonTokenizingTokenizer, because it doesn't tokenize at all; it just creates one token from the entire input string.
>>
>> I started out with the KeywordTokenizer, which worked well, except for the stopword problem.
>>
>> For now, I've come up with a quick-and-dirty custom "ConcatFilter", which does what I'm after:
>>
>> import java.io.IOException;
>>
>> import org.apache.lucene.analysis.TokenFilter;
>> import org.apache.lucene.analysis.TokenStream;
>> import org.apache.lucene.analysis.tokenattributes.TermAttribute;
>> import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
>>
>> public class ConcatFilter extends TokenFilter {
>>
>>     // attributes are shared along the chain, so these see what the
>>     // upstream tokenizer produced
>>     private final TermAttribute termAttribute = addAttribute(TermAttribute.class);
>>     private final TypeAttribute typeAttribute = addAttribute(TypeAttribute.class);
>>     private boolean done = false;
>>
>>     protected ConcatFilter(TokenStream input) {
>>         super(input);
>>     }
>>
>>     @Override
>>     public boolean incrementToken() throws IOException {
>>         if (done) {
>>             return false;
>>         }
>>         done = true;
>>
>>         StringBuilder builder = new StringBuilder();
>>         boolean sawToken = false;
>>
>>         // drain the upstream stream, concatenating all "word" tokens
>>         while (input.incrementToken()) {
>>             if (typeAttribute.type().equals("word")) {
>>                 builder.append(termAttribute.term());
>>             }
>>             sawToken = true;
>>         }
>>
>>         if (!sawToken) {
>>             return false;
>>         }
>>
>>         // emit a single token holding the concatenated text
>>         clearAttributes();
>>         termAttribute.setTermBuffer(builder.toString());
>>         return true;
>>     }
>>
>>     @Override
>>     public void reset() throws IOException {
>>         super.reset();
>>         done = false; // allow the stream to be reused
>>     }
>> }
>>
>> I'm not sure if this is a safe way to do this, as I'm not familiar with the whole solr/lucene implementation after all.
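>>
>> To actually use it from the schema, I've also sketched a trivial factory (untested; the class and package names are mine):
>>
>> import org.apache.lucene.analysis.TokenStream;
>> import org.apache.solr.analysis.BaseTokenFilterFactory;
>>
>> public class ConcatFilterFactory extends BaseTokenFilterFactory {
>>
>>     // lets the filter be referenced from a fieldType, e.g.
>>     // <filter class="my.pkg.ConcatFilterFactory"/> (package illustrative)
>>     @Override
>>     public TokenStream create(TokenStream input) {
>>         return new ConcatFilter(input);
>>     }
>> }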
>>
>>
>> best
>>
>>
>> -robert
>>
>>
>>
>>
>>>
>>> Then lowercase, remove whitespace (or not), do whatever else you want to do to your single token to normalize it, and then edgengram it.
>>>
>>> If you include whitespace in the token, then when making your queries for auto-complete, be sure to use a query parser that doesn't do "pre-tokenization", the 'field' query parser should work well for this.
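>>>
>>> E.g. the index analyzer could look like this (an untested sketch; the fieldType name is made up):
>>>
>>> <fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
>>>   <analyzer type="index">
>>>     <!-- no tokenization: the whole input becomes a single token -->
>>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <!-- strip everything except a-z, whitespace included -->
>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
>>>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
>>>   </analyzer>
>>>   <analyzer type="query">
>>>     <!-- same normalization, but no ngrams on the query side -->
>>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
>>>   </analyzer>
>>> </fieldType>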
>>>
>>> Jonathan
>>>
>>>
>>>
>>> ________________________________________
>>> From: Robert Gründler [robert@dubture.com]
>>> Sent: Wednesday, November 10, 2010 6:39 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Concatenate multiple tokens into one
>>>
>>> Hi,
>>>
>>> I've created the following filter chain in a field type; the idea is to use it for autocompletion purposes:
>>>
>>> <tokenizer class="solr.WhitespaceTokenizerFactory"/>  <!-- create tokens separated by whitespace -->
>>> <filter class="solr.LowerCaseFilterFactory"/>  <!-- lowercase everything -->
>>> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>  <!-- throw out stopwords -->
>>> <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>  <!-- throw out everything except a-z -->
>>>
>>> <!-- actually, here I would like to join multiple tokens together again, to provide one token for the EdgeNGramFilterFactory -->
>>>
>>> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>  <!-- create edgeNGram tokens for autocomplete matches -->
>>>
>>> With that kind of filter chain, the EdgeNGramFilterFactory will receive multiple tokens for input strings that contain whitespace. This leads to the following results:
>>> Input Query: "George Cloo"
>>> Matches:
>>> - "George Harrison"
>>> - "John Clooridge"
>>> - "George Smith"
>>> - "George Clooney"
>>> - etc
>>>
>>> However, only "George Clooney" should match in the autocompletion use case.
>>> Therefore, I'd like to add a filter before the EdgeNGramFilterFactory which concatenates all the tokens generated by the WhitespaceTokenizerFactory.
>>> Are there filters which can do such a thing?
>>>
>>> If not, are there examples how to implement a custom TokenFilter?
>>>
>>> thanks!
>>>
>>> -robert
>>>
>>>
>>>
>>>
>>
>


