Posted to solr-user@lucene.apache.org by Simon Stanlake <si...@tradebytes.com> on 2009/04/30 03:55:03 UTC

understanding facets and tokens

Hi,
I'm trying to debug a faceting performance problem. I've pretty much given up, but I was hoping someone could shed some light on it.

My index has 80 million documents, all of which are small - one 1000-char text field and a bunch of 30-50 char fields. I've got 24G of RAM allocated to the JVM on a brand new server.

I have one field in my schema which represents a city name. It is a non-standardized free-text field, so you end up with values like the following:

HOUSTON
HOUSTON TX
HOUSTON, TX
HOUSTON (TX)

I would like to facet on this field and thought I could apply some tokenizers / filters to modify the indexed value to strip out stopwords. To tie it all together I created a filter that would concatenate all of the tokens back into a single token at the end. Here's my field definition from schema.xml

	
	<fieldType name="portCity" class="solr.TextField">
		<analyzer>
			<tokenizer class="solr.StandardTokenizerFactory"/>
			<filter class="solr.StandardFilterFactory"/>
			<!-- stopwords common across all fields -->
			<filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true"/>
			<!-- stopwords specific to port cities -->
			<filter class="solr.StopFilterFactory" words="portCityStopwords.txt" enablePositionIncrements="true"/>
			<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
			<!-- pull tokens all together again -->
			<filter class="com.tradebytes.solr.ConcatenateFilterFactory"/>
		</analyzer>
	</fieldType>
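
The post doesn't show the corresponding <field> entry; assuming the field is also named portCity, it would be declared along these lines:

	<!-- assumed field declaration; the actual field name isn't shown in the post -->
	<field name="portCity" type="portCity" indexed="true" stored="true"/>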

The analysis seems to be working as I expected and the index contains the values I want. However, when I facet on this field the query typically takes around 30s, versus sub-second when I just use a solr.StrField. I understand from the list that the method Solr uses to create the facet counts differs depending on whether the field is tokenized or not, but I thought I could mitigate that by making sure each field value ends up as a single token.

Is there anything else I can do here? Can someone shed some light on why a tokenized field takes longer, even when there is only one token per field? I suspect I am going to be stuck implementing custom field translation before loading, but I was hoping I could leverage some of the great filters that are built in with Solr / Lucene. I've played around with the field cache but so far no luck.
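
(A facet request against this field is essentially of the following form; the host, port, and limit are placeholders, and the field name portCity is an assumption:)

	http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=portCity&facet.limit=20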

BTW love solr / lucene, great job!

Thanks,
Simon

RE: understanding facets and tokens

Posted by Simon Stanlake <si...@tradebytes.com>.
OK I will try that.

I am pretty sure my concatenate filter is working; I tested it using the AnalysisRequestHandler. I've included the code below.

package com.tradebytes.solr.analysis;

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class ConcatenateTokenFilter extends TokenFilter {

	// Public so the factory (which lives in a different package) can construct it.
	public ConcatenateTokenFilter(TokenStream input) {
		super(input);
	}

	/**
	 * Consumes all remaining tokens from the upstream filters and returns a
	 * single token containing their terms joined by single spaces. The next
	 * call returns null, ending the stream.
	 */
	public final Token next(final Token reusableToken)
	throws IOException
	{
		assert reusableToken != null;
		
		Token ret = null;
		Token t = null;
		while ((t = input.next(reusableToken)) != null)
		{
			if (ret != null)
			{
				ret = concatenate(ret,t);
			} else
			{
				// Clone the first token; reusableToken's buffer is overwritten
				// on the next call to input.next().
				ret = t.clone(t.termBuffer(),0,t.termLength(),t.startOffset(),t.endOffset());
			}
		}
		
		return ret;
	}

	/** Returns a new Token whose term text is the two terms joined by a single space. */
	private Token concatenate(Token start, Token end)
	{
		char[] buff = new char[start.termLength() + end.termLength() + 1];
		System.arraycopy(start.termBuffer(), 0, buff, 0, start.termLength());
		buff[start.termLength()] = ' ';
		System.arraycopy(end.termBuffer(), 0, buff, start.termLength() + 1, end.termLength());
		return new Token(buff,0,buff.length,0,buff.length);
	}
	
}
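
For reference, the ConcatenateFilterFactory referenced in the schema isn't shown in the post; in Solr 1.3 it would be a thin wrapper roughly like the following sketch (the author's actual code may differ):

package com.tradebytes.solr;

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

import com.tradebytes.solr.analysis.ConcatenateTokenFilter;

public class ConcatenateFilterFactory extends BaseTokenFilterFactory {

	// Wraps the upstream analysis chain with the filter that joins all
	// tokens for a field value into a single token.
	public TokenStream create(TokenStream input) {
		return new ConcatenateTokenFilter(input);
	}
}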

As for the JVM heap, the server has 32G of RAM, so I would still have 8G for the OS. I will try turning it down to 16G and see if that makes a difference, though I suspect the improvements in the latest Solr build will help significantly.
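
(With the stock Jetty setup, that would mean starting Solr with something like the line below; the exact startup script and flags are an assumption:)

	java -Xms16g -Xmx16g -jar start.jar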

Thanks,
Simon


-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] 
Sent: Thursday, April 30, 2009 8:15 PM
To: solr-user@lucene.apache.org
Subject: Re: understanding facets and tokens


Hello Simon,

I'll assume you are using Solr 1.3.  Grab the latest Solr nightly and try with that - your multi-token facets should be faster (are you sure, sure, sure you are ending up with a single token?).

Also, most probably unrelated to this is the suspiciously large JVM heap.  My guess is it's too large.  Solr will be happier if you leave some RAM to the OS to cache the index itself.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Simon Stanlake <si...@tradebytes.com>
> To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
> Sent: Wednesday, April 29, 2009 9:55:03 PM
> Subject: understanding facets and tokens
> 
> Hi,
> Trying to debug a faceting performance problem. I've pretty much given up but 
> was hoping someone could shed some light on my problems.
> 
> My index has 80 million documents, all of which are small - one 1000 char text 
> field and a bunch of 30-50 char fields. Got 24G ram allocated to the jvm on a 
> brand new server.
> 
> I have one field in my schema which represents a city name. It is a non 
> standardized free text field, so you have problems like the following
> 
> HOUSTON
> HOUSTON TX
> HOUSTON, TX
> HOUSTON (TX)
> 
> I would like to facet on this field and thought I could apply some tokenizers / 
> filters to modify the indexed value to strip out stopwords. To tie it all 
> together I created a filter that would concatenate all of the tokens back into a 
> single token at the end. Here's my field definition from schema.xml
> 
>     <fieldType name="portCity" class="solr.TextField">
>         <analyzer>
>             <tokenizer class="solr.StandardTokenizerFactory"/>
>             <filter class="solr.StandardFilterFactory"/>
>             <!-- stopwords common across all fields -->
>             <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true"/>
>             <!-- stopwords specific to port cities -->
>             <filter class="solr.StopFilterFactory" words="portCityStopwords.txt" enablePositionIncrements="true"/>
>             <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>             <!-- pull tokens all together again -->
>             <filter class="com.tradebytes.solr.ConcatenateFilterFactory"/>
>         </analyzer>
>     </fieldType>
> 
> The analysis seems to be working as I expected and the index contains the values 
> I want. However when I facet on this field the query returns in typically around 
> 30s, versus sub-second when I just use a solr.StrField. I understand from the 
> lists that the method that solr uses to create the facet counts is different 
> depending on whether the field is tokenized vs not tokenized, but I thought I 
> could mitigate that somewhat by making sure that each field only had one token.
> 
> Is there anything else I can do here? Can someone shed some light on why a 
> tokenized field takes longer, even if there is only one token per field? I 
> suspect I am going to be stuck with implementing custom field translation before 
> loading but was hoping I could leverage some of the great filters that are built 
> in with solr / lucene. I've played around with fieldcache but so far no luck.
> 
> BTW love solr / lucene, great job!
> 
> Thanks,
> Simon


Re: understanding facets and tokens

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello Simon,

I'll assume you are using Solr 1.3.  Grab the latest Solr nightly and try with that - your multi-token facets should be faster (are you sure, sure, sure you are ending up with a single token?).

Also, most probably unrelated to this is the suspiciously large JVM heap.  My guess is it's too large.  Solr will be happier if you leave some RAM to the OS to cache the index itself.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Simon Stanlake <si...@tradebytes.com>
> To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
> Sent: Wednesday, April 29, 2009 9:55:03 PM
> Subject: understanding facets and tokens
> 
> Hi,
> Trying to debug a faceting performance problem. I've pretty much given up but 
> was hoping someone could shed some light on my problems.
> 
> My index has 80 million documents, all of which are small - one 1000 char text 
> field and a bunch of 30-50 char fields. Got 24G ram allocated to the jvm on a 
> brand new server.
> 
> I have one field in my schema which represents a city name. It is a non 
> standardized free text field, so you have problems like the following
> 
> HOUSTON
> HOUSTON TX
> HOUSTON, TX
> HOUSTON (TX)
> 
> I would like to facet on this field and thought I could apply some tokenizers / 
> filters to modify the indexed value to strip out stopwords. To tie it all 
> together I created a filter that would concatenate all of the tokens back into a 
> single token at the end. Here's my field definition from schema.xml
> 
>     <fieldType name="portCity" class="solr.TextField">
>         <analyzer>
>             <tokenizer class="solr.StandardTokenizerFactory"/>
>             <filter class="solr.StandardFilterFactory"/>
>             <!-- stopwords common across all fields -->
>             <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true"/>
>             <!-- stopwords specific to port cities -->
>             <filter class="solr.StopFilterFactory" words="portCityStopwords.txt" enablePositionIncrements="true"/>
>             <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>             <!-- pull tokens all together again -->
>             <filter class="com.tradebytes.solr.ConcatenateFilterFactory"/>
>         </analyzer>
>     </fieldType>
> 
> The analysis seems to be working as I expected and the index contains the values 
> I want. However when I facet on this field the query returns in typically around 
> 30s, versus sub-second when I just use a solr.StrField. I understand from the 
> lists that the method that solr uses to create the facet counts is different 
> depending on whether the field is tokenized vs not tokenized, but I thought I 
> could mitigate that somewhat by making sure that each field only had one token.
> 
> Is there anything else I can do here? Can someone shed some light on why a 
> tokenized field takes longer, even if there is only one token per field? I 
> suspect I am going to be stuck with implementing custom field translation before 
> loading but was hoping I could leverage some of the great filters that are built 
> in with solr / lucene. I've played around with fieldcache but so far no luck.
> 
> BTW love solr / lucene, great job!
> 
> Thanks,
> Simon