Posted to solr-user@lucene.apache.org by Andrew Chalupa <ac...@hotmail.com> on 2010/08/12 20:53:52 UTC
Field getting tokenized prior to charFilter on select query
I'm trying to use PatternReplaceCharFilterFactory, but I'm running into issues on both 1.4.1 (to which I ported it) and the nightly build (4.0-2010-07-27). On a real query, the charFilter doesn't seem to be executed before the tokenizer.
I modified the example configuration included in the distribution with the following fieldType in schema.xml and mapped a new field to it.
<!-- Field definition for name text field -->
<fieldtype name="nameText" class="solr.TextField">
<analyzer>
<!-- Replace (char & char) or (char and char) with (char&char) -->
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="(.*?)(\b(\w) (&amp;|and) (\w))(.*?)" replacement="$1$3&amp;$5$6"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
</analyzer>
</fieldtype>
<field name="name" type="nameText" indexed="true" stored="true" required="false" omitNorms="true" />
I validated that the regex works properly outside of Solr using plain Java. The regex normalizes single word characters around an '&' into a consistent form for searching. For example, it turns "A & B Company" into "A&B Company". The user can then search on "A&B", "A and B", or "A & B" and the proper result will be found.
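For reference, the standalone check described above can be sketched roughly as follows. This is only an illustration using java.util.regex with the same pattern and replacement as the charFilter configuration, written as Java string literals; the class and method names are made up for the example, and it does not exercise Solr's PatternReplaceCharFilter itself.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Solr.
final class NameNormalizer {
    // Same regex as the charFilter's "pattern" attribute.
    private static final Pattern AMPERSAND =
        Pattern.compile("(.*?)(\\b(\\w) (&|and) (\\w))(.*?)");
    // Same value as the charFilter's "replacement" attribute.
    private static final String REPLACEMENT = "$1$3&$5$6";

    // Applies the replacement to the whole input string at once.
    static String normalize(String input) {
        return AMPERSAND.matcher(input).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        String[] samples = {"A & B Company", "A and B Company", "A&B Company"};
        for (String s : samples) {
            System.out.println(s + " -> " + normalize(s));
        }
    }
}
```

All three sample inputs come out as "A&B Company" when the regex sees the complete string.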
However, when I import a document containing "A & B Company", I can never find it with the query "A & B", although the query "A&B" does find it. When I run analysis.jsp, it works properly and matches with any of the combinations.
So from this I concluded that it was being indexed properly, but that for some reason the query wasn't applying the regex properly. I hooked up a debugger and could see a difference between how the analyzer and the query were applying the charFilter. When the analyzer invoked PatternReplaceCharFilterFactory.create(CharStream), the entire field was provided in a single call. When the query invoked PatternReplaceCharFilterFactory.create(CharStream), it invoked it 3 times with 3 separate tokens (A, &, B). Because of this, the regex can never match the full string in the field.
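The difference between the two call patterns can be reproduced outside Solr. The sketch below is a rough simulation, not Solr code: splitting on whitespace is only a crude stand-in for the three separate invocations seen in the debugger, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Solr.
final class PerClauseDemo {
    private static final Pattern AMPERSAND =
        Pattern.compile("(.*?)(\\b(\\w) (&|and) (\\w))(.*?)");

    static String normalize(String input) {
        return AMPERSAND.matcher(input).replaceAll("$1$3&$5$6");
    }

    // Assumption: mimic the query side by splitting on whitespace first
    // and running the regex on each piece independently.
    static List<String> normalizeEachClause(String queryText) {
        List<String> out = new ArrayList<>();
        for (String clause : queryText.trim().split("\\s+")) {
            out.add(normalize(clause));
        }
        return out;
    }

    public static void main(String[] args) {
        // Whole string in one call: the pattern can match.
        System.out.println(normalize("A & B"));
        // One call per piece: no piece contains "char & char".
        System.out.println(normalizeEachClause("A & B"));
    }
}
```

Run on the whole string, the regex yields "A&B". Run on each piece, it leaves "A", "&", and "B" unchanged, since none of them contains the full "char & char" sequence.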
I'm using the following encoded URLs to perform the queries.
This works
http://localhost:8983/solr/select?q=name:%28a%26b%29
But this doesn't
http://localhost:8983/solr/select?q=name:%28a+%26+b%29
Why is the query parser tokenizing the name field prior to the charFilter getting a chance to perform processing?
Re: Field getting tokenized prior to charFilter on select query
Posted by Chris Hostetter <ho...@fucit.org>.
You are seeing the effects of the default QueryParser.
Whitespace (like '+', '-', '"', '*', etc.) is a "special character" to the
Lucene QueryParser. Unescaped, unquoted whitespace tells the query parser
to construct a BooleanQuery containing multiple clauses, and each clause is
analyzed separately.
To have the entire input passed to the Analyzer as a single string, you
would either quote it or use a different QParser, such as the "field"
QParser...
http://wiki.apache.org/solr/SolrQuerySyntax#Other_built-in_useful_query_parsers
http://lucene.apache.org/solr/api/org/apache/solr/search/FieldQParserPlugin.html
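As an illustration of the two options, the sketch below builds encoded select URLs for the quoted form and for the "field" QParser using its local-params syntax ({!field f=name}). The host, port, and field name are taken from the original post. The helper class and method names are hypothetical, and the phrase helper assumes the text contains no double quotes that would need escaping.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class for building the example URLs.
final class SelectUrlBuilder {
    static final String BASE = "http://localhost:8983/solr/select?q=";

    // URL-encodes the q parameter (form encoding: spaces become '+').
    static String selectUrl(String q) {
        try {
            return BASE + URLEncoder.encode(q, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Option 1: quote the input so it reaches the analyzer as one string.
    static String phraseQuery(String field, String text) {
        return field + ":\"" + text + "\"";
    }

    // Option 2: select the "field" QParser via local params.
    static String fieldParserQuery(String field, String text) {
        return "{!field f=" + field + "}" + text;
    }

    public static void main(String[] args) {
        System.out.println(selectUrl(phraseQuery("name", "a & b")));
        System.out.println(selectUrl(fieldParserQuery("name", "a & b")));
    }
}
```

The raw query strings before encoding are name:"a & b" and {!field f=name}a & b.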
: entire field was provided in a single call. When the query invoked
: PatternReplaceCharFilterFactory.create(CharStream) it invoked it 3 times
: with 3 separate tokens (A, &, B). Because of this the regex won't ever
: locate the full string in the field.
-Hoss