Posted to solr-user@lucene.apache.org by Andrew Chalupa <ac...@hotmail.com> on 2010/08/12 20:53:52 UTC

Field getting tokenized prior to charFilter on select query

I'm attempting to make use of PatternReplaceCharFilterFactory, but I'm running into issues on both 1.4.1 (I ported it) and on the nightly build (4.0-2010-07-27).  It seems that on a real query the charFilter isn't executed prior to the tokenizer.

I modified the example configuration included in the distribution with the following fieldType in schema.xml and mapped a new field to it. 
    <!-- Field definition for name text field -->
    <fieldtype name="nameText" class="solr.TextField">
      <analyzer>
        <!-- Replace (char & char) or (char and char) with (char&char) -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="(.*?)(\b(\w) (&amp;|and) (\w))(.*?)" replacement="$1$3&amp;$5$6"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
      </analyzer>
    </fieldtype>    
    
    <field name="name" type="nameText" indexed="true" stored="true" required="false" omitNorms="true" />
    
I validated that the regex works properly outside of Solr using just Java.  The regex attempts to normalize single word characters around an '&' into something consistent for searching.  For example, it will turn "A & B Company" into "A&B Company".  The user can then search on "A&B", "A and B", or "A & B" and the proper result will be located.
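For reference, a standalone check along these lines could look like the sketch below. The class and method names (NameRegexCheck, normalize) are made up for illustration, and String-level replaceAll is only assumed to be a reasonable stand-in for what the charFilter does with the pattern. The XML-escaped &amp; from schema.xml becomes a literal & in Java source.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr: applies the schema.xml pattern
// to a plain Java string so the regex can be checked in isolation.
class NameRegexCheck {
    // Same pattern as in the fieldtype, with &amp; un-escaped to &.
    static final Pattern PATTERN =
            Pattern.compile("(.*?)(\\b(\\w) (&|and) (\\w))(.*?)");
    static final String REPLACEMENT = "$1$3&$5$6";

    static String normalize(String input) {
        return PATTERN.matcher(input).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        String[] samples = {"A & B Company", "A and B Company", "A&B Company"};
        for (String s : samples) {
            System.out.println(s + " -> " + normalize(s));
        }
    }
}
```

All three sample inputs should come out as "A&B Company" when the whole field value is handed to the pattern in one piece.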

However, when I import a document with "A & B Company" I can never locate it with an "A & B" query.  It can be located with an "A&B" query.  When I run analysis.jsp it works properly and matches using any of the combinations.

So from this I concluded that it was being indexed properly, but for some reason the query wasn't applying the regex properly.  I hooked up a debugger and could see a difference in how the analyzer was applying the charFilter and how the query was applying the charFilter.  When the analyzer invoked PatternReplaceCharFilterFactory.create(CharStream) the entire field was provided in a single call.  When the query invoked PatternReplaceCharFilterFactory.create(CharStream) it invoked it 3 times with 3 separate tokens (A, &, B).  Because of this the regex won't ever locate the full string in the field.
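The difference can be reproduced outside Solr with a small sketch like the one below. It is not the Solr code path: PerClauseDemo and its methods are hypothetical, and splitting on whitespace is just an assumption used to imitate the three separate create() calls seen in the debugger.

```java
import java.util.regex.Pattern;

// Hypothetical demo, not Solr code: compares applying the pattern to the
// whole value against applying it to each whitespace-separated piece.
class PerClauseDemo {
    static final Pattern PATTERN =
            Pattern.compile("(.*?)(\\b(\\w) (&|and) (\\w))(.*?)");

    // Whole value in a single call, as seen at index time and in analysis.jsp.
    static String normalize(String input) {
        return PATTERN.matcher(input).replaceAll("$1$3&$5$6");
    }

    // One call per piece, imitating what the debugger showed at query time.
    static String normalizeEachPiece(String input) {
        StringBuilder out = new StringBuilder();
        for (String piece : input.split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(normalize(piece));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("single call: " + normalize("A & B"));
        System.out.println("per piece:   " + normalizeEachPiece("A & B"));
    }
}
```

With the whole value the pattern sees "A & B" and can collapse it; with one call per piece none of "A", "&", or "B" contains the full pattern, so nothing is rewritten.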

I'm using the following encoded URL to perform the query.  
This works
http://localhost:8983/solr/select?q=name:%28a%26b%29

But this doesn't
http://localhost:8983/solr/select?q=name:%28a+%26+b%29
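To make the two q values easier to compare, here they are decoded with a small Java sketch (DecodeQueryParam is a made-up name; only java.net.URLDecoder from the JDK is used):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper: shows the raw query strings behind the two URLs.
class DecodeQueryParam {
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a standard JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("name:%28a%26b%29"));   // the query that works
        System.out.println(decode("name:%28a+%26+b%29")); // the query that doesn't
    }
}
```

The first decodes to name:(a&b) and the second to name:(a & b), so the only difference is the whitespace around the ampersand.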

Why is the query parser tokenizing the name field prior to the charFilter getting a chance to perform processing?

Re: Field getting tokenized prior to charFilter on select query

Posted by Chris Hostetter <ho...@fucit.org>.
You are seeing the effects of the default QueryParser.

Whitespace (like '+', '-', '"', '*', etc...) is a "special character" to the 
Lucene QueryParser.  Unescaped, unquoted whitespace tells the query parser 
to construct a BooleanQuery containing multiple clauses -- each clause is 
analyzed separately.

To have the entire input passed to the Analyzer as a single string, you 
would either quote it, or use a different QParser such as the "field" 
QParser...

http://wiki.apache.org/solr/SolrQuerySyntax#Other_built-in_useful_query_parsers
http://lucene.apache.org/solr/api/org/apache/solr/search/FieldQParserPlugin.html
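As a rough sketch of those two options, the Java snippet below builds the encoded URLs. The class and method names (WholeInputQueryUrls, selectUrl) are hypothetical, the host and handler are taken from the URLs earlier in the thread, and the {!field f=name} local-params form is the syntax described on the wiki page above; none of this has been run against the versions mentioned here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds select URLs that hand "a & b" to the
// analyzer as one string instead of three separate clauses.
class WholeInputQueryUrls {
    // Base URL assumed from the examples earlier in this thread.
    static final String BASE = "http://localhost:8983/solr/select?q=";

    static String selectUrl(String rawQuery) {
        try {
            return BASE + URLEncoder.encode(rawQuery, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a standard JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Option 1: quote the input so the default parser treats it as one phrase.
        System.out.println(selectUrl("name:\"a & b\""));
        // Option 2: use the "field" QParser via local params.
        System.out.println(selectUrl("{!field f=name}a & b"));
    }
}
```

Either way the ampersand must stay URL-encoded as %26, otherwise it is read as a request parameter separator before the query parser ever sees it.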

: entire field was provided in a single call.  When the query invoked 
: PatternReplaceCharFilterFactory.create(CharStream) it invoked it 3 times 
: with 3 seperate tokens (A, &, B).  Because of this the regex won't ever 
: locate the full string in the field.

-Hoss

--
http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss      ...  Stump The Chump!