Posted to solr-user@lucene.apache.org by Shamik Bandopadhyay <sh...@gmail.com> on 2014/12/11 01:56:25 UTC

Has anyone used Automatic Phrase Tokenization (AutoPhrasingTokenFilterFactory)?

Hi,

  I'm trying to use AutoPhrasingTokenFilterFactory, which looks like a
great solution to our phrase query issues. But it doesn't seem to work as
described in the blog:

https://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/

The filter is working as expected at index time, where it preserves
the phrases listed in the text file as single tokens. Here's my
field definition:

<fieldType name="text_autophrase" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory"
            phrases="autophrases.txt" includeTokens="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true" />
    <filter class="solr.KStemFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.KStemFilterFactory" />
  </analyzer>
</fieldType>

When I run the analysis, I can see that the phrase "seat cushions" (defined
in autophrases.txt) is indexed as "seat", "seat cushions" and "cushion".
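
To make the expected behaviour concrete, here is a rough Python sketch of
what auto-phrasing with includeTokens="true" does to a token stream. It is
only an illustration under simplifying assumptions, not the filter's actual
code: the function name autophrase is made up, token positions are ignored,
and no stop word, synonym, or stemming step is applied (so "cushions" is
not reduced to "cushion" here).

```python
def autophrase(tokens, phrases, include_tokens=True):
    """Emit a single phrase token for any run of tokens listed in `phrases`.

    Hypothetical illustration only; it does not reproduce the real filter.
    """
    # Store each multi-word phrase as a tuple of lowercased words.
    phrase_tuples = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in phrase_tuples), default=1)

    out = []
    i = 0
    while i < len(tokens):
        match = None
        # Prefer the longest phrase (two or more words) starting at position i.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in phrase_tuples:
                match = candidate
                break
        if match is None:
            out.append(tokens[i].lower())
            i += 1
            continue
        phrase_token = " ".join(match)
        if include_tokens:
            # Keep the individual words alongside the phrase token.
            out.append(match[0])
            out.append(phrase_token)
            out.extend(match[1:])
        else:
            out.append(phrase_token)
        i += len(match)
    return out


tokens = ["blue", "seat", "cushions"]
print(autophrase(tokens, ["seat cushions"]))
print(autophrase(tokens, ["seat cushions"], include_tokens=False))
```

With include_tokens=False only the phrase token is kept, which is roughly
what the query side would need to produce for the phrase to match as a
single term.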

The problem is at query time. According to the blog, the request handler
needs to use a custom query parser to get the same behaviour. Here's my
entry in solrconfig.xml:

<requestHandler name="/autophrase" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- VelocityResponseWriter settings -->
    <str name="wt">velocity</str>
    <str name="v.template">browse</str>
    <str name="v.layout">layout</str>
    <str name="title">Solritas</str>

    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">text</str>
    <str name="defType">autophrasingParser</str>
  </lst>
</requestHandler>

<queryParser name="autophrasingParser"
             class="com.lucidworks.analysis.AutoPhrasingQParserPlugin">
  <str name="phrases">autophrases.txt</str>
</queryParser>

But if I query "seat cushions" using this request handler, it seems to
treat the query as two separate terms and returns all results
matching "seat" and "cushion". I'm not sure what I'm missing here. I'm using
Solr 4.10.
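
A request along the following lines could be used to see how the parser
rewrote the query, since debugQuery=true makes Solr include the parsed query
in the response. This is only a sketch: the host, port, and core name
(localhost:8983, collection1) are assumptions, and wt=json is passed to
override the Velocity default set in the handler above.

```python
from urllib.parse import urlencode

# Assumed location of the Solr core; adjust to the actual installation.
base_url = "http://localhost:8983/solr/collection1/autophrase"

params = {
    "q": "seat cushions",   # unquoted, as a user would type it
    "debugQuery": "true",   # ask Solr to report the parsed query
    "wt": "json",           # override the handler's Velocity default
}

url = base_url + "?" + urlencode(params)
print(url)
```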

My other question is whether
"com.lucidworks.analysis.AutoPhrasingQParserPlugin" supports the features
of edismax, which is my default parser.

I'd appreciate any feedback.

-Thanks
Shamik