Posted to solr-user@lucene.apache.org by Mirko <id...@googlemail.com> on 2013/11/21 11:30:59 UTC

Parse eDisMax queries for keywords

Hi,
We would like to implement special handling for queries that contain
certain keywords. Our particular use case:

In the example query "Footitle season 1" we want to detect the keyword
"season", extract the number that follows it, and boost (or filter for)
documents that match "1" on the field name="season".

We have two fields in our schema:

<!-- "title" contains titles -->
<field name="title" type="text" indexed="true" stored="true"
       multiValued="false"/>

<fieldType name="text" class="solr.TextField" omitNorms="true">
    <analyzer>
        <charFilter class="solr.MappingCharFilterFactory"
                    mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- ... -->
    </analyzer>
</fieldType>

<!-- "season" contains season numbers -->
<field name="season" type="season_number" indexed="true" stored="false"
       multiValued="false"/>

<fieldType name="season_number" class="solr.TextField" omitNorms="true">
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
                pattern=".*(?:season) *0*([0-9]+).*" replacement="$1"/>
    </analyzer>
</fieldType>


Our idea was to use a Keyword tokenizer and a Regex on the "season" field
to extract the season number from the complete query.

However, we use an ExtendedDisMax query parser in our search handler:

<requestHandler name="/select" class="solr.SearchHandler">
        <lst name="defaults">
            <str name="defType">edismax</str>
            <str name="qf">
            title season
            </str>

        </lst>
</requestHandler>


The problem is that eDisMax tokenizes the query, so our "season" field
receives the separate tokens ["Footitle", "season", "1"], in no particular
order, instead of the complete query.

How can we pass the complete (untokenized) query to the season field? We
don't understand which tokenizer is used here or why our "season" field
receives tokens instead of the complete query.

Or is there another approach to solve this use case with Solr?

Thanks,
Mirko

Re: Parse eDisMax queries for keywords

Posted by Mirko <id...@googlemail.com>.

Hi Jack,
thanks for your reply. OK, in this case I agree that "enriching" the query
in the application layer is a good idea. We are still a bit puzzled about
what the enriched query should look like. I'll post here once we have found
a solution. If somebody has suggestions, I'd be happy to hear them.
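
One possible shape for the enriched request, as a rough sketch only: keep the
title text in q and pass the season either as a boost query (bq) or as a
filter query (fq). The function name build_params, the "mode" switch and the
boost value of 5 are made up for illustration; the handler defaults (defType,
qf) are assumed to stay as configured.

```python
from urllib.parse import urlencode


def build_params(title_text, season=None, mode="boost", boost=5):
    """Build Solr request parameters for a title search plus optional season.

    Hypothetical helper: 'mode' and 'boost' are illustrative, not Solr names.
    """
    params = {"q": title_text}
    if season is not None:
        # Only accept plain digits so nothing unexpected ends up in the query.
        if not str(season).isdigit():
            raise ValueError("season must be a non-negative integer")
        if mode == "filter":
            # fq restricts the result set to documents of that season.
            params["fq"] = f"season:{season}"
        else:
            # bq only raises the score of matching documents.
            params["bq"] = f"season:{season}^{boost}"
    return params


# Query string that could be appended to the /select handler URL.
print(urlencode(build_params("Footitle", "1")))
```

Whether boosting or filtering is the better fit depends on how strict the
season match should be; the boost factor would need tuning against real data.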

Mirko


2013/11/21 Jack Krupansky <ja...@basetechnology.com>

> The query parser does its own tokenization and parsing before your
> analyzer's tokenizer and filters are called, ensuring that only one
> whitespace-delimited token is analyzed at a time.
>
> You're probably best off having an application layer preprocessor for the
> query that "enriches" the query in the manner that you're describing.
>
> Or, simply settle for a "heuristic" approach that may give you 70% of what
> you want using only existing Solr features on the server side.
>
> -- Jack Krupansky

Re: Parse eDisMax queries for keywords

Posted by Jack Krupansky <ja...@basetechnology.com>.

The query parser does its own tokenization and parsing before your analyzer's
tokenizer and filters are called, ensuring that only one
whitespace-delimited token is analyzed at a time.
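
The effect can be imitated outside Solr. The sketch below is plain Python,
not Solr code: it assumes the parser splits on whitespace as described above,
and it reuses the pattern from the season_number field type (joined onto one
line). Java's regex engine may differ in edge cases.

```python
import re

# Pattern copied from the season_number fieldType in the original post.
SEASON_PATTERN = re.compile(r".*(?:season) *0*([0-9]+).*")


def analyze_season(text):
    """Imitate KeywordTokenizer + LowerCaseFilter + PatternReplaceFilter."""
    return SEASON_PATTERN.sub(r"\1", text.lower())


query = "Footitle season 1"

# What the schema was designed for: the whole query arrives as one token.
whole = analyze_season(query)

# What is described above: the query is split on whitespace first,
# and each piece goes through the analyzer on its own.
per_token = [analyze_season(tok) for tok in query.split()]

print(whole)      # the regex sees "season 1" together and extracts the number
print(per_token)  # no single piece contains both "season" and a number
```

Since no individual piece matches the pattern, each one passes through
unchanged apart from lowercasing, which is why the regex never gets a chance
to extract the season.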

You're probably best off having an application layer preprocessor for the 
query that "enriches" the query in the manner that you're describing.
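
A minimal sketch of such a preprocessor, assuming it runs in the application
before the request is sent to Solr. The function name split_season and the
exact matching rule (the word "season" followed by a number, leading zeros
dropped) are illustrative choices, not anything Solr provides.

```python
import re

# Hypothetical rule: the word "season" followed by a number.
SEASON_RE = re.compile(r"\bseason\s*0*(\d+)\b", re.IGNORECASE)


def split_season(user_query):
    """Return (remaining_text, season); season is a string or None."""
    match = SEASON_RE.search(user_query)
    if match is None:
        return user_query.strip(), None
    # Cut the "season N" part out and collapse the leftover whitespace.
    remaining = user_query[:match.start()] + " " + user_query[match.end():]
    return " ".join(remaining.split()), match.group(1)


title_text, season = split_season("Footitle season 1")
print(title_text, season)
```

The remaining text can then be sent as the main query, with the season number
passed separately, for example as a boost or a filter on the "season" field.
Removing the keyword from the title text is a design choice; leaving it in is
equally possible if titles themselves contain the word "season".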

Or, simply settle for a "heuristic" approach that may give you 70% of what 
you want using only existing Solr features on the server side.

-- Jack Krupansky
