Posted to solr-user@lucene.apache.org by jason rutherglen <ja...@yahoo.com> on 2006/08/26 03:02:05 UTC

Possible bug in copyField

When doing a copyField into a text field that is supposed to be stemmed I'm not seeing the stemming occur.  

These are the relevant lines of XML from the schema.xml:

  <fieldtype name="text" class="org.apache.solr.schema.TextField">
    <analyzer>
      <tokenizer class="org.apache.solr.analysis.StandardTokenizerFactory"/>
      <filter class="org.apache.solr.analysis.StandardFilterFactory"/>
      <filter class="org.apache.solr.analysis.LowerCaseFilterFactory"/>
      <filter class="org.apache.solr.analysis.StopFilterFactory"/>
    </analyzer>
  </fieldtype>
  <fieldtype name="text_stem" class="org.apache.solr.schema.TextField">
    <analyzer>
      <tokenizer class="org.apache.solr.analysis.StandardTokenizerFactory"/>
      <filter class="org.apache.solr.analysis.StandardFilterFactory"/>
      <filter class="org.apache.solr.analysis.LowerCaseFilterFactory"/>
      <filter class="org.apache.solr.analysis.StopFilterFactory"/>
      <filter class="org.apache.solr.analysis.EnglishPorterFilterFactory"/>
    </analyzer>
  </fieldtype>
</types>

<field name="title_stem"         type="text_stem"       indexed="true"   stored="true"/>

<copyField source="title" dest="title_stem"/>

Re: Possible bug in copyField

Posted by Yonik Seeley <yo...@apache.org>.

On 8/28/06, jason rutherglen <ja...@yahoo.com> wrote:
> Could someone point me to where in the Solr code the Analyzer is applied to a query parser field?

The lucene query parser normally does analysis.  It also does things
like making phrase queries from fields that return multiple tokens.

-Yonik

Re: Possible bug in copyField

Posted by jason rutherglen <ja...@yahoo.com>.

Yes, I am using the contrib/xml-query-parser and added the analyzer code to the TermQuery building sections.  It seems to work: when a term analyzes to multiple tokens, it automatically creates a SpanOrQuery for Span building.  I assume that putting a SpanOrQuery into a SpanNearQuery works.  Lucene is an amazing piece of work.

----- Original Message ----
From: Chris Hostetter <ho...@fucit.org>
To: solr-dev@lucene.apache.org; jason rutherglen <ja...@yahoo.com>
Sent: Monday, August 28, 2006 8:13:11 PM
Subject: Re: Possible bug in copyField

: It's coming from a custom XML query handler which is just a
: serialization of a Query from a client.  The Solr server should do the
: analysis as it has the schema.  This is all to support Span Queries.

Ah .. i see, so you've got a custom request handler getting input which is
an XML representation of a structured query (Booleans and Spans and what
not) and you're converting that into a tree of Query objects in Solr.

Are you using the contrib/xml-query-parser from the Lucene repository?  I
haven't looked at it in much detail since the initial design discussions,
but as i recall (and a quick glance at the test cases seems to confirm
this) when constructing your Parser, you can specify an Analyzer to use
and it hands it down to the individual QueryBuilders -- so you can use
request.getSchema().getQueryAnalyzer() to get what you need for the parser
(that Analyzer "does the right thing" for all of the fields and dynamic
fields defined in your schema).

if you are using your own XML syntax/parser you can still use that
analyzer when parsing the input (instead of walking the Query tree when
you are done and re-parsing then)

-Hoss

Re: Possible bug in copyField

Posted by Chris Hostetter <ho...@fucit.org>.

: It's coming from a custom XML query handler which is just a
: serialization of a Query from a client.  The Solr server should do the
: analysis as it has the schema.  This is all to support Span Queries.

Ah .. i see, so you've got a custom request handler getting input which is
an XML representation of a structured query (Booleans and Spans and what
not) and you're converting that into a tree of Query objects in Solr.

Are you using the contrib/xml-query-parser from the Lucene repository?  I
haven't looked at it in much detail since the initial design discussions,
but as i recall (and a quick glance at the test cases seems to confirm
this) when constructing your Parser, you can specify an Analyzer to use
and it hands it down to the individual QueryBuilders -- so you can use
request.getSchema().getQueryAnalyzer() to get what you need for the parser
(that Analyzer "does the right thing" for all of the fields and dynamic
fields defined in your schema).

if you are using your own XML syntax/parser you can still use that
analyzer when parsing the input (instead of walking the Query tree when
you are done and re-parsing then)




-Hoss

Re: Possible bug in copyField

Posted by jason rutherglen <ja...@yahoo.com>.

It's coming from a custom XML query handler which is just a serialization of a Query from a client.  The Solr server should do the analysis as it has the schema.  This is all to support Span Queries.  

----- Original Message ----
From: Chris Hostetter <ho...@fucit.org>
To: solr-dev@lucene.apache.org; jason rutherglen <ja...@yahoo.com>
Sent: Monday, August 28, 2006 6:29:29 PM
Subject: Re: Possible bug in copyField

: Thanks.  Yes I came up with a hacked solution to the problem.  Takes a
: Query and rewrites the Terms using the Analyzer.  If the Analyzer

typically the analysis happens before you construct a Query object --
where are these Queries coming from that they are already objects but
haven't been analyzed?

: returns more than one Token then those are ignored.  Made Term.text
: public.  Good enough for now, can be improved looks like QueryParser
: already does something like this, however I am confused what the rules
: are for adding multiple new Terms to a Query.

it depends on whether the tokens occupy the same position.  if multiple
sequential tokens are returned, it builds a phrase query, if multiple
parallel tokens are returned it builds BooleanQuery (using SHOULD) ... if
multiple tokens are returned and *some* of the tokens occupy the same
position, then a MultiPhraseQuery is constructed.

-Hoss

Re: Possible bug in copyField

Posted by Chris Hostetter <ho...@fucit.org>.

: Thanks.  Yes I came up with a hacked solution to the problem.  Takes a
: Query and rewrites the Terms using the Analyzer.  If the Analyzer

typically the analysis happens before you construct a Query object --
where are these Queries coming from that they are already objects but
haven't been analyzed?

: returns more than one Token then those are ignored.  Made Term.text
: public.  Good enough for now, can be improved looks like QueryParser
: already does something like this, however I am confused what the rules
: are for adding multiple new Terms to a Query.

it depends on whether the tokens occupy the same position.  if multiple
sequential tokens are returned, it builds a phrase query, if multiple
parallel tokens are returned it builds BooleanQuery (using SHOULD) ... if
multiple tokens are returned and *some* of the tokens occupy the same
position, then a MultiPhraseQuery is constructed.
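
The position rules above can be sketched in plain Python. This is only an illustration, not Lucene's QueryParser code: it assumes each analyzed token arrives as a (text, position increment) pair, where an increment of 0 means the token sits at the same position as the previous one. The function name query_shape and the returned labels are made up for this example.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, int]  # (term text, position increment); hypothetical shape


def query_shape(tokens: List[Token]) -> Optional[str]:
    """Name the kind of query the rules above would choose for these tokens."""
    if not tokens:
        return None  # the analyzer removed everything, e.g. a stop word
    if len(tokens) == 1:
        return "TermQuery"

    # Turn position increments into absolute positions.
    positions = []
    current = -1
    for _text, increment in tokens:
        current += increment
        positions.append(current)

    distinct = len(set(positions))
    if distinct == 1:
        return "BooleanQuery(SHOULD)"  # every token is parallel
    if distinct == len(tokens):
        return "PhraseQuery"           # every token is sequential
    return "MultiPhraseQuery"          # only some tokens share a position


print(query_shape([("wi", 1), ("fi", 1)]))
print(query_shape([("tv", 1), ("television", 0)]))
print(query_shape([("big", 1), ("tv", 1), ("television", 0)]))
```

The three calls cover the sequential, parallel, and mixed cases in that order.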


-Hoss

Re: Possible bug in copyField

Posted by jason rutherglen <ja...@yahoo.com>.

Thanks.  Yes, I came up with a hacked solution to the problem.  It takes a Query and rewrites the Terms using the Analyzer.  If the Analyzer returns more than one Token then those are ignored.  I made Term.text public.  Good enough for now, but it can be improved.  It looks like QueryParser already does something like this; however, I am confused about what the rules are for adding multiple new Terms to a Query.

----- Original Message ----
From: Erik Hatcher <er...@ehatchersolutions.com>
To: solr-dev@lucene.apache.org
Sent: Monday, August 28, 2006 6:02:46 PM
Subject: Re: Possible bug in copyField

On Aug 28, 2006, at 3:37 PM, jason rutherglen wrote:

> Could someone point me to where in the Solr code the Analyzer is  
> applied to a query parser field?

IndexSchema.java is where the analyzers are created for indexing and  
for query parsing.  It's fairly sophisticated in order to take into  
account all the various field settings from schema.xml.  Hope that  
helps.

Perhaps preaching to the choir... Be aware that changing an analyzer  
once documents are indexed does not change how they are indexed.   
They'll need to be re-added to pick up new analysis configuration.

    Erik

Re: Possible bug in copyField

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Aug 28, 2006, at 3:37 PM, jason rutherglen wrote:

> Could someone point me to where in the Solr code the Analyzer is  
> applied to a query parser field?

IndexSchema.java is where the analyzers are created for indexing and  
for query parsing.  It's fairly sophisticated in order to take into  
account all the various field settings from schema.xml.  Hope that  
helps.

Perhaps preaching to the choir... Be aware that changing an analyzer  
once documents are indexed does not change how they are indexed.   
They'll need to be re-added to pick up new analysis configuration.

	Erik

Re: Possible bug in copyField

Posted by jason rutherglen <ja...@yahoo.com>.

Could someone point me to where in the Solr code the Analyzer is applied to a query parser field?  

----- Original Message ----
From: Erik Hatcher <er...@ehatchersolutions.com>
To: solr-user@lucene.apache.org
Sent: Monday, August 28, 2006 11:13:25 AM
Subject: Re: Possible bug in copyField


On Aug 28, 2006, at 1:41 PM, jason rutherglen wrote:
> Ok... Looks like it's related to using SpanQueries (I hacked on the
> XML query code).  I remember a discussion about this issue.  Not  
> something Solr specifically supports so my apologies.  However if  
> anyone knows about this feel free to post something to the Lucene  
> User list.  I will probably manually analyze the terms of the span  
> query and create a stemmed span query.  Is that a good idea?

Well, query terms need to match how they were indexed :)   So it's a  
good idea in that respect.  Stemming is chock full of fun (or  
frustrating) issues like this and I don't have any easy advice, but  
certainly if you're stemming terms during indexing you'll need to  
stem them for queries.   Unless you index the original terms in a  
parallel field or in the same positions as the stemmed ones, where  
you can play with searching with or without stemming on the query side.

    Erik

Re: Possible bug in copyField

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Aug 28, 2006, at 1:41 PM, jason rutherglen wrote:
> Ok... Looks like it's related to using SpanQueries (I hacked on the
> XML query code).  I remember a discussion about this issue.  Not  
> something Solr specifically supports so my apologies.  However if  
> anyone knows about this feel free to post something to the Lucene  
> User list.  I will probably manually analyze the terms of the span  
> query and create a stemmed span query.  Is that a good idea?

Well, query terms need to match how they were indexed :)   So it's a  
good idea in that respect.  Stemming is chock full of fun (or  
frustrating) issues like this and I don't have any easy advice, but  
certainly if you're stemming terms during indexing you'll need to  
stem them for queries.   Unless you index the original terms in a  
parallel field or in the same positions as the stemmed ones, where  
you can play with searching with or without stemming on the query side.
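
A rough sketch of the "same positions" idea, as a toy inverted index in Python rather than Lucene. It assumes a hypothetical crude_stem helper (a naive suffix stripper, not the Porter stemmer) and indexes both the original and the stemmed term at each position, so the query side can choose whether to stem.

```python
from collections import defaultdict


def crude_stem(word):
    """Naive suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_with_originals(text):
    """Map each term to its positions, adding original and stem together."""
    postings = defaultdict(set)
    for position, raw in enumerate(text.split()):
        word = raw.lower()
        postings[word].add(position)              # original term
        postings[crude_stem(word)].add(position)  # stemmed term, same position
    return dict(postings)


def search(postings, word, stem=True):
    """Return sorted positions for a term, optionally stemming the query."""
    term = word.lower()
    if stem:
        term = crude_stem(term)
    return sorted(postings.get(term, ()))


postings = index_with_originals("Testing the tests tested")
print(search(postings, "tests", stem=True))   # stemmed lookup
print(search(postings, "tests", stem=False))  # exact lookup
```

With stemming on, the query term is reduced to the same stem that was indexed, so every tense matches; with stemming off, only the position of the exact original term is returned.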

	Erik

Re: Possible bug in copyField

Posted by jason rutherglen <ja...@yahoo.com>.

Ok... Looks like it's related to using SpanQueries (I hacked on the XML query code).  I remember a discussion about this issue.  Not something Solr specifically supports so my apologies.  However if anyone knows about this feel free to post something to the Lucene User list.  I will probably manually analyze the terms of the span query and create a stemmed span query.  Is that a good idea?

----- Original Message ----
From: Yonik Seeley <yo...@apache.org>
To: solr-user@lucene.apache.org
Cc: jason rutherglen <ja...@yahoo.com>
Sent: Monday, August 28, 2006 7:33:48 AM
Subject: Re: Possible bug in copyField

On 8/28/06, Chris Hostetter <ho...@fucit.org> wrote:
>
> : By looking at what is stored.  Has this worked for others?
>
> the "stored" value of a field is always going to be the pre-analyzed text
> -- that's why the stored values in your "text" fields still have upper
> case characters and stop words.

And since the stored values will always be the same, it normally
doesn't make sense to store the targets of copyField if the sources
are also stored.

You can test if stemming was done by searching for a different tense
of a word in the field.

-Yonik

Re: Possible bug in copyField

Posted by Yonik Seeley <yo...@apache.org>.

On 8/28/06, Chris Hostetter <ho...@fucit.org> wrote:
>
> : By looking at what is stored.  Has this worked for others?
>
> the "stored" value of a field is always going to be the pre-analyzed text
> -- that's why the stored values in your "text" fields still have upper
> case characters and stop words.

And since the stored values will always be the same, it normally
doesn't make sense to store the targets of copyField if the sources
are also stored.

You can test if stemming was done by searching for a different tense
of a word in the field.

-Yonik

Re: Possible bug in copyField

Posted by Chris Hostetter <ho...@fucit.org>.

: By looking at what is stored.  Has this worked for others?

the "stored" value of a field is always going to be the pre-analyzed text
-- that's why the stored values in your "text" fields still have upper
case characters and stop words.

what matters is whether or not the "indexed" terms of your "text_stem"
fields are really stemmed or not.

I certainly haven't noticed this problem ... using the fields/types you
mentioned before, do you have an example of a doc you've indexed, and
expected to get from a stemmed query that wasn't actually returned?
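
To illustrate the stored/indexed distinction, here is a toy model in Python. It is not Solr code: the stop-word list, the crude_stem helper (a naive suffix stripper, not the Porter stemmer) and the ToyField class are all assumptions made up for this sketch. The point is only that the stored value is the raw input, while matching happens against the analyzed terms.

```python
STOP_WORDS = {"the", "a", "an", "of", "and"}  # illustrative list only


def crude_stem(word):
    """Naive suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, drop stop words, then stem each remaining word."""
    terms = []
    for raw in text.split():
        word = raw.lower()
        if word in STOP_WORDS:
            continue
        terms.append(crude_stem(word))
    return terms


class ToyField:
    """Keeps the raw text as 'stored' and the analyzed terms as 'indexed'."""

    def __init__(self, text):
        self.stored = text                 # returned verbatim in results
        self.indexed = set(analyze(text))  # what searches actually match

    def matches(self, query_text):
        terms = analyze(query_text)
        return bool(terms) and all(t in self.indexed for t in terms)


title_stem = ToyField("The Jumping Dogs")
print(title_stem.stored)             # still has upper case and the stop word
print(sorted(title_stem.indexed))    # analyzed terms
print(title_stem.matches("jumped"))  # a different tense of the same word
```

Looking at the stored value says nothing about stemming; searching for a different tense, as in the last line, is what exercises the indexed terms.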




-Hoss

Re: Possible bug in copyField

Posted by jason rutherglen <ja...@yahoo.com>.

By looking at what is stored.  Has this worked for others?

----- Original Message ----
From: Yonik Seeley <yo...@apache.org>
To: solr-user@lucene.apache.org; jason rutherglen <ja...@yahoo.com>
Sent: Friday, August 25, 2006 6:35:43 PM
Subject: Re: Possible bug in copyField

On 8/25/06, jason rutherglen <ja...@yahoo.com> wrote:
> When doing a copyField into a text field that is supposed to be stemmed I'm not seeing the stemming occur.

How did you determine that stemming didn't occur?

-Yonik

Re: Possible bug in copyField

Posted by Yonik Seeley <yo...@apache.org>.

On 8/25/06, jason rutherglen <ja...@yahoo.com> wrote:
> When doing a copyField into a text field that is supposed to be stemmed I'm not seeing the stemming occur.

How did you determine that stemming didn't occur?

-Yonik