Posted to solr-user@lucene.apache.org by Martin Wunderlich <ma...@gmx.net> on 2015/03/25 20:22:37 UTC

Applying Tokenizers and Filters to CopyFields

Hi all, 

I am wondering what the process is for applying Tokenizers and Filters (as defined in the FieldType definition) to field contents that result from CopyFields. To be more specific, in my Solr instance, I would like to support query expansion by two means: removing stop words and adding inflected word forms as synonyms. 

To use a specific example, let's say I have the following sentence to be indexed (from a Wittgenstein manuscript): 

"Was zum Wesen der Welt gehört, kann die Sprache nicht ausdrücken."


This sentence will be indexed in a field called "original" that is defined as follows: 

<field name="original" type="text_windex_original" indexed="true" stored="true" required="true"/>

    <fieldType name="text_windex_original" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
    </fieldType>


Then, in order to create fields for the two types of query expansion, I have set up specific fields for this: 

- one field where stopwords are removed both from the indexed content and from the query. So, if the user is searching for a phrase like "der Sprache", Solr should still find the segment above, because the determiners ("der" and "die") are removed prior to indexing and prior to querying, respectively. This field is defined as follows: 

<field name="stopwords_removed" type="text_stopwords_removed" indexed="true" stored="true" required="true"/>

    <fieldType name="text_stopwords_removed" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>


- a second field where synonyms are added to the query so that more segments will be found. For instance, if the user is searching for the plural form "Sprachen", Solr should return the segment above, due to this entry in the synonyms file: "Sprache,Sprach,Sprachen". This field is defined as follows: 

<field name="expanded" type="text_expanded" indexed="true" stored="true" required="true"/>

    <fieldType name="text_expanded" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Finally, to avoid having to specify three fields with identical content in the import documents, I am defining the two fields for query expansion as copyField destinations: 

  <copyField source="original" dest="stopwords_removed"/>
  <copyField source="original" dest="expanded"/>

Now, my expectation would be as follows: 
- during import, two temporary fields are created by copying content from the original field
- these two temporary fields are then pre-processed as per the definitions above
- the pre-processed version of the text is added to the index
- then, the user can search for "Sprache", "sprache", "Sprachen" or "der Sprache" and will always get the segment above as a matching result. 

However, what actually happens is that I get matches only for "Sprache" and "sprache". 

The other thing that strikes me as odd is that when I restrict the search to one of the fields using the "fq" parameter, I get no results. For instance: 
http://localhost:8983/solr/windex/select?q=Sprache&fq=original&wt=json&indent=true

will return no matches. I would have expected that, using the fq parameter, the user can specify what type of search (s)he would like to carry out: a standard search (field original) or an expanded search (one of the other two fields). 

For debugging, I have checked the analysis and the results seem OK (posted below). 
Apologies for the long post, but I am really a bit stuck here (even after doing a lot of reading and googling). It is probably something simple that I am missing. 
Thanks a lot in advance for any help. 

Cheers, 

Martin
 

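Analysis output (ST = StandardTokenizer, SF = StopFilter, LCF = LowerCaseFilter; "-" marks a position where a token was removed):

ST   Was  zum  Wesen  der  Welt  gehört  kann  die  Sprache  nicht  ausdrücken
SF   Was  zum  Wesen  -    Welt  gehört  kann  die  Sprache  nicht  ausdrücken
LCF  was  zum  wesen  -    welt  gehört  kann  die  sprache  nicht  ausdrücken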

Re: Applying Tokenizers and Filters to CopyFields

Posted by Martin Wunderlich <ma...@gmx.net>.

Thanks a lot, Ahmet. I've just read up on the query fields (qf) parameter and it sounds good. Since the field contents are currently all identical, I can't really test it yet. 
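
To make the point about fq concrete: fq is a second query whose matches are intersected with those of q, so fq=original is not a field selector but a search for the term "original" in the default field. The toy model below is only a sketch of that behaviour; the function search, the whitespace tokenization and the single sample document are assumptions for illustration and have nothing to do with Solr's actual implementation.

```python
def search(docs, q, fq=None, df="original"):
    """Return ids of docs matching q, intersected with fq when given."""

    def match(clause):
        # "field:term" searches that field; a bare term searches the default field
        field, _, term = clause.rpartition(":")
        field = field or df
        return {
            doc_id
            for doc_id, doc in docs.items()
            if term in doc.get(field, "").split()
        }

    hits = match(q)
    if fq is not None:
        hits &= match(fq)  # a filter query can only narrow the result set
    return hits


docs = {1: {"original": "kann die Sprache nicht ausdrücken"}}

print(search(docs, "Sprache"))                         # {1}
print(search(docs, "Sprache", fq="original"))          # set()
print(search(docs, "Sprache", fq="original:Sprache"))  # {1}
```

The second call comes back empty because no document contains the word "original", which is consistent with the empty result reported for the request with fq=original.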

Cheers, 

Martin
 



> On 25.03.2015 at 21:27, Ahmet Arslan <io...@yahoo.com.INVALID> wrote:
> 
> Hi Martin,
> 
> fq means filter query. Maybe you want to use the qf (query fields) parameter of edismax?

Re: Applying Tokenizers and Filters to CopyFields

Posted by Erick Erickson <er...@gmail.com>.

Glad it worked out...

Looking back, I can't believe I didn't mention adding &debug=query to
the URL. That would have shown you exactly what the parsed query
looked like and you'd have seen right off that it wasn't searching
against the field you thought it was. It's one of the first things I
do when queries don't return what I expect.
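
A minimal sketch of such a request, assuming the same host, core name (windex) and wt/indent parameters as the URL earlier in the thread. The helper name debug_url is made up for illustration; the code only builds the URL and does not contact Solr.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/windex/select"


def debug_url(query, **extra):
    """Build a select URL that asks Solr to include the parsed query in the response."""
    params = {"q": query, "debug": "query", "wt": "json", "indent": "true"}
    params.update(extra)
    return BASE + "?" + urlencode(params)


url = debug_url("Sprachen")
print(url)
# http://localhost:8983/solr/windex/select?q=Sprachen&debug=query&wt=json&indent=true
```

The debug section of the response should then show which field each query term was actually searched against.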

Glad it's working for you!
Erick


Re: Applying Tokenizers and Filters to CopyFields

Posted by Michael Della Bitta <mi...@appinions.com>.

Glad you are sorted out!

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


Re: Applying Tokenizers and Filters to CopyFields

Posted by Martin Wunderlich <ma...@gmx.net>.

Thanks so much, Erick and Michael, for all the additional explanation. 
The crucial piece of information turned out to be the one about the default search field ("df"). In solrconfig.xml this parameter was set to point to the field with the original text, which is why the expanded queries didn't work. When I set the df parameter to one of the fields with the expanded text, the search works fine. I have also removed the copyField declarations. 

It’s all working as expected now. Thanks again for the help. 
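
Why the default field matters can be sketched with a toy re-implementation of the three analysis chains. Everything here is an assumption for illustration: the three-word stop list stands in for stopwords_de.txt, the regex tokenizer stands in for StandardTokenizer, and a document counts as a match when any query term is found in it. Only the synonym entry and the field names are taken from the thread.

```python
import re

STOPWORDS = {"der", "die", "das"}                # assumed stand-in for stopwords_de.txt
SYNONYMS = [{"sprache", "sprach", "sprachen"}]   # the synonyms entry quoted in the thread


def analyze(text, stop=False, synonyms=False, lowercase=False):
    """Apply the filters in the same order as the fieldType definitions."""
    tokens = re.findall(r"\w+", text)
    if stop:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if synonyms:
        expanded = []
        for t in tokens:
            group = next((g for g in SYNONYMS if t.lower() in g), None)
            expanded.extend(sorted(group) if group else [t])
        tokens = expanded
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return tokens


FIELDS = {
    "original": {"index": {}, "query": {}},
    "stopwords_removed": {
        "index": dict(stop=True, lowercase=True),
        "query": dict(stop=True, lowercase=True),
    },
    "expanded": {
        "index": dict(stop=True, lowercase=True),
        "query": dict(stop=True, synonyms=True, lowercase=True),
    },
}

SENTENCE = "Was zum Wesen der Welt gehört, kann die Sprache nicht ausdrücken."


def matches(field, query):
    """True if any analyzed query term occurs among the field's indexed terms."""
    indexed = set(analyze(SENTENCE, **FIELDS[field]["index"]))
    return any(t in indexed for t in analyze(query, **FIELDS[field]["query"]))


for field in FIELDS:
    print(field, matches(field, "Sprachen"))
```

In this sketch only the expanded field matches "Sprachen", so a request whose default field is original never benefits from the synonym expansion, no matter how the other fields are configured.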

Cheers, 

Martin
 




Re: Applying Tokenizers and Filters to CopyFields

Posted by Erick Erickson <er...@gmail.com>.

Martin:
Perhaps this would help

indexed=true, stored=true
The field can be searched. The raw input (not analyzed in any way) can be
shown to the user in the results list.

indexed=true, stored=false
The field can be searched. However, the field can't be returned in the
results list with the document.

indexed=false, stored=true
The field cannot be searched, but the contents can be returned in the
results list with the document. There are some use-cases where this is
desirable behavior.

indexed=false, stored=false
The entire field is thrown out, it's just as if you didn't send the
field to be indexed at all.
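
The four combinations can be sketched with a toy index. This is only an illustration of the idea, not of how Lucene stores anything: the class TinyIndex, the field names and the lowercase-and-split "analysis" are all made up for the example.

```python
class TinyIndex:
    """Toy model: analyzed tokens go to the postings, raw values to the store."""

    def __init__(self, schema):
        self.schema = schema   # field name -> (indexed, stored)
        self.postings = {}     # (field, token) -> set of doc ids
        self.stored = {}       # doc id -> {field: raw value}

    def add(self, doc_id, doc):
        for name, value in doc.items():
            indexed, stored = self.schema[name]
            if indexed:
                # analysis affects only what becomes searchable
                for token in value.lower().split():
                    self.postings.setdefault((name, token), set()).add(doc_id)
            if stored:
                # the stored value is kept verbatim
                self.stored.setdefault(doc_id, {})[name] = value

    def search(self, field, term):
        return self.postings.get((field, term.lower()), set())

    def fetch(self, doc_id):
        return self.stored.get(doc_id, {})


schema = {
    "both": (True, True),
    "search_only": (True, False),
    "display_only": (False, True),
    "dropped": (False, False),
}
index = TinyIndex(schema)
index.add(1, {name: "Die Sprache" for name in schema})

print(index.search("search_only", "sprache"))   # {1}
print(index.search("display_only", "sprache"))  # set()
print(index.fetch(1))  # {'both': 'Die Sprache', 'display_only': 'Die Sprache'}
```

The lower-case tokens live only in the postings, which is why a field with stored=false can still be searched through its analyzed form.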

And one other thing: copyField gets the _raw_ data, not the
analyzed data. Let's say you have two fields, "src" and "dst".
Copying from src to dst in schema.xml is identical to sending
<add>
  <doc>
    <field name="src">original text</field>
    <field name="dst">original text</field>
  </doc>
</add>

Note also that copyField directives are not chained: the destination
of one copyField cannot serve as the source of another.
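
Both properties can be sketched in a few lines. The function apply_copy_fields and the extra field dst2 are hypothetical; real Solr also handles multiValued destinations, wildcards and maxChars, none of which is modelled here.

```python
def apply_copy_fields(doc, copy_rules):
    """Copy raw values from the incoming document; copies never feed other copies."""
    result = dict(doc)
    for source, dest in copy_rules:
        # Sources are read from the incoming doc only, not from earlier copies.
        if source in doc:
            result[dest] = doc[source]
    return result


incoming = {"src": "Original Text"}
rules = [("src", "dst"), ("dst", "dst2")]  # the second rule tries to chain

out = apply_copy_fields(incoming, rules)
print(out)  # {'src': 'Original Text', 'dst': 'Original Text'}
```

The destination receives the untouched input string, and dst2 stays empty because dst was not part of the incoming document. Each field's own analyzer is applied afterwards, independently of the others.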

Also, watch out for your query syntax. Michael's comments are spot-on,
I'd just add this:

http://localhost:8983/solr/windex/select?q=Sprache&fq=original&wt=json&indent=true

is kind of odd. Let's assume you mean "qf" rather than "fq". That
_only_ matters if your query parser is "edismax"; I believe it will be
ignored in this case.

You'd want something like
q=src:Sprache
or
q=dst:Sprache
or even
http://localhost:8983/solr/windex/select?q=Sprache&df=src
http://localhost:8983/solr/windex/select?q=Sprache&df=dst

where "df" is "default field" and the search is applied against that
field in the absence of a field qualification like my first two
examples.
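
The variants discussed here can be written out as full request URLs. This is only a sketch: it substitutes the field names from the original post (expanded, stopwords_removed) for src and dst, the helper select_url is made up, and the edismax line assumes the standard defType and qf parameters.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/windex/select"


def select_url(**params):
    # safe=":" keeps field-qualified queries such as expanded:Sprachen readable
    return BASE + "?" + urlencode(params, safe=":")


fielded = select_url(q="expanded:Sprachen")
default_field = select_url(q="Sprachen", df="expanded")
edismax = select_url(q="Sprachen", defType="edismax", qf="stopwords_removed expanded")

print(fielded)        # ...select?q=expanded:Sprachen
print(default_field)  # ...select?q=Sprachen&df=expanded
print(edismax)        # ...select?q=Sprachen&defType=edismax&qf=stopwords_removed+expanded
```

The first two forms search a single field; the edismax form spreads one user query over several fields at once.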

Best,
Erick


Re: Applying Tokenizers and Filters to CopyFields

Posted by Michael Della Bitta <mi...@appinions.com>.

I agree the terminology is possibly a little confusing.

Stored refers to values that are stored verbatim. You can retrieve them
verbatim. Analysis does not affect stored values.
Indexed values are tokenized/transformed and stored inverted. You can't
recover the literal analyzed version (at least, not easily).

If what you really want is to store and retrieve case-folded versions of
your data as well as the original, you need to use something like an
UpdateRequestProcessor, which I personally am less familiar with.


Re: Applying Tokenizers and Filters to CopyFields

Posted by Martin Wunderlich <ma...@gmx.net>.

Thanks a lot, Michael. See replies below. 


> On 25.03.2015 at 21:41, Michael Della Bitta <mi...@appinions.com> wrote:
> 
> Two other things I noticed:
> 
> 1. You probably don't want to store your copyFields. That's literally going
> to be the same information each time.

OK, got it. I have set the targets of the copy fields to stored="false". 

> 
> 2. Your expectation "the pre-processed version of the text is added to the
> index" may be incorrect. Anything done in <analyzer type="query"> sections
> actually happens at query time. Not sure if that's significant for you.

I was actually referring to what is happening at index time. So, the pre-processing steps are applied under <analyzer type="index">. And this point is not quite clear to me: assuming that I have a simple case-folding step applied to the target of the copyField, how or where are the lower-case tokens stored, if the text isn't added to the index? How is the query supposed to retrieve the lower-case version? 
(Sorry if this sounds like a naive question, but I have a feeling that I am missing something really basic here.) 

Cheers, 

Martin
 


Re: Applying Tokenizers and Filters to CopyFields

Posted by Michael Della Bitta <mi...@appinions.com>.

Two other things I noticed:

1. You probably don't want to store your copyFields. That's literally going
to be the same information each time.

2. Your expectation "the pre-processed version of the text is added to the
index" may be incorrect. Anything done in <analyzer type="query"> sections
actually happens at query time. Not sure if that's significant for you.
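
A minimal schema sketch of both points, using the field names from the original post. It assumes that each field's type attribute matches the name of its fieldType definition (the snippets in the original post do not line up on this), and it leaves out "required" on the copyField targets; treat it as an illustration rather than a tested configuration.

```xml
<!-- Source field: stored, so the original text can be returned in results. -->
<field name="original" type="text_original" indexed="true" stored="true" required="true"/>

<!-- copyField targets: indexed so they can be searched, but not stored,
     because the stored value would only duplicate "original". -->
<field name="stopwords_removed" type="text_stopwords_removed" indexed="true" stored="false"/>
<field name="expanded" type="text_expanded" indexed="true" stored="false"/>

<copyField source="original" dest="stopwords_removed"/>
<copyField source="original" dest="expanded"/>

<fieldType name="text_stopwords_removed" class="solr.TextField" positionIncrementGap="100">
  <!-- Applied to the copied text when a document is indexed. The resulting
       tokens are what gets written to the index for this field. -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied to the query string at search time. Nothing in this section
       changes what is already in the index. -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that stored="false" only affects whether the raw field value can be returned in results. The analyzed tokens of an indexed field are searchable either way.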


Michael Della Bitta



Re: Applying Tokenizers and Filters to CopyFields

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi Martin,

fq means filter query. Maybe you want to use the qf (query fields) parameter of edismax?
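
To illustrate, here is a rough solrconfig.xml sketch, assuming the stock /select handler and the field names from the original post. It makes edismax the default parser and searches the "original" field unless the request says otherwise.

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Use the extended DisMax parser so that qf is honoured. -->
    <str name="defType">edismax</str>
    <!-- Default field(s) to search; a request can override this,
         for example with qf=expanded or qf=stopwords_removed. -->
    <str name="qf">original</str>
  </lst>
</requestHandler>
```

The same can be done per request without touching the config, by adding defType=edismax&qf=expanded to the URL. A filter query, by contrast, takes a full query and not a field name: fq=original is parsed as a search for the term "original" in the default field, which is likely why no documents match. Something like fq=original:Sprache would be the filter-query form.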


