Posted to solr-user@lucene.apache.org by matthew sporleder <ms...@gmail.com> on 2020/04/24 23:48:24 UTC

stored=true what should I see from stem fields

Is what is shown in "analysis" the same as what is stored in a field?

I am confusing myself pretty thoroughly:

I have some fields:
  <fieldType name="string_raw" class="solr.TextField"
             sortMissingLast="true" omitNorms="true">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="stems" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="everything" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="stuff_raw" type="string_raw" indexed="true" stored="true"
         multiValued="false"/>
  <field name="stuff_stems" type="stems" indexed="true" stored="true"
         multiValued="false"/>
  <field name="stuff_everything" type="everything" indexed="true"
         stored="true" multiValued="true"/>


And I have this:
 <copyField source="stuff_raw" dest="stuff_everything"/>
 <copyField source="stuff_raw" dest="stuff_stems"/>
 <copyField source="stuff_stems" dest="stuff_everything"/>


I run this through the analyzer for stuff_stems:
"the quick brown fox jumped over the sleeping dog"

It prints out a bunch of stuff but the last thing it says is:
"quick brown fox jump over sleep dog"

So far so good.

So I indexed a document with "the quick brown fox jumped over the
sleeping dog" set for stuff_raw and when I query for the document
stuff_stems just has "the quick brown fox jumped over the sleeping
dog" and NOT "quick brown fox jump over sleep dog"

Also stuff_everything only contains a single item, which is weird
because I copy two things into it.

In fact here is everything:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":0,
    "params":{
      "q":"*:*",
      "wt":"json"}},
  "response":{"numFound":2,"start":0,"maxScore":1.0,"docs":[
      {
        "id":1,
        "stuff_raw":"the quick brown fox jumped over the sleeping dog",
        "stuff_stems":"the quick brown fox jumped over the sleeping dog",
        "stuff_everything":["the quick brown fox jumped over the sleeping dog"],
        "_version_":1664899022194737152,
        "timestamp":"2020-04-24T23:37:16.877Z",
        "score":1.0},
      {
        "id":2,
        "stuff_raw":"jumped jumping jumper",
        "stuff_stems":"jumped jumping jumper",
        "stuff_everything":["jumped jumping jumper"],
        "_version_":1664899046865633280,
        "timestamp":"2020-04-24T23:37:40.404Z",
        "score":1.0}]
  }}

Re: stored=true what should I see from stem fields

Posted by matthew sporleder <ms...@gmail.com>.
I was just doing that to troubleshoot/discover.  I knew that you
couldn't copy-to-copy but, apparently, needed to be reminded.

My end goal (which I don't think I can achieve?) was to get my
everything field to contain something like:
everything: [ 'the quick brown fox jumped over the sleeping dog',
                    'quick brown fox jump over sleep dog']

So that a single/simple query would match that doc for q=dog or q=jump
or q=sleeping and would score extra high for "the dog jump", but I
guess I will need to change the query logic to search on both fields.
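
One possible way to search on both fields without rewriting every client
query is to let the edismax query parser spread the query across them
with qf. The fragment below is only a sketch: the handler name /select
and the choice to put these values in solrconfig.xml defaults are
assumptions about this setup, and no boosts have been tuned.

```xml
<!-- solrconfig.xml (sketch; handler name and defaults are assumptions) -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- use the extended dismax parser so qf is honored -->
    <str name="defType">edismax</str>
    <!-- search the unstemmed and the stemmed field together -->
    <str name="qf">stuff_everything stuff_stems</str>
  </lst>
</requestHandler>
```

With something like this in place, q=jump could match through
stuff_stems and q=sleeping through either field, provided the query text
is analyzed with the same chain that was used at index time.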

On Sat, Apr 25, 2020 at 8:16 AM Erick Erickson <er...@gmail.com> wrote:
>
> One other bit:
>
> There’s rarely a reason to, and multiple reasons _not_ to set stored=true for
> the _destination_ of a copyField, set it for the source field.
>
> If you need to retrieve the original, just specify the source field in the fl list.
>
> Best,
> Erick

Re: stored=true what should I see from stem fields

Posted by Erick Erickson <er...@gmail.com>.
One other bit:

There’s rarely a reason to set stored=true for the _destination_ of a
copyField, and there are multiple reasons _not_ to; set it for the
source field instead.

If you need to retrieve the original, just specify the source field in the fl list.

Best,
Erick

> On Apr 24, 2020, at 8:42 PM, Chris Hostetter <ho...@fucit.org> wrote:
> 
> 
> : Is what is shown in "analysis" the same as what is stored in a field?
> 
> https://lucene.apache.org/solr/guide/8_5/analyzers.html
> 
> The output of an Analyzer affects the terms indexed in a given field (and 
> the terms used when parsing queries against those fields) but it has no 
> impact on the stored value for the fields. For example: an analyzer might 
> split "Brown Cow" into two indexed terms "brown" and "cow", but the stored 
> value will still be a single String: "Brown Cow"
> 
> 
> : So I indexed a document with "the quick brown fox jumped over the
> : sleeping dog" set for stuff_raw and when I query for the document
> : stuff_stems just has "the quick brown fox jumped over the sleeping
> : dog" and NOT "quick brown fox jump over sleep dog"
> 
> 
> https://lucene.apache.org/solr/guide/8_5/copying-fields.html
> 
> Fields are copied before analysis is done, meaning you can have two 
> fields with identical original content, but which use different analysis 
> chains and are stored in the index differently.
> 
> 
> 
> : Also stuff_everything only contains a single item, which is weird
> : because I copy two things into it.
> 
> https://lucene.apache.org/solr/guide/8_5/copying-fields.html
> 
> Copying is done at the stream source level and no copy feeds into another 
> copy. This means that copy fields cannot be chained i.e., you cannot copy 
> from here to there and then from there to elsewhere. However, the same 
> source field can be copied to multiple destination fields:
> 
> 
> -Hoss
> http://www.lucidworks.com/


Re: stored=true what should I see from stem fields

Posted by Chris Hostetter <ho...@fucit.org>.
: Is what is shown in "analysis" the same as what is stored in a field?

https://lucene.apache.org/solr/guide/8_5/analyzers.html

The output of an Analyzer affects the terms indexed in a given field (and 
the terms used when parsing queries against those fields) but it has no 
impact on the stored value for the fields. For example: an analyzer might 
split "Brown Cow" into two indexed terms "brown" and "cow", but the stored 
value will still be a single String: "Brown Cow"


: So I indexed a document with "the quick brown fox jumped over the
: sleeping dog" set for stuff_raw and when I query for the document
: stuff_stems just has "the quick brown fox jumped over the sleeping
: dog" and NOT "quick brown fox jump over sleep dog"


https://lucene.apache.org/solr/guide/8_5/copying-fields.html

Fields are copied before analysis is done, meaning you can have two 
fields with identical original content, but which use different analysis 
chains and are stored in the index differently.



: Also stuff_everything only contains a single item, which is weird
: because I copy two things into it.

https://lucene.apache.org/solr/guide/8_5/copying-fields.html

Copying is done at the stream source level and no copy feeds into another 
copy. This means that copy fields cannot be chained i.e., you cannot copy 
from here to there and then from there to elsewhere. However, the same 
source field can be copied to multiple destination fields:
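
Applied to the schema in the original question, that would look roughly
like the sketch below, which reuses the field names from the question:
both destinations are fed directly from stuff_raw. A third rule from
stuff_stems to stuff_everything adds nothing when stuff_stems is itself
only populated by a copy.

```xml
<!-- Both copies read the original stuff_raw input; neither feeds the other. -->
<copyField source="stuff_raw" dest="stuff_stems"/>
<copyField source="stuff_raw" dest="stuff_everything"/>
```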


-Hoss
http://www.lucidworks.com/

Re: stored=true what should I see from stem fields

Posted by Shawn Heisey <ap...@elyograg.org>.
On 4/24/2020 5:48 PM, matthew sporleder wrote:
> Is what is shown in "analysis" the same as what is stored in a field?

The stored data (what you see in search results) is always exactly what 
was sent to Solr, modified by any update processors that are in use.

The index (what you are actually searching) will contain the results of 
analysis, but the stored data will not reflect that.

If you are copying fields for different analysis on the same input data, 
only one of them will need to be stored.  There is usually no need to 
have a raw field in addition to an analyzed field, since the stored data 
on the analyzed field will be exactly the same.
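
As a sketch of that advice using the field names from the original
question (keeping stuff_raw as the one stored field is an assumption
about which copy is wanted back in results):

```xml
<!-- Only the source field keeps the stored value. -->
<field name="stuff_raw" type="string_raw" indexed="true" stored="true"
       multiValued="false"/>
<!-- Copy destinations are searchable but are not returned in results. -->
<field name="stuff_stems" type="stems" indexed="true" stored="false"
       multiValued="false"/>
<field name="stuff_everything" type="everything" indexed="true"
       stored="false" multiValued="true"/>
```

The original text can then be retrieved by listing the source field in
the request, for example fl=id,stuff_raw.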

Thanks,
Shawn