Posted to solr-user@lucene.apache.org by Steven White <sw...@gmail.com> on 2016/07/05 21:34:23 UTC

Getting a hit on "the}" but not on "the" or "}"

Hi everyone,

I'm trying to understand why I get a hit when I search for "the}" but not
when I search for "the" (searches are done without the quotes and "the" is
a stopword in my case).

Here is the debugQuery output using "the}":
  "debug": {
    "rawquerystring": "the}",
    "querystring": "the}",
    "parsedquery": "(+DisjunctionMaxQuery(((ALL_FIELDS:the}
ALL_FIELDS:the))~1.0))/no_coord",
    "parsedquery_toString": "+((ALL_FIELDS:the} ALL_FIELDS:the))~1.0",
    "explain": {
      "-1.5.1804": "\n0.14220011 = sum of:\n  0.14220011 =
weight(ALL_FIELDS:the in 0) [DefaultSimilarity], result of:\n    0.14220011
= score(doc=0,freq=2.0), product of:\n      0.51863563 = queryWeight,
product of:\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
 0.20899205 = queryNorm\n      0.27418116 = fieldWeight in 0, product of:\n
       1.4142135 = tf(freq=2.0), with freq of:\n          2.0 =
termFreq=2.0\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
 0.078125 = fieldNorm(doc=0)\n",
      "-1.5.3552": "\n0.14220011 = sum of:\n  0.14220011 =
weight(ALL_FIELDS:the in 0) [DefaultSimilarity], result of:\n    0.14220011
= score(doc=0,freq=2.0), product of:\n      0.51863563 = queryWeight,
product of:\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
 0.20899205 = queryNorm\n      0.27418116 = fieldWeight in 0, product of:\n
       1.4142135 = tf(freq=2.0), with freq of:\n          2.0 =
termFreq=2.0\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
 0.078125 = fieldNorm(doc=0)\n",
      "-1.5.3554": "\n0.14220011 = sum of:\n  0.14220011 =
weight(ALL_FIELDS:the in 1) [DefaultSimilarity], result of:\n    0.14220011
= score(doc=1,freq=2.0), product of:\n      0.51863563 = queryWeight,
product of:\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
 0.20899205 = queryNorm\n      0.27418116 = fieldWeight in 1, product of:\n
       1.4142135 = tf(freq=2.0), with freq of:\n          2.0 =
termFreq=2.0\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
 0.078125 = fieldNorm(doc=1)\n",
      "-1.5.1802": "\n0.1137601 = sum of:\n  0.1137601 =
weight(ALL_FIELDS:the in 0) [DefaultSimilarity], result of:\n    0.1137601
= score(doc=0,freq=2.0), product of:\n      0.51863563 = queryWeight,
product of:\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
 0.20899205 = queryNorm\n      0.21934493 = fieldWeight in 0, product of:\n
       1.4142135 = tf(freq=2.0), with freq of:\n          2.0 =
termFreq=2.0\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
 0.0625 = fieldNorm(doc=0)\n"
    },
    "QParser": "ExtendedDismaxQParser",
    "altquerystring": null,
    "boost_queries": null,
    "parsed_boost_queries": [],
    "boostfuncs": null,
    "filter_queries": [
      "ISBN_GROUP_ID:2"
    ],
    "parsed_filter_queries": [
      "ISBN_GROUP_ID:2"
    ],

Here is the debugQuery output using "the":
  "debug": {
    "rawquerystring": "the",
    "querystring": "the",
    "parsedquery": "(+())/no_coord",
    "parsedquery_toString": "+()",
    "explain": {},
    "QParser": "ExtendedDismaxQParser",
    "altquerystring": null,
    "boost_queries": null,
    "parsed_boost_queries": [],
    "boostfuncs": null,
    "filter_queries": [
      "ISBN_GROUP_ID:2"
    ],
    "parsed_filter_queries": [
      "ISBN_GROUP_ID:2"
    ],

As expected, I get no hits when I search for just "}":
  "debug": {
    "rawquerystring": "}",
    "querystring": "}",
    "parsedquery": "(+DisjunctionMaxQuery((ALL_FIELDS:})~1.0))/no_coord",
    "parsedquery_toString": "+(ALL_FIELDS:})~1.0",
    "explain": {},
    "QParser": "ExtendedDismaxQParser",
    "altquerystring": null,
    "boost_queries": null,
    "parsed_boost_queries": [],
    "boostfuncs": null,
    "filter_queries": [
      "ISBN_GROUP_ID:2"
    ],
    "parsed_filter_queries": [
      "ISBN_GROUP_ID:2"
    ],

In case it matters, I'm also getting a hit when I search for "the." or
"the]" or "the/" or "the," or "the=" etc.

Thanks in advance.

Steve

Re: Getting a hit on "the}" but not on "the" or "}"

Posted by Erick Erickson <er...@gmail.com>.
Yes and no. WDFF does, indeed, break things up. But the parts it
produces are also sequential, and you can often get what you want
via phrase searches.

But what you have now puts junk in your index. What use is
"the}" as a single token? It's up to you, but consider
cleaning that sort of stuff up with, say, a regex filter that
preserves e-mail addresses but strips other stuff. This gets
a little tricky, as you have to decide what constitutes "garbage",
which is not all that easy. For instance, phone numbers: are
hyphens, parens and the like "garbage"? The same goes for SSNs, etc.
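
A rough sketch of the kind of regex cleanup described above, assuming Solr's PatternReplaceFilterFactory and LengthFilterFactory are used for it. The pattern, the max value and the placement in the chain are illustrative assumptions, not a tested recommendation:

```xml
<!-- Sketch only: place after the tokenizer and before the stop filter. -->
<!-- Assumption: "garbage" means punctuation at the very start or end of a
     token, so "the}" becomes "the" while "abc@apache.org" is left alone
     because its punctuation is interior. -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="^\p{Punct}+|\p{Punct}+$" replacement="" replace="all"/>
<!-- A token made up only of punctuation, such as "}", is now empty; drop it.
     The max of 255 is an arbitrary illustrative value. -->
<filter class="solr.LengthFilterFactory" min="1" max="255"/>
```

Whether this is the right definition of "garbage" depends on the data: the same pattern would also turn "/intern" into "intern" and strip the parentheses from a token like "(555)".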

Best,
Erick


Re: Getting a hit on "the}" but not on "the" or "}"

Posted by Steven White <sw...@gmail.com>.
Thanks Erick.  Moving the stopword factory to after WDFF fixed the problem: I
no longer get a hit on "the}" or on variations such as "the]", "the.", etc.  I
did not have to change preserveOriginal from 1 to 0.
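
As a sketch of that change, the reordered analyzer might look like the following. It assumes the filters are otherwise exactly those of the analyzer quoted elsewhere in this thread, and that the stop filter sits immediately after WDFF (the message only says "after"):

```xml
<!-- Sketch: same filters as the analyzer quoted in this thread, with only the
     StopFilterFactory moved to after WordDelimiterFilterFactory. -->
<fieldType name="all_raw_text" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- WDFF runs first: "the}" is split into "the", and the original "the}"
         is kept because preserveOriginal="1". -->
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
            generateNumberParts="1" splitOnCaseChange="0" catenateWords="1"
            splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1"
            catenateAll="1" catenateNumbers="1"/>
    <!-- The stop filter now sees the bare "the" and can remove it. -->
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt"
            ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the same analyzer runs at index and query time, documents need to be reindexed after a change like this so that the indexed terms match the new chain.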

Regarding preserveOriginal in WDFF, I have it set to 1 because my
understanding is that, given the text "abc@apache.org", WDFF with
preserveOriginal set to 1 will give me "abc", "apache", "org" and
"abc@apache.org".  In effect, if someone searches on "abc" or "apache" or
"org", as well as on "abc@apache.org", I will get a hit.  That is, if I set
preserveOriginal to 0, then searching for "abc@apache.org" will not give me
a hit.  My goal is to still get a hit on the original word, not just the
breakdown that WDFF gives me.  Is my understanding correct?

Steve



Re: Getting a hit on "the}" but not on "the" or "}"

Posted by Erick Erickson <er...@gmail.com>.
Either that's a typo or your problem is that it should be terms.fl, not
terms.f1 (lowercase ell as opposed to the number one). You should be seeing
the raw terms in your index with TermsComponent, similar to the "load terms"
in the schema browser, except that it allows you to query specific terms
starting with terms.prefix.

WordDelimiterFilterFactory (WDFF) is what's stripping off your
non-alphanumeric characters. Your stopword factory is before WDFF, so
anything like be. (notice the period) would NOT be removed as a stopword.
Then when that token is passed through WDFF the period disappears.
Order matters.

You have preserveOriginal="1" in WDFF, which means the original token
is preserved intact, so "the}" gets changed to two tokens, "the" and "the}".

So you really have to look more closely at your analysis chain; that's
pretty much where your problems appear to be.

Best,
Erick

On Tue, Jul 5, 2016 at 4:30 PM, Steven White <sw...@gmail.com> wrote:
> Hi Erick,
>
> By TermsComponent, I think you meant for me to try the following?
>
>
> http://vottopg15.ottawa.ibm.com:8983/solr/testdata/terms?terms.f1=ALL_FIELDS&terms.prefix=the
>
> If so, I tried it and I'm getting 0 hits:
>
>   <response>
>     <lst name="responseHeader">
>       <int name="status">0</int>
>       <int name="QTime">0</int>
>     </lst>
>     <lst name="terms"/>
>   </response>
>
> In fact, I'm getting 0 hits on anything I pass to "terms.prefix"
>
> Another thing I noticed is this.  Using Solr Admin Console's Schema
> Browser, after selecting the field "ALL_FIELDS" and clicking on the Load Term
> Info button, I'm seeing "be" in the list!!  Like so:
>
>   4 localhost
>     abc
>     abc@localhost.com
>     com
>     intern
>     be
>     /intern
>     abclocalhostcom
>     user
>
> I don't understand what I'm looking at here (in the schema browser) or if
> this is at all related to my issue (I'm seeing "be" listed here and
> wondering if it has something to do with my issue).  If I click on any of
> the listed words, I get a hit, but I get 0 hits when I click on "be".
>
> Thanks.
>
> Steve
>
>
> On Tue, Jul 5, 2016 at 7:07 PM, Steven White <sw...@gmail.com> wrote:
>
>> Thanks for the quick reply Erick.
>>
>> Here is the analyzer I'm using:
>>
>>   <fieldType name="all_raw_text" class="solr.TextField"
>>              positionIncrementGap="100" autoGeneratePhraseQueries="true">
>>     <analyzer>
>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>       <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt"
>>               ignoreCase="true"/>
>>       <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
>>               generateNumberParts="1" splitOnCaseChange="0" catenateWords="1"
>>               splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1"
>>               catenateAll="1" catenateNumbers="1"/>
>>       <filter class="solr.LowerCaseFilterFactory"/>
>>       <filter class="solr.EnglishPossessiveFilterFactory"/>
>>       <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>>       <filter class="solr.PorterStemFilterFactory"/>
>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>     </analyzer>
>>   </fieldType>
>>
>> If in fact it is my analyzer, what part of it is causing this?  If not,
>> I'm not clear about the "TermsComponent" that you suggested having me look
>> into.  How do I "point" it at my field?  I have zero knowledge about this.
>> Is this something I do from Solr's Admin Console via Schema Browser link?
>>
>> Steve
>>
>>
>> On Tue, Jul 5, 2016 at 6:51 PM, Erick Erickson <er...@gmail.com>
>> wrote:
>>
>>> My guess is that your field analysis isn't stripping the various
>>> non-alphanumeric characters, thus "the]" is actually a token in your
>>> index, square bracket and all. If that's true, it certainly doesn't
>>> match the stopword "the".
>>>
>>> You can check by using the TermsComponent, pointing it at your field
>>> and setting terms.prefix=the
>>>
>>> See:
>>> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
>>>
>>> Best,
>>> Erick
>>> >
>>> > Steve
>>>
>>
>>

Re: Getting a hit on "the}" but not on "the" or "}"

Posted by Steven White <sw...@gmail.com>.
Hi Erick,

By TermsComponent, I think you meant for me to try the following?


http://vottopg15.ottawa.ibm.com:8983/solr/testdata/terms?terms.f1=ALL_FIELDS&terms.prefix=the

If so, I tried it and I'm getting 0 hits:

  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">0</int>
    </lst>
    <lst name="terms"/>
  </response>

In fact, I'm getting 0 hits on anything I pass to "terms.prefix".

Another thing I noticed: using the Schema Browser in Solr's Admin Console,
after selecting the field "ALL_FIELDS" and clicking the Load Term Info
button, I see "be" in the list, like so:

  4 localhost
    abc
    abc@localhost.com
    com
    intern
    be
    /intern
    abclocalhostcom
    user

I don't understand what I'm looking at here (in the Schema Browser) or
whether it is related to my issue at all (I'm seeing "be" listed here and
wondering if it has something to do with my issue).  If I click on any of
the listed words, I get a hit, but I get 0 hits when I click on "be".

Thanks.

Steve
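
A note on the TermsComponent request above: the parameter that selects the
field is "terms.fl" (a lowercase letter L, for "field list"), whereas the URL
above appears to read "terms.f1" (digit one). An empty "terms" list would be
consistent with no field having been selected, though that is only a guess
from the response shown. The Python sketch below just builds the request URL;
the host name and the terms.limit value are placeholders, and the core name
"testdata" is taken from the URL above.

```python
from urllib.parse import urlencode


def build_terms_url(base_url, field, prefix, limit=20):
    """Build a TermsComponent request URL for one field and one term prefix."""
    params = {
        "terms": "true",
        "terms.fl": field,        # "fl" with a lowercase L, not the digit 1
        "terms.prefix": prefix,   # only list indexed terms starting with this
        "terms.limit": limit,     # placeholder cap on the number of terms
        "wt": "json",
    }
    return f"{base_url}/terms?{urlencode(params)}"


# Hypothetical host; substitute the real Solr base URL and core.
url = build_terms_url("http://localhost:8983/solr/testdata", "ALL_FIELDS", "the")
print(url)
```

If terms such as "the" or "the}" come back from this request, they are
present in the index for ALL_FIELDS.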



Re: Getting a hit on "the}" but not on "the" or "}"

Posted by Steven White <sw...@gmail.com>.
Thanks for the quick reply, Erick.

Here is the analyzer I'm using:

  <fieldType name="all_raw_text" class="solr.TextField"
             positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt"
              ignoreCase="true"/>
      <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
              generateNumberParts="1" splitOnCaseChange="0" catenateWords="1"
              splitOnNumerics="1" stemEnglishPossessive="1"
              generateWordParts="1" catenateAll="1" catenateNumbers="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory"
              protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

If it is in fact my analyzer, which part of it is causing this?  If not, I'm
not clear about the "TermsComponent" that you suggested I look into.  How do
I "point" it at my field?  I have zero knowledge about this.  Is this
something I do from Solr's Admin Console via the Schema Browser link?

Steve
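
One possible reading of the analyzer above: StopFilterFactory runs before
WordDelimiterFilterFactory, so the whitespace token "the}" is not recognised
as the stopword "the". The delimiter filter (preserveOriginal="1") then emits
both "the}" and "the", which matches the parsedquery shown for "the}". The
Python sketch below is only a rough simulation of that ordering, not Solr
code: the stopword set is an assumed subset of lang/stopwords_en.txt, the
delimiter step simply splits out alphabetic and numeric runs, and the
possessive and stemming filters are left out.

```python
import re

# Assumed subset of lang/stopwords_en.txt, for illustration only.
STOPWORDS = {"the", "a", "an", "be"}


def stop_filter(tokens):
    """Drop tokens that exactly match a stopword (ignoreCase="true")."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def word_delimiter(tokens):
    """Rough stand-in for WordDelimiterFilterFactory with preserveOriginal="1":
    keep the original token, then add its alphabetic/numeric runs."""
    out = []
    for tok in tokens:
        out.append(tok)
        out.extend(p for p in re.findall(r"[A-Za-z]+|[0-9]+", tok) if p != tok)
    return out


def analyze(text):
    """Apply the filters in the same order as the fieldType above."""
    tokens = text.split()                 # WhitespaceTokenizerFactory
    tokens = stop_filter(tokens)          # StopFilterFactory (runs first)
    tokens = word_delimiter(tokens)       # WordDelimiterFilterFactory
    tokens = [t.lower() for t in tokens]  # LowerCaseFilterFactory
    return list(dict.fromkeys(tokens))    # drop duplicates, keep order


for query in ("the}", "the", "}"):
    print(repr(query), "->", analyze(query))
```

If this reading is right, the same thing would happen at index time for text
such as "the." or "the,", leaving a bare "the" term in ALL_FIELDS for the
"the}" query to hit, while a plain "the" query is emptied by the stop filter.
Moving the stop filter after the delimiter and lowercase filters is one thing
to try.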



Re: Getting a hit on "the}" but not on "the" or "}"

Posted by Erick Erickson <er...@gmail.com>.
My guess is that your field analysis isn't stripping the various
non-alphanumeric characters, so "the]" is actually a token in your index,
square bracket and all. If that's true, it certainly doesn't match the
stopword "the".

You can check by using the TermsComponent, pointing it at your field
and setting terms.prefix=the.

See:
https://cwiki.apache.org/confluence/display/solr/The+Terms+Component

Best,
Erick
