Posted to solr-user@lucene.apache.org by heaven <ah...@gmail.com> on 2013/08/28 13:16:20 UTC

Help to figure out why query does not match

Hi, please help me figure out what's going on. I have the following field type:

<fieldType name="words_ngram" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\d\w]+" />
    <filter class="solr.StopFilterFactory" words="url_stopwords.txt"
ignoreCase="true" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
maxGramSize="20" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\d\w]+" />
    <filter class="solr.StopFilterFactory" words="url_stopwords.txt"
ignoreCase="true" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

And I have the following string indexed:
http://plus.google.com/111950520904110959061/profile

Here is what the analyzer shows:
http://img607.imageshack.us/img607/5074/fn1.png

Then I run the following query:
fq=type:Site&
sort=score desc&
q=https\\:\\/\\/plus.google.com\\/111950520904110959061\\/profile&
fl=* score&
qf=url_words_ngram&
defType=edismax&
start=0&
rows=20&
mm=1

And I get no results.

These queries do match:
1. https://plus.google
2. https://plus.google.com
3. 11195052090

And these do not:
1. https://plus.google.com/111950520904110959061/profile
2. 111950520904110959061/profile
3. 111950520904110959061

The reason is that "111950520904110959061" is 21 characters long while my max
gram size is set to 20. I tried increasing the max gram size to 200 and it
works, but is there any way to match the given query without doing that? The
query analyzer shows exact matches at PT, SF and LCF, or does it work in such
a way that the index only contains the output of the last filter factory
(ENGTF in my example)? If so, is there an option to also preserve the
original tokens?

So that for maxGramSize="5" and indexed string awesomeness I'd have:
"a", "aw", "awe", "awes", "aweso", "awesomeness"

Best,
Alex




Re: Help to figure out why query does not match

Posted by Furkan KAMACI <fu...@gmail.com>.
Hi;

I've made a patch that adds a preserveOriginal capability to
EdgeNGramFilterFactory. You can test it from here:
https://issues.apache.org/jira/browse/SOLR-5152
I've also added a comment about your problem here:
https://issues.apache.org/jira/browse/SOLR-5332?focusedCommentId=13818593&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13818593
It explains that "LowerCaseFilterFactory creates tokens by lowercasing all
letters and dropping non-letters", so non-letters will be dropped "before"
tokens are retrieved by EdgeNGramFilterFactory.
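
With the patch, the index analyzer could look roughly like this (a sketch only;
the preserveOriginal attribute follows the patch and may be named or behave
differently in whatever version you end up on):

  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\d\w]+" />
    <filter class="solr.StopFilterFactory" words="url_stopwords.txt" ignoreCase="true" />
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- preserveOriginal comes from the patch: keep the untruncated token alongside its grams -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="20"
            preserveOriginal="true" />
  </analyzer>

That way the full 21-digit token would be kept in the index and the query-time
token could match it directly.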

PS: Since the first message was sent a month ago, I'm adding a link to the
previous messages:
http://lucene.472066.n3.nabble.com/Help-to-figure-out-why-query-does-not-match-td4086967.html

Thanks;
Furkan KAMACI



Re: Help to figure out why query does not match

Posted by heaven <ah...@gmail.com>.
Hi Erick, I finally got back to this issue.

Here is the wish I've created:
https://issues.apache.org/jira/browse/SOLR-5332

Best,
Alex




Re: Help to figure out why query does not match

Posted by Erick Erickson <er...@gmail.com>.
Hmmm, certainly only the output of the last filter makes it
into the index. Consider stopwords being the last filter:
you'd expect the stopwords to be removed.

There's nothing that I know of that'll do what you're asking;
the code for ENGTF doesn't have any "preserve original"
option that I can see. This seems like a useful addition
though, and you've done a nice job of characterizing the
problem. Want to raise a JIRA and/or do a patch?

I'd guess your only real short-term workaround would be to
increase the max gram size.
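
That is, something along these lines (just raising the limit
past the longest path segment you expect to see):

  <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="200" />

at the cost of quite a few more terms in the index.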

I suppose you could do a copyField into a field that doesn't
do the n-gramming and search against that too, but that
feels kind of kludgy...
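
For illustration, something like this (field and type names here are made up):

  <!-- same chain as words_ngram but without the EdgeNGramFilterFactory -->
  <fieldType name="words_plain" class="solr.TextField" omitNorms="false">
    <analyzer>
      <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\d\w]+" />
      <filter class="solr.StopFilterFactory" words="url_stopwords.txt" ignoreCase="true" />
      <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
  </fieldType>

  <field name="url_words_plain" type="words_plain" indexed="true" stored="false" />
  <copyField source="url_words_ngram" dest="url_words_plain" />

and then query with qf=url_words_ngram url_words_plain so the full-length
tokens have somewhere to match.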

Best,
Erick

