Posted to dev@lucene.apache.org by "Simon Endele (JIRA)" <ji...@apache.org> on 2015/03/02 18:25:04 UTC

[jira] [Commented] (SOLR-5332) Add "preserve original" setting to the EdgeNGramFilterFactory

    [ https://issues.apache.org/jira/browse/SOLR-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343414#comment-14343414 ] 

Simon Endele commented on SOLR-5332:
------------------------------------

+1 for this feature.
We use the EdgeNGramFilterFactory on a tokenized field (to implement a "prefix search" at index time) with minGramSize="3".
Unfortunately, we observed that tokens of length 1 or 2 are dropped entirely, which we did not expect.
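
To make the dropped-token behaviour concrete, here is a minimal Python sketch of edge n-gram generation. It is a plain simulation, not Lucene's implementation; the helper names, the whitespace tokenizer, the lowercasing, and the maximum gram size of 15 are assumptions for illustration only.

```python
def edge_ngrams(token, min_gram, max_gram):
    """Return the leading-edge n-grams of token with lengths min_gram..max_gram."""
    upper = min(len(token), max_gram)
    # If the token is shorter than min_gram, the range is empty and nothing is emitted.
    return [token[:n] for n in range(min_gram, upper + 1)]


def analyze(text, min_gram=3, max_gram=15):
    """Lowercase, split on whitespace (assumed tokenizer), expand each token."""
    return {tok: edge_ngrams(tok, min_gram, max_gram) for tok in text.lower().split()}


result = analyze("US representative")
for token, grams in result.items():
    print(token, "->", grams)
```

In this sketch the token "us" yields an empty list, so nothing is left to index for it, while "representative" yields "rep", "repr", and so on up to the full word.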

Using a second field without the filter (complicated, IMHO) would address the query issues, but it gets awkward when it comes to highlighting or phrase searches.
For instance, when searching for "us rep":
- the field with EdgeNGramFilterFactory highlights "rep" in "representative", but not "US" as this token has been removed,
- the field without EdgeNGramFilterFactory highlights "US", but not "representative" as it has no prefixes indexed.

Merging these highlights into one string is quite a complex task,
not to mention phrase searches, which do not work at all for the example above.

We use minGramSize="3" to reduce collisions between prefixes and abbreviations (like "US" and "usage") and to reduce the index size.
I admit this does not prevent all collisions (e.g. "USA" still collides with "usage"), but it's a compromise.
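
The collision trade-off can be checked with a small Python sketch. Again this is only a simulation under assumptions (lowercased terms, a hypothetical `collides` helper, an assumed maximum gram size of 15), not Solr's actual matching logic.

```python
def edge_ngrams(token, min_gram, max_gram=15):
    """Return the set of leading-edge n-grams of token (lengths min_gram..max_gram)."""
    upper = min(len(token), max_gram)
    return {token[:n] for n in range(min_gram, upper + 1)}


def collides(query, indexed_word, min_gram):
    """True if the lowercased query equals one of the indexed word's edge n-grams."""
    return query.lower() in edge_ngrams(indexed_word.lower(), min_gram)


us_min2 = collides("US", "usage", min_gram=2)    # "us" is indexed as a 2-gram of "usage"
us_min3 = collides("US", "usage", min_gram=3)    # no 2-grams indexed, so no collision
usa_min3 = collides("USA", "usage", min_gram=3)  # "usa" is still a 3-gram of "usage"
print(us_min2, us_min3, usa_min3)
```

Raising the minimum gram size removes the "US"/"usage" collision in this model, while "USA"/"usage" remains.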

So minGramSize is a nice feature of EdgeNGramFilterFactory, but IMO it lacks a "preserveOriginal" flag.
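
A rough Python sketch of what such a flag might do, assuming the semantics requested in the issue description (keep the original token when its length falls outside the allowed gram sizes). The `preserve_original` parameter and the function itself are hypothetical and do not describe an existing Solr option.

```python
def edge_ngrams(token, min_gram, max_gram, preserve_original=False):
    """Edge n-grams of token; optionally keep the token itself when it is
    shorter than min_gram or longer than max_gram (hypothetical behaviour)."""
    upper = min(len(token), max_gram)
    grams = [token[:n] for n in range(min_gram, upper + 1)]
    out_of_range = len(token) < min_gram or len(token) > max_gram
    if preserve_original and out_of_range:
        grams.append(token)
    return grams


short_without = edge_ngrams("us", 3, 15)
short_with = edge_ngrams("us", 3, 15, preserve_original=True)
long_token = "someveryandverylongusername"  # 27 characters, from the issue description
long_with = edge_ngrams(long_token, 2, 25, preserve_original=True)
print(short_without, short_with, long_with[-1])
```

With the flag set, "us" survives as a token of its own, and the 27-character user name is indexed in full in addition to its prefixes of length 2 to 25.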

> Add "preserve original" setting to the EdgeNGramFilterFactory
> -------------------------------------------------------------
>
>                 Key: SOLR-5332
>                 URL: https://issues.apache.org/jira/browse/SOLR-5332
>             Project: Solr
>          Issue Type: Wish
>    Affects Versions: 4.4, 4.5, 4.5.1, 4.6
>            Reporter: Alexander S.
>
> Hi, as described here: http://lucene.472066.n3.nabble.com/Help-to-figure-out-why-query-does-not-match-td4086967.html the problem is that if you have these 2 strings to index:
> 1. facebook.com/someuser.1
> 2. facebook.com/someveryandverylongusername
> and the edge ngram filter factory with min and max gram size settings 2 and 25, search requests for these urls will fail.
> But search requests for:
> 1. facebook.com/someuser
> 2. facebook.com/someveryandverylonguserna
> will work properly.
> That's because the first URL has "1" at the end, which is shorter than the allowed min gram size. In the second URL, the user name (27 characters) is longer than the max gram size.
> It would be good to have a "preserve original" option that adds the original string to the index if it does not fit the allowed gram size, so that the "1" and "someveryandverylongusername" tokens are also added to the index.
> Best,
> Alex



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
