Posted to dev@lucene.apache.org by "Ted Sullivan (JIRA)" <ji...@apache.org> on 2015/11/20 17:13:11 UTC

[jira] [Comment Edited] (SOLR-7136) Add an AutoPhrasing TokenFilter

    [ https://issues.apache.org/jira/browse/SOLR-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15018207#comment-15018207 ] 

Ted Sullivan edited comment on SOLR-7136 at 11/20/15 4:12 PM:
--------------------------------------------------------------

Thanks for this submission [~kvakhshoori@gmail.com]! I think this really helps the autophrasing solution scale. The improved handling of PositionLength is a big plus, as are the improvements to the query parser. Great work, thanks.

I have seen some reports of memory leaks in the GitHub version of my code. Have you looked into that? I will take your patch and run some A/B comparisons to see whether the new FSM implementation (hopefully) removes that problem too. More generally, have you done any performance or scaling tests on your version of the autophrasing filter? That goes hand in hand with the production-readiness that your new implementation makes possible. Thanks again for submitting this patch.

As to complementarity with SOLR-4381 - I agree, and it is nice to hear that the two solutions play nicely with each other :) IMO this is an important problem that needs a committed solution. The more ways we give Solr users to "skin the cat", the better the chance that they will find a solution for their own problem set.

As to the acronym 'DC' - it is ambiguous because it also stands for "District of Columbia". Domain context will clear this up to some extent, but not if you have a global search problem like Google's or Bing's. I'll look into this problem too.


> Add an AutoPhrasing TokenFilter
> -------------------------------
>
>                 Key: SOLR-7136
>                 URL: https://issues.apache.org/jira/browse/SOLR-7136
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Ted Sullivan
>         Attachments: AutoPhaseFiniteStateDiagram.pdf, SOLR-7136.patch, SOLR-7136.patch, SOLR-7136.patch, SOLR-7136.patch
>
>
> Adds an 'autophrasing' token filter designed so that noun phrases representing a single entity are tokenized as a single token. Also adds ManagedResources support and query-parser auto-phrasing support, given LUCENE-2605.
> The rationale for this token filter and its use in solving the long-standing multi-term synonym problem in Lucene/Solr has been documented online:
> http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/
> https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
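
To illustrate the idea in the description above, here is a minimal Python sketch of auto-phrasing as a greedy longest-match merge over a token list. It is only an illustration under stated assumptions, not the code in the attached patch: the function name `autophrase`, the `sep` parameter, and the underscore-joined output form are all hypothetical, and the sketch does not model Lucene token positions or PositionLength at all.

```python
from typing import Iterable, List, Set, Tuple


def autophrase(tokens: List[str], phrases: Iterable[str], sep: str = "_") -> List[str]:
    """Merge known multi-word phrases into single tokens (hypothetical helper)."""
    # Store each phrase as a tuple of lowercase words for case-insensitive lookup.
    phrase_set: Set[Tuple[str, ...]] = {tuple(p.lower().split()) for p in phrases}
    phrase_set.discard(())  # ignore empty phrase entries
    max_len = max((len(p) for p in phrase_set), default=1)

    out: List[str] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest window first so "new york city" wins over "new york".
        # Windows of length 1 are skipped: a single word is already one token.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in phrase_set:
                out.append(sep.join(window))  # merged token is lowercased
                i += n
                matched = True
                break
        if not matched:
            out.append(tokens[i])  # unmatched tokens pass through unchanged
            i += 1
    return out


if __name__ == "__main__":
    phrase_list = ["new york", "new york city", "washington dc"]
    print(autophrase("flights from new york city to washington dc".split(), phrase_list))
```

With a phrase list containing "washington dc", the two words are emitted as one token, so a query for the phrase no longer matches documents that merely contain "washington" and "dc" separately. Whether a bare "DC" means the city or something else is a separate ambiguity that this kind of merge does not resolve.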


