Posted to dev@lucene.apache.org by "Ted Sullivan (JIRA)" <ji...@apache.org> on 2015/02/23 22:29:12 UTC

[jira] [Comment Edited] (SOLR-7136) Add an AutoPhrasing TokenFilter

    [ https://issues.apache.org/jira/browse/SOLR-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333833#comment-14333833 ] 

Ted Sullivan edited comment on SOLR-7136 at 2/23/15 9:28 PM:
-------------------------------------------------------------

Yes Ahmet, that is correct: this patch includes a QParserPlugin (AutophrasingQParserPlugin) as a workaround for LUCENE-2605, which is also mentioned in SOLR-5379. The query parser solution published by Nolan Lawson and submitted as SOLR-4381 is a good solution too. Note, however, that the AutoPhrasing parser first solves the problem of tokenizing phrases that represent single entities as single tokens, which makes the Lucene docID lookup cleaner. Solutions like SOLR-5379 solve this indirectly and may have different edge cases, because not all phrases are meant to represent single entities. For example, generalized phrase-processing parameters like mm (minimum should match) or ps (phrase slop) may not deal as precisely with phrases that combine a multi-term entity with something else, like "New York City restaurants". Since it is part of an analysis pipeline, the AutophrasingTokenFilter can be used in conjunction with the SynonymTokenFilter to solve the multi-term synonym problem, but that is an architectural solution. In other words, this TokenFilter was not written to solve the multi-term synonym problem; that is a side benefit of what it does, given the nature of Lucene analysis chains. It has other benefits as well, simply by forcing exact-match semantics on phrases that should be treated as semantic or linguistic entities. It does have the downside of requiring autophrase lists, but so does synonym processing.
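
To illustrate the idea of tokenizing a multi-term entity as a single token, here is a minimal sketch in plain Java. It is not the code in the attached patch and does not use the Lucene TokenFilter API; the class name AutoPhraseSketch, the method autoPhrase, the maxPhraseLen parameter, and the underscore used to join terms are all assumptions made for illustration. It greedily merges the longest phrase from an autophrase list that starts at each token position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the AutophrasingTokenFilter from the patch.
public class AutoPhraseSketch {

    // Merge the longest known phrase starting at each position into one token.
    // "phrases" holds space-separated, already-normalized phrases (an assumption).
    public static List<String> autoPhrase(List<String> tokens, Set<String> phrases, int maxPhraseLen) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int matched = 0;
            // Try the longest candidate first so "new york city" wins over "new york".
            for (int len = Math.min(maxPhraseLen, tokens.size() - i); len >= 2; len--) {
                String candidate = String.join(" ", tokens.subList(i, i + len));
                if (phrases.contains(candidate)) {
                    matched = len;
                    break;
                }
            }
            if (matched > 0) {
                // Emit the whole phrase as a single token (joiner is an assumption).
                out.add(String.join("_", tokens.subList(i, i + matched)));
                i += matched;
            } else {
                // No phrase starts here; pass the token through unchanged.
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> phrases = new HashSet<>(Arrays.asList("new york", "new york city"));
        List<String> tokens = Arrays.asList("new", "york", "city", "restaurants");
        // prints [new_york_city, restaurants]
        System.out.println(autoPhrase(tokens, phrases, 3));
    }
}
```

With the entity emitted as one token, a query for "New York City restaurants" matches the entity exactly and treats "restaurants" as a separate term, which is the exact-match behavior described above.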


was (Author: tedsullivan):
Yes Ahmet - that is correct, this patch includes a QParserPlugin as a workaround for LUCENE-2605 also mentioned in SOLR-5379. (AutophrasingQParserPlugin) The Query Parser solution published by Nolan Lawson and submitted as SOLR-4381 is a good solution too.  Note however that the AutoPhrasing parser first solves a problem of tokenizing phrases that represent single entities as single tokens - making the Lucene docID lookup cleaner.  Solutions like SOLR-5379 solve this indirectly and may have different edge cases because not all phrases are meant to represent single entities. For example, generalized phrase processing paradigms like mm or ps may not deal as precisely with phrases that include a multi-term entity with something else like "New York City restaurants". Since it is part of an analysis pipeline, the AutophrasingTokenFilter can be used in conjunction with the SynonymTokenFilter to solve the multi-synonym problem but that is an architectural solution. In other words this TokenFilter was not written to solve the multi-term synonym problem - that is a side benefit of what it does given the nature of Lucene analysis chains. It has other benefits as well just by forcing exact-match semantics on phrases that should be treated as semantic or linguistic entities. It does have the downside of requiring autophrase lists, but so then does synonym processing.

> Add an AutoPhrasing TokenFilter
> -------------------------------
>
>                 Key: SOLR-7136
>                 URL: https://issues.apache.org/jira/browse/SOLR-7136
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Ted Sullivan
>         Attachments: SOLR-7136.patch
>
>
> Adds an 'autophrasing' token filter designed to let noun phrases that represent a single entity be tokenized as a single token. Also adds support for ManagedResources and query parser auto-phrasing, given LUCENE-2605.
> The rationale for this token filter and its use in solving the long-standing multi-term synonym problem in Lucene/Solr has been documented online:
> http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/
> https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/


