Posted to dev@lucene.apache.org by "Koorosh Vakhshoori (JIRA)" <ji...@apache.org> on 2015/11/19 21:14:11 UTC

[jira] [Updated] (SOLR-7136) Add an AutoPhrasing TokenFilter

     [ https://issues.apache.org/jira/browse/SOLR-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koorosh Vakhshoori updated SOLR-7136:
-------------------------------------
    Attachment: SOLR-7136.patch
                AutoPhaseFiniteStateDiagram.pdf

I am uploading a new implementation of AutoPhrasing, written in coordination with Ted. This version adds several features on top of the previous code:
- The phrase detection algorithm is refactored as a finite-state machine. This FSM takes a term as input for each transition. I am including the FSM diagram here.
- The new code correctly keeps track of the start and end offsets in all cases.
- The code now records the PositionLength attribute, which will be handy for the highlighter once the highlighter is fixed (SOLR-3390).
- There is a new argument ‘emitAmbiguousPhrases’. When it is set to false, the filter emits only the auto-phrase that matches the longest sequence of terms. For example, if the autophrases.txt file contains both ‘New York City’ and ‘New York’ and the text is ‘New York City is a great place to live’, only ‘New York City’ is emitted. My use case required this, and I am sure others may want it too.
- Rather than applying AutoPhrasing at index time, you can now detect phrases at query time by setting ‘quotePhrase’ to true. This is a major enhancement: nothing special is needed at index time, because the queryParser simply double-quotes the detected phrase and runs the search as a phrase query. Another advantage is that you can update the autophrases.txt file on the fly, with no need to re-index.
- Updated the queryParser so that it does not touch any term inside a quoted string, since doing so would interfere with the user’s intent. For example, in the query ‘we are going to “New York airport”’ the phrase “New York airport” is left untouched.
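
To make the longest-match and query-time quoting behaviour described in the list above concrete, here is a minimal, self-contained sketch. It is not the code in SOLR-7136.patch: the class and method names are hypothetical, it uses a simple greedy longest-match scan over whitespace-separated tokens rather than the finite-state machine from the patch, and it only models the case where ‘emitAmbiguousPhrases’ is false and ‘quotePhrase’ is true. Matching is assumed to be case-insensitive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; names are illustrative and not taken from SOLR-7136.patch.
class AutoPhraseSketch {

    // Length (in tokens) of the longest known phrase starting at index i, or 0 if none.
    static int longestMatchAt(List<String> tokens, int i, Set<List<String>> phrases, int maxLen) {
        // Try the longest candidate first so 'New York City' wins over 'New York'.
        for (int len = Math.min(maxLen, tokens.size() - i); len >= 1; len--) {
            List<String> candidate = new ArrayList<>();
            for (String t : tokens.subList(i, i + len)) {
                candidate.add(t.toLowerCase(Locale.ROOT));
            }
            if (phrases.contains(candidate)) {
                return len;
            }
        }
        return 0;
    }

    // Wraps detected phrases in double quotes; text the user already quoted is left as-is.
    // phraseLines stands in for the lines of an autophrases.txt file.
    static String quotePhrases(String query, Collection<String> phraseLines) {
        Set<List<String>> phrases = new HashSet<>();
        int maxLen = 0;
        for (String line : phraseLines) {
            List<String> words = Arrays.asList(line.trim().toLowerCase(Locale.ROOT).split("\\s+"));
            phrases.add(words);
            maxLen = Math.max(maxLen, words.size());
        }
        // Splitting on '"' puts user-quoted text at odd indexes. A trailing odd
        // segment means an unbalanced quote, which this sketch treats as plain text.
        String[] segments = query.split("\"", -1);
        List<String> pieces = new ArrayList<>();
        for (int s = 0; s < segments.length; s++) {
            boolean userQuoted = (s % 2 == 1) && (s < segments.length - 1);
            if (userQuoted) {
                pieces.add("\"" + segments[s] + "\"");
                continue;
            }
            String text = segments[s].trim();
            if (text.isEmpty()) {
                continue;
            }
            List<String> tokens = Arrays.asList(text.split("\\s+"));
            int i = 0;
            while (i < tokens.size()) {
                int len = longestMatchAt(tokens, i, phrases, maxLen);
                if (len > 0) {
                    pieces.add("\"" + String.join(" ", tokens.subList(i, i + len)) + "\"");
                    i += len;
                } else {
                    pieces.add(tokens.get(i));
                    i++;
                }
            }
        }
        return String.join(" ", pieces);
    }
}
```

With the phrase list ‘New York City’ and ‘New York’, calling AutoPhraseSketch.quotePhrases on ‘New York City is a great place to live’ quotes only the three-term phrase, and a query that already contains “New York airport” in double quotes comes back unchanged.
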
Side note: as for comparing the SOLR-4381 patch with this one, in my opinion they are complementary, not competing. I experimented with chaining AutoPhrasing and Query-time Synonym as a queryParser. They work well together: one detects the phrases and the other expands the query to its synonyms. However, one issue I found concerns acronyms in the synonym list. For example, DC stands for ‘Direct Current’. If the indexed text contains DC, searching for ‘Current’ would not match DC, since the indexed document has not expanded the term to ‘Direct Current’.


> Add an AutoPhrasing TokenFilter
> -------------------------------
>
>                 Key: SOLR-7136
>                 URL: https://issues.apache.org/jira/browse/SOLR-7136
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Ted Sullivan
>         Attachments: AutoPhaseFiniteStateDiagram.pdf, SOLR-7136.patch, SOLR-7136.patch, SOLR-7136.patch, SOLR-7136.patch
>
>
> Adds an 'autophrasing' token filter which is designed to enable noun phrases that represent a single entity to be tokenized in a singular fashion. Adds support for ManagedResources and Query parser auto-phrasing support given LUCENE-2605.
> The rationale for this Token Filter and its use in solving the long standing multi-term synonym problem in Lucene Solr has been documented online. 
> http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/
> https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
