Posted to dev@lucene.apache.org by "Dawid Weiss (JIRA)" <ji...@apache.org> on 2009/04/28 20:30:32 UTC

[jira] Created: (LUCENE-1622) Multi-word synonym filter (synonym expansion at indexing time).

Multi-word synonym filter (synonym expansion at indexing time).
---------------------------------------------------------------

                 Key: LUCENE-1622
                 URL: https://issues.apache.org/jira/browse/LUCENE-1622
             Project: Lucene - Java
          Issue Type: New Feature
          Components: contrib/*
            Reporter: Dawid Weiss
            Priority: Minor
         Attachments: synonyms.patch

It would be useful to have a filter that provides support for indexing-time synonym expansion, especially for multi-word synonyms (with multi-word matching for original tokens).

The problem is not trivial, as observed on the mailing list. The problems I was able to identify (also covered in the unit tests):

- if multi-word synonyms are indexed together with the original token stream (at overlapping positions), then a query for a partial synonym sequence (e.g., "big" from the synonym "big apple" for "new york city") matches the document;

- highlighting the original document is problematic when a synonym is matched (see the unit tests for an example);

- if the synonym differs in length from the original sequence of tokens being matched, then phrase queries spanning the boundary between the synonym and the original sequence won't match. For example, with "big apple" as a synonym for "new york city", the phrase query "big apple restaurants" won't match "new york city restaurants".
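
The first and third problems can be reproduced with a minimal sketch in plain Java. This is not the attached patch and it uses no Lucene classes: the class OverlappingSynonymDemo and its methods are hypothetical names, and the "index" is just a map from term to positions for a single document, assuming exact (zero-slop) phrase matching on consecutive positions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy single-document positional index (illustration only, not Lucene API). */
final class OverlappingSynonymDemo {

    /** Indexes the original tokens at positions 0..n-1. */
    static Map<String, Set<Integer>> index(String[] tokens) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            add(postings, tokens[i], i);
        }
        return postings;
    }

    static void add(Map<String, Set<Integer>> postings, String term, int position) {
        postings.computeIfAbsent(term, k -> new TreeSet<>()).add(position);
    }

    /** Overlays synonym tokens on top of the original stream, starting at 'start'. */
    static void overlay(Map<String, Set<Integer>> postings, int start, String... synonym) {
        for (int i = 0; i < synonym.length; i++) {
            add(postings, synonym[i], start + i);
        }
    }

    /** Exact phrase match: every term must sit at consecutive positions. */
    static boolean matchesPhrase(Map<String, Set<Integer>> postings, String... phrase) {
        if (phrase.length == 0) {
            return false;
        }
        for (int start : postings.getOrDefault(phrase[0], Collections.emptySet())) {
            boolean ok = true;
            for (int i = 1; i < phrase.length && ok; i++) {
                ok = postings.getOrDefault(phrase[i], Collections.emptySet()).contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> doc =
                index(new String[] {"new", "york", "city", "restaurants"});
        overlay(doc, 0, "big", "apple"); // "big" at position 0, "apple" at position 1

        // Problem 1: a fragment of the synonym matches on its own.
        System.out.println(matchesPhrase(doc, "big"));                         // true
        // Problem 3: the synonym is shorter than the original, so the
        // phrase crossing the boundary is missed ("restaurants" is at 3, not 2).
        System.out.println(matchesPhrase(doc, "big", "apple", "restaurants")); // false
        // Side effect of the overlap: a mixed phrase matches by accident.
        System.out.println(matchesPhrase(doc, "big", "apple", "city"));        // true
    }
}
```

The last line shows a related side effect of overlapping positions: a phrase mixing synonym and original tokens ("big apple city") matches even though it occurs nowhere in the text.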

I am attaching a patch that implements phrase synonyms as a token filter. It is not necessarily intended for immediate inclusion, but may serve as a basis for people to experiment with and adapt to their own scenarios.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Issue Comment Edited: (LUCENE-1622) Multi-word synonym filter (synonym expansion at indexing time).

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703790#action_12703790 ] 

Earwin Burrfoot edited comment on LUCENE-1622 at 4/28/09 11:50 AM:
-------------------------------------------------------------------

I'll briefly recap the experiences I mentioned on the list.

* Injecting a single "synonym group id" token instead of all tokens for all synonyms in the group is a big win for index size and saves you from matching on "big". It also plays better with highlighting (I still had to rewrite it to handle all corner cases).
* Properly handling multiword synonyms on the index side alone is impossible; you have to dabble in query rewriting (even then, low-probability corner cases exist, and you might find extra docs).
* Query expansion is the only absolutely clear way to have multiword synonyms with current Lucene, but it is impractical with any adequate synonym dictionary.
* There is a possible change to the way Lucene indexes tokens+positions that would enable fully proper multiword synonyms (with the index+query rewrite approach): adding a notion of 'length' or 'span' to a token. This length should play together with positionIncrement when calculating the distance between tokens in phrase/spannear queries.
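
One possible reading of the group-id, query-rewrite and token-length ideas above, combined into a plain-Java sketch. Everything here is an assumption made for illustration: the class SpanAwareIndex, the rewrite helper and the group id "SYN_NYC" are hypothetical names, no Lucene classes are used, and the per-token length is modelled by hand because, per the last point, it would require a change to how Lucene indexes positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy single-document index where each posting carries a length (illustration only). */
final class SpanAwareIndex {

    // term -> list of {position, length} pairs
    private final Map<String, List<int[]>> postings = new HashMap<>();

    void add(String term, int position, int length) {
        postings.computeIfAbsent(term, k -> new ArrayList<>()).add(new int[] {position, length});
    }

    /** Exact phrase match where the next term must start at position + length. */
    boolean matchesPhrase(String... phrase) {
        return phrase.length > 0 && matchesFrom(phrase, 0, -1);
    }

    // 'expected' is the required start position of phrase[i]; -1 means "anywhere".
    private boolean matchesFrom(String[] phrase, int i, int expected) {
        if (i == phrase.length) {
            return true;
        }
        for (int[] p : postings.getOrDefault(phrase[i], Collections.emptyList())) {
            if ((expected < 0 || p[0] == expected) && matchesFrom(phrase, i + 1, p[0] + p[1])) {
                return true;
            }
        }
        return false;
    }

    /** Query-side rewrite: replaces the longest known variant at each point with its group id. */
    static String[] rewrite(String[] query, Map<List<String>, String> groups) {
        List<String> tokens = Arrays.asList(query);
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < query.length) {
            String id = null;
            int matched = 0;
            for (Map.Entry<List<String>, String> e : groups.entrySet()) {
                int size = e.getKey().size();
                if (size > matched && i + size <= query.length
                        && tokens.subList(i, i + size).equals(e.getKey())) {
                    id = e.getValue();
                    matched = size;
                }
            }
            if (id != null) {
                out.add(id);
                i += matched;
            } else {
                out.add(query[i]);
                i++;
            }
        }
        return out.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Map<List<String>, String> groups = new HashMap<>();
        groups.put(Arrays.asList("new", "york", "city"), "SYN_NYC");
        groups.put(Arrays.asList("big", "apple"), "SYN_NYC");

        SpanAwareIndex doc = new SpanAwareIndex();
        String[] text = {"new", "york", "city", "restaurants"};
        for (int i = 0; i < text.length; i++) {
            doc.add(text[i], i, 1);
        }
        doc.add("SYN_NYC", 0, 3); // one group token covering positions 0..2

        String[] q = rewrite(new String[] {"big", "apple", "restaurants"}, groups);
        System.out.println(doc.matchesPhrase(q));     // true
        System.out.println(doc.matchesPhrase("big")); // false
    }
}
```

In this sketch the original tokens stay in the index, so an unrewritten phrase such as "york city" still matches, while the group token's length lets a rewritten "big apple restaurants" line up with "restaurants" at position 3. The corner cases mentioned above are not handled.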


[jira] Updated: (LUCENE-1622) Multi-word synonym filter (synonym expansion at indexing time).

Posted by "Dawid Weiss (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated LUCENE-1622:
--------------------------------

    Attachment: synonyms.patch

Token filter implementing synonyms. Java 1.5 is required to compile it (I left generics for clarity; if folks really need 1.4 compatibility they can be easily removed of course).


[jira] Commented: (LUCENE-1622) Multi-word synonym filter (synonym expansion at indexing time).

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703790#action_12703790 ] 

Earwin Burrfoot commented on LUCENE-1622:
-----------------------------------------

I'll briefly recap the experiences I mentioned on the list.

* Injecting a single "synonym group id" token instead of all tokens for all synonyms in the group is a big win for index size and saves you from matching on "big". It also plays better with highlighting (I still had to rewrite it to handle all corner cases).
* Properly handling multiword synonyms on the index side alone is impossible; you have to dabble in query rewriting (even then, low-probability corner cases exist, and you might find extra docs).
* Query expansion is the only absolutely clear way to have multiword synonyms with current Lucene, but it is impractical with any adequate synonym dictionary.
* There is a possible change to the way Lucene indexes tokens+positions that would enable fully proper multiword synonyms: adding a notion of 'length' or 'span' to a token. This length should play together with positionIncrement when calculating the distance between tokens in phrase/spannear queries.
