
Posted to dev@lucene.apache.org by "Alan Woodward (JIRA)" <ji...@apache.org> on 2014/02/13 10:25:24 UTC

[jira] [Commented] (SOLR-5722) Add catenateShingles option to WordDelimiterFilter

    [ https://issues.apache.org/jira/browse/SOLR-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900154#comment-13900154 ] 

Alan Woodward commented on SOLR-5722:
-------------------------------------

Does the HyphenatedWordsFilter not help in this case?

https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-HyphenatedWordsFilter

> Add catenateShingles option to WordDelimiterFilter
> --------------------------------------------------
>
>                 Key: SOLR-5722
>                 URL: https://issues.apache.org/jira/browse/SOLR-5722
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Greg Pendlebury
>            Priority: Minor
>              Labels: filter, newbie, patch
>         Attachments: WDFconcatShingles.patch
>
>
> Apologies if I put this in the wrong spot. I'm attaching a patch (against current trunk) that adds support for a 'catenateShingles' option to the WordDelimiterFilter. 
> We (National Library of Australia - NLA) are currently maintaining this as an internal modification to the Filter, but I believe it is generic enough to contribute upstream.
> Description:
> =========
> {code}
> /**
>  * NLA Modification to the standard word delimiter to support various
>  * hyphenation use cases. Primarily driven by requirements for
>  * newspapers where words are often broken across line endings.
>  *
>  *  eg. "hyphenated-surname" is printed across a line ending and
>  *         turns out like "hyphen-ated-surname" or "hyphenated-sur-name".
>  *
>  *  In this scenario the stock filter, with 'catenateAll' turned on, will
>  *  generate individual tokens plus one combined token, but not
>  *  sub-tokens like "hyphenated surname" and "hyphenatedsur name".
>  *
>  *  So we add a new 'catenateShingles' to achieve this.
> */
> {code}
> Includes unit tests. As noted in one of them, CATENATE_WORDS and CATENATE_SHINGLES should be considered mutually exclusive for sensible usage: enabling both can produce duplicate tokens (although the duplicates should have the same positions etc.).
> I'm happy to work on it more if anyone finds problems with it.
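
To make the quoted description concrete, here is a minimal sketch of the catenation idea in plain Java. It is not taken from the attached patch and uses no Lucene or Solr APIs; the class and method names are hypothetical, and it assumes a "shingle" here means any contiguous run of two or more sub-word parts joined without a separator. The patch's real token and position handling may well differ.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical helper, not part of Lucene/Solr or of WDFconcatShingles.patch.
public class CatenateShinglesSketch {

    // Returns every contiguous run of two or more parts, concatenated with no separator.
    // Assumption: this is what "catenated shingles" means in the description above.
    public static List<String> catenatedShingles(List<String> parts) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < parts.size(); start++) {
            StringBuilder run = new StringBuilder(parts.get(start));
            for (int end = start + 1; end < parts.size(); end++) {
                run.append(parts.get(end));
                out.add(run.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Parts a word delimiter would plausibly split out of "hyphenated-sur-name".
        System.out.println(catenatedShingles(List.of("hyphenated", "sur", "name")));
        // prints [hyphenatedsur, hyphenatedsurname, surname]
    }
}
```

Under that assumption, "surname" becomes searchable even though the source text broke it across a line, which a single all-parts concatenation would not provide on its own.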


