Posted to dev@lucene.apache.org by "Alan Woodward (JIRA)" <ji...@apache.org> on 2014/02/13 10:25:24 UTC
[jira] [Commented] (SOLR-5722) Add catenateShingles option to WordDelimiterFilter
[ https://issues.apache.org/jira/browse/SOLR-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900154#comment-13900154 ]
Alan Woodward commented on SOLR-5722:
-------------------------------------
Does the HyphenatedWordsFilter not help in this case?
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-HyphenatedWordsFilter
> Add catenateShingles option to WordDelimiterFilter
> --------------------------------------------------
>
> Key: SOLR-5722
> URL: https://issues.apache.org/jira/browse/SOLR-5722
> Project: Solr
> Issue Type: Improvement
> Reporter: Greg Pendlebury
> Priority: Minor
> Labels: filter, newbie, patch
> Attachments: WDFconcatShingles.patch
>
>
> Apologies if I put this in the wrong spot. I'm attaching a patch (against current trunk) that adds support for a 'catenateShingles' option to the WordDelimiterFilter.
> We (National Library of Australia - NLA) are currently maintaining this as an internal modification to the Filter, but I believe it is generic enough to contribute upstream.
> Description:
> =========
> {code}
> /**
> * NLA Modification to the standard word delimiter to support various
> * hyphenation use cases. Primarily driven by requirements for
> * newspapers where words are often broken across line endings.
> *
> * eg. "hyphenated-surname" is printed across a line ending and
> * turns out like "hyphen-ated-surname" or "hyphenated-sur-name".
> *
> * In this scenario the stock filter, with 'catenateAll' turned on, will
> * generate individual tokens plus one combined token, but not
> * sub-tokens like "hyphenated surname" and "hyphenatedsur name".
> *
> * So we add a new 'catenateShingles' to achieve this.
> */
> {code}
> The patch includes unit tests. As one of them notes, CATENATE_WORDS and CATENATE_SHINGLES should be treated as mutually exclusive for sensible usage: enabling both can produce duplicate tokens (although the duplicates should have the same positions etc).
> I'm happy to work on it more if anyone finds problems with it.
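
For illustration only, the sub-tokens mentioned in the quoted description can be sketched in plain Java. This is not the code in WDFconcatShingles.patch and does not use the Lucene API. The class name and the pairShingles helper are hypothetical, and the sketch assumes a "shingle" here means the catenation of each adjacent pair of hyphen-separated parts, which may differ from what the patch actually emits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the patch: models the catenated sub-tokens
// described in the issue using plain string handling.
class CatenateShinglesSketch {

    // Assumption: a shingle is the catenation of two adjacent parts
    // obtained by splitting the word on hyphens.
    static List<String> pairShingles(String word) {
        String[] parts = word.split("-");
        List<String> shingles = new ArrayList<>();
        for (int i = 0; i + 1 < parts.length; i++) {
            shingles.add(parts[i] + parts[i + 1]);
        }
        return shingles;
    }

    public static void main(String[] args) {
        // prints [hyphenated, atedsurname]
        System.out.println(pairShingles("hyphen-ated-surname"));
        // prints [hyphenatedsur, surname]
        System.out.println(pairShingles("hyphenated-sur-name"));
    }
}
```

Under that assumption, both broken spellings yield at least one token that also appears in the output for the intended word: "hyphenated" in the first case and "surname" in the second. That is the matching behaviour the description is after.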