Posted to solr-user@lucene.apache.org by Kelvyn Scrupps <Ke...@alliescomputing.com> on 2018/03/29 19:48:25 UTC

WordDelimiterGraphFilter expected behaviour ?

Hi

First posting to the list, but here goes.

I'm using WordDelimiterGraphFilter on a field and came across a curious additional positional "hole" generated by the filter while playing with the analysis tool.  
For input "wibble , wobble" (space either side of the comma so it's a separate token), the output introduces an additional positional hole after the comma, i.e. 

Term     position
wibble   1
,        2
wobble   4 *

The positionLength for each is 1, so there's no obvious graph span going on.

It's not just the comma; any punctuation will do, e.g. "wibble ! wobble".

I know it's a bit contrived, and it doesn't break anything in production, but it just puzzled me.

The question is: is this by design?  It's not the behaviour of the old WordDelimiterFilter.

Setup:

Solr 6.6.3

Field:
<fieldType name="text_en_allies" class="solr.TextField" positionIncrementGap="100">
	<analyzer type="index">
		<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
		<tokenizer class="solr.WhitespaceTokenizerFactory"/>
		<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" splitOnNumerics="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1" stemEnglishPossessive="1"/>
		...
	</analyzer>
</fieldType>

Thanks for any insight.

Kelvyn Scrupps
Developer for Allies Computing

______________________________________________________________________
This email has been scanned by the Symantec Email Security.cloud service 
(http://www.symanteccloud.com) for Allies Computing Ltd
______________________________________________________________________

RE: WordDelimiterGraphFilter expected behaviour ?

Posted by Kelvyn Scrupps <Ke...@alliescomputing.com>.
It's been a holiday here in the UK, hence the delay, but thank you for your far more prompt response.  

It makes sense that the filter is removing the punctuation-only term, and that it only looks odd alongside the original when preserveOriginal is enabled.  Fortunately it was just a curio that came up while I was testing a downstream (and typically flaky) custom filter I'm working on that gets its own position increments in a twist; otherwise I don't think I'd have noticed it.  We don't, or shouldn't, actually send punctuation-only tokens, so it's not really a production concern.

Thanks for the reminder about FlattenGraphFilterFactory too btw.

-----Original Message-----
From: Shawn Heisey [mailto:apache@elyograg.org] 
Sent: 29 March 2018 22:59
To: solr-user@lucene.apache.org
Subject: Re: WordDelimiterGraphFilter expected behaviour ?

On 3/29/2018 1:48 PM, Kelvyn Scrupps wrote:
> I'm using WordDelimiterGraphFilter on a field and came across a curious additional positional "hole" generated by the filter while playing with the analysis tool.  
> For input "wibble , wobble" (space either side of the comma so it's a separate token), the output introduces an additional positional hole after the comma, i.e. 
>
> Term     position
> wibble   1
> ,        2
> wobble   4 *
>
> The positionLength for each is 1, so there's no obvious graph span going on.
>
> It's not just the comma; any punctuation will do, e.g. "wibble ! wobble".

The wrinkle here is enabling preserveOriginal at the same time that you have a term which is completely removed by the filter (in this case, the comma).  If preserveOriginal is disabled, the old filter and the graph filter both behave the same.  I don't know if this is a bug or not.  My instinct is to say it's a bug, but it's possible that this is expected.

Having a term that's just a punctuation character in the index is generally not very useful ... but there are OTHER situations with this filter where preserveOriginal *is* the behavior you want.  I would imagine that as long as you don't have terms that completely disappear when the filter runs, it would behave correctly.  Try replacing the "," with "x," to see what I mean.

Also, FYI, when using a Graph filter, the index analysis chain must also have this filter (but not the query analysis):

        <filter class="solr.FlattenGraphFilterFactory"/>

Adding that didn't seem to fix the behavior that concerns you, but the docs do say it's required on the index analysis whenever using a Graph filter.

Thanks,
Shawn



Re: WordDelimiterGraphFilter expected behaviour ?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/29/2018 1:48 PM, Kelvyn Scrupps wrote:
> I'm using WordDelimiterGraphFilter on a field and came across a curious additional positional "hole" generated by the filter while playing with the analysis tool.  
> For input "wibble , wobble" (space either side of the comma so it's a separate token), the output introduces an additional positional hole after the comma, i.e. 
>
> Term     position
> wibble   1
> ,        2
> wobble   4 *
>
> The positionLength for each is 1, so there's no obvious graph span going on.
>
> It's not just the comma; any punctuation will do, e.g. "wibble ! wobble".

The wrinkle here is enabling preserveOriginal at the same time that you
have a term which is completely removed by the filter (in this case, the
comma).  If preserveOriginal is disabled, the old filter and the graph
filter both behave the same.  I don't know if this is a bug or not.  My
instinct is to say it's a bug, but it's possible that this is expected.
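
One way to compare the two filters side by side in the analysis tool is
sketched below.  It assumes two throwaway field types (the names
text_wdf_test and text_wdgf_test are hypothetical) that differ only in
the delimiter filter class, plus the flatten step the graph filter needs
at index time.  The filter options are copied from the original post with
preserveOriginal switched off; the part of the chain elided there is
left out.

```xml
<!-- Hypothetical test type using the old, non-graph filter -->
<fieldType name="text_wdf_test" class="solr.TextField" positionIncrementGap="100">
	<analyzer type="index">
		<tokenizer class="solr.WhitespaceTokenizerFactory"/>
		<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnNumerics="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="0" stemEnglishPossessive="1"/>
	</analyzer>
</fieldType>

<!-- Hypothetical test type using the graph filter with the same options -->
<fieldType name="text_wdgf_test" class="solr.TextField" positionIncrementGap="100">
	<analyzer type="index">
		<tokenizer class="solr.WhitespaceTokenizerFactory"/>
		<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" splitOnNumerics="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="0" stemEnglishPossessive="1"/>
		<!-- index-time only: flattens the token graph produced by the filter above -->
		<filter class="solr.FlattenGraphFilterFactory"/>
	</analyzer>
</fieldType>
```

Running "wibble , wobble" through both types, then flipping
preserveOriginal to "1" in each, should show whether the extra hole only
appears in the graph variant.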

Having a term that's just a punctuation character in the index is
generally not very useful ... but there are OTHER situations with this
filter where preserveOriginal *is* the behavior you want.  I would
imagine that as long as you don't have terms that completely disappear
when the filter runs, it would behave correctly.  Try replacing the ","
with "x," to see what I mean.

Also, FYI, when using a Graph filter, the index analysis chain must also
have this filter (but not the query analysis):

        <filter class="solr.FlattenGraphFilterFactory"/>

Adding that didn't seem to fix the behavior that concerns you, but the
docs do say it's required on the index analysis whenever using a Graph
filter.
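
As a rough sketch, the field type from the original post might end up
looking like this.  Placing FlattenGraphFilterFactory directly after the
graph filter follows the pattern shown in the reference guide.  The query
analyzer is an assumption, since only the index analyzer was posted, and
the filters elided with "..." in the original are not reproduced.

```xml
<fieldType name="text_en_allies" class="solr.TextField" positionIncrementGap="100">
	<analyzer type="index">
		<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
		<tokenizer class="solr.WhitespaceTokenizerFactory"/>
		<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" splitOnNumerics="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1" stemEnglishPossessive="1"/>
		<!-- index side only: the index cannot store a token graph, so flatten it here -->
		<filter class="solr.FlattenGraphFilterFactory"/>
		<!-- the remaining filters of the original chain (elided in the post) would follow -->
	</analyzer>
	<!-- assumed query-side chain: same graph filter, but no FlattenGraphFilterFactory -->
	<analyzer type="query">
		<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
		<tokenizer class="solr.WhitespaceTokenizerFactory"/>
		<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" splitOnNumerics="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1" stemEnglishPossessive="1"/>
	</analyzer>
</fieldType>
```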

Thanks,
Shawn