
Posted to dev@lucene.apache.org by "Dawid Weiss (JIRA)" <ji...@apache.org> on 2017/05/22 10:36:04 UTC

[jira] [Created] (LUCENE-7842) WordDelimiterGraphFilter adds an extra position for "foo - bar"

Dawid Weiss created LUCENE-7842:
-----------------------------------

             Summary: WordDelimiterGraphFilter adds an extra position for "foo - bar"
                 Key: LUCENE-7842
                 URL: https://issues.apache.org/jira/browse/LUCENE-7842
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Dawid Weiss
            Priority: Minor


This is odd. We have a WordDelimiterGraphFilter configured with:

GENERATE_WORD_PARTS | PRESERVE_ORIGINAL | GENERATE_NUMBER_PARTS | STEM_ENGLISH_POSSESSIVE

For the input "foo - bar" it creates the following token sequence:
{code}
foo, -, bar
{code}
but with an extra position skipped after the dash, as the token graph shows:
{code}
digraph tokens {
  graph [ fontsize=30 labelloc="t" label="" splines=true overlap=false rankdir = "LR" ];
  // A2 paper size
  size = "34.4,16.5";
  edge [ fontname="Helvetica" fontcolor="red" color="#606060" ]
  node [ style="filled" fillcolor="#e8e8f0" shape="Mrecord" fontname="Helvetica" ]

  0 [label="0"]
  -1 [shape=point color=white]
  -1 -> 0 []
  0 -> 1 [ label="foo"]
  1 [label="1"]
  1 -> 2 [ label="-"]
  3 [label="3"]
  2 -> 3 [ style="dotted"]
  3 -> 4 [ label="bar"]
  -2 [shape=point color=white]
  4 -> -2 []
}
{code}
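
To make the gap explicit, here is a minimal sketch in plain Java (deliberately not using Lucene's TokenStream API) that turns position increments into absolute positions. The increments 1, 1, 2 are read off the graph above: the dotted edge from node 2 to node 3 means "bar" arrives with an increment of 2. The class and method names are hypothetical and serve only as illustration.

```java
import java.util.Arrays;

class TokenPositions {
    /**
     * Converts per-token position increments into absolute positions.
     * Starts from -1 so that a first token with increment 1 lands on position 0.
     */
    static int[] absolutePositions(int[] posIncs) {
        int[] positions = new int[posIncs.length];
        int pos = -1;
        for (int i = 0; i < posIncs.length; i++) {
            pos += posIncs[i];
            positions[i] = pos;
        }
        return positions;
    }

    /** Returns the index of the first token preceded by a hole (increment > 1), or -1 if none. */
    static int firstHole(int[] posIncs) {
        for (int i = 0; i < posIncs.length; i++) {
            if (posIncs[i] > 1) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] terms = {"foo", "-", "bar"};
        int[] observed = {1, 1, 2}; // increments as read off the graph above
        System.out.println(Arrays.toString(absolutePositions(observed))); // [0, 1, 3]
        System.out.println("hole before: " + terms[firstHole(observed)]); // hole before: bar
    }
}
```

Without the extra skip the increments would be 1, 1, 1 and the positions 0, 1, 2.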

This in turn causes Solr's default query parser to generate a span query that fails to find the original document.
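
A rough illustration of why a position gap can break an exact-order match. This is a simplified stand-in, not Lucene's span query implementation, and it assumes (this is not stated above) that the document was indexed with consecutive positions 0, 1, 2 while the query side carries the gap. All names are hypothetical.

```java
import java.util.Map;

class AdjacencyCheck {
    /**
     * Returns true if the document contains every query term at the same
     * relative offsets as in the query (zero slop, in order).
     * doc maps an absolute position to the term indexed there.
     */
    static boolean matches(Map<Integer, String> doc, String[] terms, int[] queryPos) {
        for (int start : doc.keySet()) {
            boolean ok = true;
            for (int i = 0; i < terms.length && ok; i++) {
                int expected = start + (queryPos[i] - queryPos[0]);
                ok = terms[i].equals(doc.get(expected));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumption: the indexed document has no gap between "-" and "bar".
        Map<Integer, String> doc = Map.of(0, "foo", 1, "-", 2, "bar");
        String[] terms = {"foo", "-", "bar"};
        System.out.println(matches(doc, terms, new int[] {0, 1, 3})); // false: query expects "bar" at 3
        System.out.println(matches(doc, terms, new int[] {0, 1, 2})); // true
    }
}
```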

Am I missing something or is this incorrect?


