Posted to dev@lucene.apache.org by "Dawid Weiss (JIRA)" <ji...@apache.org> on 2017/05/22 10:36:04 UTC
[jira] [Created] (LUCENE-7842) WordDelimiterGraphFilter adds an extra position for "foo - bar"
Dawid Weiss created LUCENE-7842:
-----------------------------------
Summary: WordDelimiterGraphFilter adds an extra position for "foo - bar"
Key: LUCENE-7842
URL: https://issues.apache.org/jira/browse/LUCENE-7842
Project: Lucene - Core
Issue Type: Bug
Reporter: Dawid Weiss
Priority: Minor
This is odd. We have a WordDelimiterGraphFilter configured with:
GENERATE_WORD_PARTS | PRESERVE_ORIGINAL | GENERATE_NUMBER_PARTS | STEM_ENGLISH_POSSESSIVE
For the input "foo - bar" it produces the following token sequence:
{code}
foo, -, bar
{code}
but with an extra position skipped after the dash (the dotted edge from 2 to 3 in the graph below):
{code}
digraph tokens {
graph [ fontsize=30 labelloc="t" label="" splines=true overlap=false rankdir = "LR" ];
// A2 paper size
size = "34.4,16.5";
edge [ fontname="Helvetica" fontcolor="red" color="#606060" ]
node [ style="filled" fillcolor="#e8e8f0" shape="Mrecord" fontname="Helvetica" ]
0 [label="0"]
-1 [shape=point color=white]
-1 -> 0 []
0 -> 1 [ label="foo"]
1 [label="1"]
1 -> 2 [ label="-"]
3 [label="3"]
2 -> 3 [ style="dotted"]
3 -> 4 [ label="bar"]
-2 [shape=point color=white]
4 -> -2 []
}
{code}
This in turn causes Solr's default query parser to generate a span query that fails to match the original document.
Am I missing something or is this incorrect?
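For readers less familiar with position increments, the arithmetic behind the graph above can be sketched in plain Java. This is an illustration only, not Lucene code: the Token class and the positions/holes helpers are hypothetical, and the increments (1, 1, 2) are inferred from the graph, where "bar" starts at node 3 rather than node 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PositionHoles {

    /** Hypothetical stand-in for an analyzed token: term text plus position increment. */
    static final class Token {
        final String term;
        final int posInc;

        Token(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    /** Absolute position of each token; the running position starts at -1. */
    static int[] positions(List<Token> tokens) {
        int[] out = new int[tokens.size()];
        int pos = -1;
        for (int i = 0; i < tokens.size(); i++) {
            pos += tokens.get(i).posInc;
            out[i] = pos;
        }
        return out;
    }

    /** Positions skipped over by an increment greater than 1. */
    static List<Integer> holes(List<Token> tokens) {
        List<Integer> holes = new ArrayList<>();
        int pos = -1;
        for (Token t : tokens) {
            for (int skipped = pos + 1; skipped < pos + t.posInc; skipped++) {
                holes.add(skipped);
            }
            pos += t.posInc;
        }
        return holes;
    }

    public static void main(String[] args) {
        // Increments assumed from the graph: foo=1, "-"=1, bar=2.
        List<Token> reported = Arrays.asList(
            new Token("foo", 1), new Token("-", 1), new Token("bar", 2));
        // prints "[0, 1, 3] holes=[2]"
        System.out.println(Arrays.toString(positions(reported)) + " holes=" + holes(reported));
    }
}
```

With an increment of 1 on every token the positions would be 0, 1, 2 and the hole list would be empty, which is the sequence I expected here.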