Posted to java-user@lucene.apache.org by Eva Popenda <Ev...@abas.de> on 2016/04/18 17:27:33 UTC
Problem with NGramAnalyzer, PhraseQuery and Highlighter
Hi,
I have a problem when using the Highlighter with an NGramAnalyzer and a PhraseQuery:
Searching for a substring of length N (4 in my case) yields the following exception:
java.lang.IllegalArgumentException: Less than 2 subSpans.size():1
at org.apache.lucene.search.spans.ConjunctionSpans.<init>(ConjunctionSpans.java:40)
at org.apache.lucene.search.spans.NearSpansOrdered.<init>(NearSpansOrdered.java:56)
at org.apache.lucene.search.spans.SpanNearQuery$SpanNearWeight.getSpans(SpanNearQuery.java:232)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedSpanTerms(WeightedSpanTermExtractor.java:292)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:137)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:506)
at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:219)
at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:187)
at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:196)
Below is a JUnit test reproducing this behavior. The problem does not occur when searching for a string of more than N characters, or when using NGramPhraseQuery.
Why are at least two subSpans required?
public class HighlighterTest {

    @Rule
    public final ExpectedException exception = ExpectedException.none();

    @Test
    public void testHighlighterWithPhraseQueryThrowsException() throws IOException, InvalidTokenOffsetsException {
        final Analyzer analyzer = new NGramAnalyzer(4);
        final String fieldName = "substring";

        final List<BytesRef> list = new ArrayList<>();
        list.add(new BytesRef("uchu"));
        final PhraseQuery query = new PhraseQuery(fieldName, list.toArray(new BytesRef[list.size()]));

        final QueryScorer fragmentScorer = new QueryScorer(query, fieldName);
        final SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<b>", "</b>");

        exception.expect(IllegalArgumentException.class);
        exception.expectMessage("Less than 2 subSpans.size():1");

        final Highlighter highlighter = new Highlighter(formatter, TextEncoder.NONE.getEncoder(), fragmentScorer);
        highlighter.setTextFragmenter(new SimpleFragmenter(100));
        final String fragment = highlighter.getBestFragment(analyzer, fieldName, "Buchung");

        assertEquals("B<b>uchu</b>ng", fragment);
    }

    public final class NGramAnalyzer extends Analyzer {

        private final int minNGram;

        public NGramAnalyzer(final int minNGram) {
            super();
            this.minNGram = minNGram;
        }

        @Override
        protected TokenStreamComponents createComponents(final String fieldName) {
            final Tokenizer source = new NGramTokenizer(minNGram, minNGram);
            return new TokenStreamComponents(source);
        }
    }
}
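[Editor's note: to see why a query string of length exactly N is the problematic case, it helps to look at what a fixed-size n-gram tokenizer emits. The following standalone sketch mirrors what NGramTokenizer(4, 4) does; NGramSketch and grams are illustrative names, not Lucene classes.]

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {
    // Fixed-size character n-grams, mirroring NGramTokenizer(4, 4)
    // (a standalone stand-in, not the Lucene tokenizer itself).
    static List<String> grams(String text, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        // The indexed text "Buchung" (7 chars) yields four overlapping grams:
        System.out.println(grams("Buchung", 4)); // [Buch, uchu, chun, hung]
        // A query string of length exactly N yields a single gram, so the
        // phrase query has only one position -- and the span query it is
        // rewritten into ends up with only one subSpan:
        System.out.println(grams("uchu", 4)); // [uchu]
    }
}
```

A longer query string produces two or more grams, which is why searching for more than N characters does not trigger the exception.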
Thanks and cheers,
Eva
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Problem with NGramAnalyzer, PhraseQuery and Highlighter
Posted by Eva Popenda <Ev...@abas.de>.
Hi Alan,
thank you, I have opened a JIRA ticket.
Cheers,
Eva
On 18.04.2016 19:01, Alan Woodward wrote:
> Hi Eva,
>
> This looks like a bug in WeightedSpanTermExtractor, which is rewriting your PhraseQuery into a SpanNearQuery without checking how many terms there are. Could you open a JIRA ticket?
>
> Alan Woodward
> www.flax.co.uk
>
>
>> [quoted original message trimmed]
--
Eva Popenda | Software-Entwicklerin | Technische Entwicklung
abas Software AG | Gartenstraße 67 | 76135 Karlsruhe | Germany
Web: http://www.abas-software.com | http://www.abas.de
Board of Directors / Vorstand: Michael Baier, Jürgen Nöding, Mario Raatz, Werner Strub
Chairman Board of Directors / Vorstandsvorsitzender: Werner Strub
Chairman Supervisory Board / Aufsichtsratsvorsitzender: Udo Stößer
Registered Office / Sitz der Gesellschaft: Karlsruhe
Commercial Register / Handelsregister: HRB 107644 Amtsgericht Mannheim
Re: Problem with NGramAnalyzer, PhraseQuery and Highlighter
Posted by Alan Woodward <al...@flax.co.uk>.
Hi Eva,
This looks like a bug in WeightedSpanTermExtractor, which is rewriting your PhraseQuery into a SpanNearQuery without checking how many terms there are. Could you open a JIRA ticket?
Alan Woodward
www.flax.co.uk
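[Editor's note: the guard Alan describes can be sketched as follows. SpanQuery, SpanTermQuery, SpanNearQuery and rewritePhrase here are minimal stand-in types for illustration, not the real Lucene classes or the actual fix.]

```java
import java.util.ArrayList;
import java.util.List;

public class RewriteSketch {
    // Minimal stand-ins for the query types involved (illustrative only).
    interface SpanQuery {}

    static class SpanTermQuery implements SpanQuery {
        final String term;
        SpanTermQuery(String term) { this.term = term; }
    }

    static class SpanNearQuery implements SpanQuery {
        final List<SpanQuery> clauses;
        SpanNearQuery(List<SpanQuery> clauses) { this.clauses = clauses; }
    }

    // A one-term phrase has nothing to be "near" to; rewriting it into a
    // near query trips the "Less than 2 subSpans" check in ConjunctionSpans.
    // The guard: fall back to a single span term query in that case.
    static SpanQuery rewritePhrase(List<String> terms) {
        if (terms.size() == 1) {
            return new SpanTermQuery(terms.get(0));
        }
        List<SpanQuery> clauses = new ArrayList<>();
        for (String t : terms) {
            clauses.add(new SpanTermQuery(t));
        }
        return new SpanNearQuery(clauses);
    }
}
```

With this check, rewritePhrase(List.of("uchu")) yields a single-term query instead of a degenerate near query.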
> [quoted original message trimmed]