Posted to java-user@lucene.apache.org by 小鱼儿 <ct...@gmail.com> on 2020/01/10 08:24:11 UTC

Question about PhraseQuery's capacity...

I use SmartChineseAnalyzer for indexing and add a document with a
TextField whose value is a long sentence; when analyzed, it yields 18 terms.

Then I use the same value to construct a PhraseQuery, setting the slop to 2
and adding the 18 terms consecutively.

I expect the search API to find this document, but it returns no results.

What am I doing wrong?

Re: Question about PhraseQuery's capacity...

Posted by 小鱼儿 <ct...@gmail.com>.

Hi, I have filed an issue against lucene-core:
https://issues.apache.org/jira/browse/LUCENE-9130
I just wrote a test case and found that a BooleanQuery with MUST clauses
matches, but the PhraseQuery fails.
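
One possible reading of this asymmetry can be illustrated with a toy model in plain Java. This is not Lucene code: the class PhraseModel and both methods are hypothetical, each term is assumed to occur once, and the assumption that the analyzer leaves position gaps (for example where a stop filter drops tokens) is not confirmed anywhere in this thread. A conjunction only needs every term to be present, while a sloppy phrase compares the document positions with the query positions.

```java
import java.util.List;
import java.util.Map;

/** Toy model of term positions in a single field; not Lucene's implementation. */
class PhraseModel {

    /** Conjunction (all clauses MUST): every query term only has to occur somewhere. */
    public static boolean matchesAllTerms(Map<String, Integer> docPositions,
                                          List<String> queryTerms) {
        return docPositions.keySet().containsAll(queryTerms);
    }

    /**
     * Simplified sloppy-phrase check for terms that each occur once.
     * Query term i is placed at position i (what adding terms without
     * explicit positions amounts to). The phrase matches when the spread
     * of (document position - query position) is at most the slop.
     */
    public static boolean matchesPhrase(Map<String, Integer> docPositions,
                                        List<String> queryTerms, int slop) {
        if (queryTerms.isEmpty()) {
            return false;
        }
        int minShift = Integer.MAX_VALUE;
        int maxShift = Integer.MIN_VALUE;
        for (int queryPos = 0; queryPos < queryTerms.size(); queryPos++) {
            Integer docPos = docPositions.get(queryTerms.get(queryPos));
            if (docPos == null) {
                return false; // a missing term can never match
            }
            int shift = docPos - queryPos;
            minShift = Math.min(minShift, shift);
            maxShift = Math.max(maxShift, shift);
        }
        return maxShift - minShift <= slop;
    }
}
```

In this model, a document whose terms sit at positions 0, 1, 3, 5, 7 (three gaps) satisfies the conjunction, but a five-term phrase built with consecutive positions needs a slop of at least 3, so a slop of 2 is not enough once a long sentence contains more than two gaps.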

On Fri, Jan 10, 2020 at 7:14 PM 小鱼儿 <ct...@gmail.com> wrote:

> The explain API helps! Thanks for the hint!
> I found that one case failed because I had carelessly added another
> filter condition, but the other case (which is analyzed into 30 terms)
> still fails, and I don't know why.
> I guess I need to write a unit test that uses the MultiTerms.getTerms API to
> find out whether there is a mismatch in the analyzer's processing or a
> capacity limit in PhraseQuery...

Re: Question about PhraseQuery's capacity...

Posted by 小鱼儿 <ct...@gmail.com>.

The explain API helps! Thanks for the hint!
I found that one case failed because I had carelessly added another
filter condition, but the other case (which is analyzed into 30 terms)
still fails, and I don't know why.
I guess I need to write a unit test that uses the MultiTerms.getTerms API to
find out whether there is a mismatch in the analyzer's processing or a
capacity limit in PhraseQuery...
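
A minimal sketch of such a check, assuming Lucene 8.x (where MultiTerms is in org.apache.lucene.index) and a field indexed with positions, as TextField does by default. The class IndexedPositions and the method dump are hypothetical names chosen for illustration. It lists every term of one document together with the positions stored in the index, so they can be compared with the positions used in the query.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;

/** Hypothetical debugging helper; not part of Lucene. */
class IndexedPositions {

    /** Returns term -> positions for the given field of one document. */
    public static Map<String, List<Integer>> dump(IndexReader reader, String field, int docId)
            throws IOException {
        Map<String, List<Integer>> result = new TreeMap<>();
        Terms terms = MultiTerms.getTerms(reader, field);
        if (terms == null) {
            return result; // the field does not exist in this index
        }
        TermsEnum termsEnum = terms.iterator();
        PostingsEnum postings = null;
        while (termsEnum.next() != null) {
            String text = termsEnum.term().utf8ToString();
            postings = termsEnum.postings(postings, PostingsEnum.POSITIONS);
            // Skip terms that do not occur in the requested document.
            if (postings.advance(docId) == docId) {
                int freq = postings.freq();
                List<Integer> positions = new ArrayList<>(freq);
                for (int i = 0; i < freq; i++) {
                    positions.add(postings.nextPosition());
                }
                result.put(text, positions);
            }
        }
        return result;
    }
}
```

Walking the whole term dictionary is slow on a large index, so this is only meant for a small unit-test index.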

On Fri, Jan 10, 2020 at 6:21 PM Mikhail Khludnev <mk...@apache.org> wrote:

> Hello,
> IndexSearcher.explain(Query, int) can sometimes help to analyse mismatches.

Re: Question about PhraseQuery's capacity...

Posted by Mikhail Khludnev <mk...@apache.org>.

Hello,
IndexSearcher.explain(Query, int) can sometimes help to analyse mismatches.
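
A minimal sketch of how that call might be used. The class ExplainHelper and the method describe are hypothetical names. The document id passed in is assumed to be already known, for example from a simpler query that does match the document.

```java
import java.io.IOException;

import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

/** Hypothetical debugging helper; not part of Lucene. */
class ExplainHelper {

    /** Reports whether docId matches the query, followed by Lucene's explanation tree. */
    public static String describe(IndexSearcher searcher, Query query, int docId)
            throws IOException {
        Explanation explanation = searcher.explain(query, docId);
        String verdict = explanation.isMatch() ? "MATCH" : "NO MATCH";
        // Explanation.toString() renders the nested details, one clause per line.
        return verdict + "\n" + explanation;
    }
}
```

For a query that unexpectedly returns nothing, the explanation tree of the expected document usually points at the clause that failed to match.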

On Fri, Jan 10, 2020 at 1:13 PM 小鱼儿 <ct...@gmail.com> wrote:

> Even after calling Analyzer.tokenStream() directly to extract the terms from
> the query, I still cannot get any results, and I don't know why...


-- 
Sincerely yours
Mikhail Khludnev

Re: Question about PhraseQuery's capacity...

Posted by 小鱼儿 <ct...@gmail.com>.

Even after calling Analyzer.tokenStream() directly to extract the terms from
the query, I still cannot get any results, and I don't know why...

Code used to build the index:
    // analyzer is a new SmartChineseAnalyzer()
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);

Code used to run the query:
(1) Extract the terms from the query text:

    public List<String> analysis(String fieldName, String text) {
        List<String> terms = new ArrayList<String>();
        TokenStream stream = analyzer.tokenStream(fieldName, text);
        try {
            stream.reset();
            while (stream.incrementToken()) {
                CharTermAttribute termAtt = stream.getAttribute(CharTermAttribute.class);
                String term = termAtt.toString();
                terms.add(term);
            }
            stream.end();
        } catch (IOException e) {
            e.printStackTrace();
            log.error(e.getMessage(), e);
        }
        return terms;
    }

(2) Code to construct a PhraseQuery:

    private Query buildPhraseQuery(Analyzer analyzer, String fieldName, String queryText, int slop) {
        PhraseQuery.Builder builder = new PhraseQuery.Builder();
        builder.setSlop(2); //? max is 2;
        List<String> terms = analyzer.analysis(fieldName, queryText);
        for (String termKeyword : terms) {
            Term term = new Term(fieldName, termKeyword);
            builder.add(term);
        }
        Query query = builder.build();
        return query;
    }

Using a BooleanQuery also fails:

    private Query buildBooleanANDQuery(Analyzer analyzer, String fieldName, String queryText) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        List<String> terms = analyzer.analysis(fieldName, queryText);
        log.info("terms: " + StringUtils.join(terms, ", "));
        for (String termKeyword : terms) {
            Term term = new Term(fieldName, termKeyword);
            builder.add(new TermQuery(term), BooleanClause.Occur.MUST);
        }
        return builder.build();
    }

On Fri, Jan 10, 2020 at 4:53 PM Adrien Grand <jp...@gmail.com> wrote:

> It should match. My guess is that you might not be reusing the same positions
> as set by the analysis chain when creating the phrase query? Can you show
> us how you build the phrase query?

Re: Question about PhraseQuery's capacity...

Posted by 小鱼儿 <ct...@gmail.com>.

Hi Adrien,
     I think I may have made a mistake:
     There are two levels of processing in an Analyzer class: one is the
Tokenizer, which is HMMChineseTokenizer, and the other is the Analyzer
itself, which may apply some filtering...
     I'm using Lucene's default interface to set an Analyzer instance for
indexing, but I'm using the Tokenizer to parse the raw query text when
building the Query.
     The weird thing is that there is a Lucene query-parser module, but it
handles meta syntax like AND/OR and fieldName:xxx, so I think it cannot
directly handle raw query text?
     But when I try to use the higher-level Analyzer.tokenStream() to split
the raw query text into separate terms, I find the API very confusing:
TokenStream has no obvious method for getting the terms (filtered tokens),
only the Attribute concept, which seems to be used only in Lucene internals.
Where can I find sample code that extracts the filtered tokens from the
TokenStream interface?

On Fri, Jan 10, 2020 at 4:53 PM Adrien Grand <jp...@gmail.com> wrote:

> It should match. My guess is that you might not be reusing the same positions
> as set by the analysis chain when creating the phrase query? Can you show
> us how you build the phrase query?

Re: Question about PhraseQuery's capacity...

Posted by Adrien Grand <jp...@gmail.com>.

It should match. My guess is that you might not be reusing the same positions
as set by the analysis chain when creating the phrase query? Can you show
us how you build the phrase query?
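
One way to reuse the analysis chain's positions, sketched under the assumption that the analyzer reports gaps through PositionIncrementAttribute (as a stop filter normally does). The class PositionAwarePhrase and the method build are hypothetical names. PhraseQuery.Builder.add(Term, int) is the Lucene overload that takes an explicit position.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

/** Hypothetical helper; not part of Lucene. */
class PositionAwarePhrase {

    /** Builds a PhraseQuery whose term positions mirror those assigned by the analyzer. */
    public static PhraseQuery build(Analyzer analyzer, String field, String text, int slop)
            throws IOException {
        PhraseQuery.Builder builder = new PhraseQuery.Builder();
        builder.setSlop(slop);
        // try-with-resources closes the stream, which the TokenStream workflow requires.
        try (TokenStream stream = analyzer.tokenStream(field, text)) {
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            PositionIncrementAttribute posIncAtt =
                stream.addAttribute(PositionIncrementAttribute.class);
            stream.reset();
            int position = -1; // a first increment of 1 puts the first token at position 0
            while (stream.incrementToken()) {
                // An increment above 1 means tokens were dropped before this one.
                position += posIncAtt.getPositionIncrement();
                builder.add(new Term(field, termAtt.toString()), position);
            }
            stream.end();
        }
        return builder.build();
    }
}
```

Lucene's org.apache.lucene.util.QueryBuilder offers createPhraseQuery(field, queryText, phraseSlop), which analyzes the text itself and may be simpler than a hand-written loop.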

On Fri, Jan 10, 2020 at 9:24 AM 小鱼儿 <ct...@gmail.com> wrote:

> I use SmartChineseAnalyzer for indexing and add a document with a
> TextField whose value is a long sentence; when analyzed, it yields 18
> terms.
>
> Then I use the same value to construct a PhraseQuery, setting the slop to 2
> and adding the 18 terms consecutively.
>
> I expect the search API to find this document, but it returns no results.
>
> What am I doing wrong?
>

-- 
Adrien