Posted to java-user@lucene.apache.org by hariram ravichandran <ha...@gmail.com> on 2017/07/24 14:53:42 UTC
Tokens produced by Shingle filter are not added in the query
I'm using Lucene 4.10.4 and trying to construct shingles (combinations of
tokens).
Code:
public class CustomAnalyzer extends Analyzer {
    @Override
    protected Analyzer.TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
        final WhitespaceTokenizer src = new WhitespaceTokenizer(getVersion(), reader);
        TokenStream tok = new ShingleFilter(src, 2, 3);
        tok = new ClassicFilter(tok);
        tok = new LowerCaseFilter(tok);
        // tok = new SynonymFilter(tok, SynonymDictionary.getSynonymMap(), true);
        return new Analyzer.TokenStreamComponents(src, tok);
    }
}
public class Test {
    public static void main(String[] args) throws Exception {
        CustomSynonymAnalyzer analyzer = new CustomSynonymAnalyzer();
        String queryStr = "cup board";
        TokenStream ts = new CustomAnalyzer().tokenStream("n", new StringReader(queryStr));
        ts.reset();
        System.out.println("Tokens are :");
        while (ts.incrementToken()) {
            System.out.print(ts.getAttribute(CharTermAttribute.class) + ", ");
        }
        QueryParser parser = new QueryParser("n", analyzer);
        Query query = null;
        query = parser.parse(queryStr);
        System.out.println("\nQuery is");
        System.out.print(query.toString());
    }
}
> Output:
> Tokens are :
> cup, cup board, board
> Query is n
> n:cup n:board
>
The tokens are printed as expected, and I expected the resulting query to be
*n:cup n:board n:cup board*. But the tokens formed by the shingle filter are
not added to the query; I get only *n:cup n:board*. Where is my mistake?
Thanks.
Re: Tokens produced by Shingle filter are not added in the query
Posted by Steve Rowe <sa...@gmail.com>.
hariram,
Until Lucene 6.2, there was no way for the classic query parser to *not* first split on whitespace before sending text to the analyzer. As a result, filters like ShingleFilter that operate on multiple tokens will only see one token at a time; in your example: first “cup” as the full text to analyze, and then, separately, “board” - ShingleFilter is incapable under those conditions of forming any multi-token synthetic tokens.
For more details see <https://issues.apache.org/jira/browse/LUCENE-2605>.
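The effect described above can be sketched without Lucene. The following is a minimal illustration, not Lucene code: the class ShingleSplitDemo and its methods shingles and analyzePerChunk are hypothetical stand-ins that only imitate what a shingle filter does when it sees the whole query text versus one whitespace-separated chunk at a time. It assumes a minimum shingle size of at least 2, as in the analyzer from this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSplitDemo {

    // Hypothetical stand-in for a shingle filter: emits each word, followed by
    // every shingle of minSize..maxSize words that starts at that word.
    public static List<String> shingles(String text, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            out.add(words[i]); // the single word itself
            for (int n = minSize; n <= maxSize && i + n <= words.length; n++) {
                out.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
            }
        }
        return out;
    }

    // Imitates a parser that splits the query on whitespace first and then
    // analyzes every chunk separately, so each analysis sees only one word.
    public static List<String> analyzePerChunk(String query, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            out.addAll(shingles(chunk, minSize, maxSize));
        }
        return out;
    }

    public static void main(String[] args) {
        // Whole text analyzed at once: the two-word shingle can be formed.
        System.out.println(shingles("cup board", 2, 3));        // [cup, cup board, board]
        // Text split on whitespace before analysis: no shingle is possible.
        System.out.println(analyzePerChunk("cup board", 2, 3)); // [cup, board]
    }
}
```

The first call corresponds to calling tokenStream() on the analyzer directly, the second to what the classic query parser does before 6.2. As far as I know, the option added by LUCENE-2605 is exposed on the classic QueryParser as setSplitOnWhitespace(false) in 6.2 and later; it is not available in 4.10.4.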
--
Steve
www.lucidworks.com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Tokens produced by Shingle filter are not added in the query
Posted by hariram ravichandran <ha...@gmail.com>.
Hi Steve,
I'm sorry. That's also CustomAnalyzer.
public class CustomAnalyzer extends Analyzer {
    @Override
    protected Analyzer.TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
        final WhitespaceTokenizer src = new WhitespaceTokenizer(getVersion(), reader);
        TokenStream tok = new ShingleFilter(src, 2, 3);
        tok = new ClassicFilter(tok);
        tok = new LowerCaseFilter(tok);
        // tok = new SynonymFilter(tok, SynonymDictionary.getSynonymMap(), true);
        return new Analyzer.TokenStreamComponents(src, tok);
    }
}

public class Test {
    public static void main(String[] args) throws Exception {
        CustomAnalyzer analyzer = new CustomAnalyzer();
        String queryStr = "cup board";
        TokenStream ts = new CustomAnalyzer().tokenStream("n", new StringReader(queryStr));
        ts.reset();
        System.out.println("Tokens are :");
        while (ts.incrementToken()) {
            System.out.print(ts.getAttribute(CharTermAttribute.class) + ", ");
        }
        QueryParser parser = new QueryParser("n", analyzer);
        Query query = null;
        query = parser.parse(queryStr);
        System.out.println("\nQuery is");
        System.out.print(query.toString());
    }
}
Output:
> Tokens are :
> cup, cup board, board
> Query is n
> n:cup n:board
>
Re: Tokens produced by Shingle filter are not added in the query
Posted by Steve Rowe <sa...@gmail.com>.
Hi hariram,
There may be other problems, but at a minimum you have two different analysis classes here. You’re printing the output stream from one (CustomAnalyzer), but constructing a query from a different one (CustomSynonymAnalyzer, the source of which is not shown in your email).
--
Steve
www.lucidworks.com