Posted to java-user@lucene.apache.org by hariram ravichandran <ha...@gmail.com> on 2017/07/24 14:53:42 UTC

Tokens produced by Shingle filter are not added in the query

I'm using Lucene 4.10.4 and trying to construct shingles (combinations of
adjacent tokens).


Code:

public class CustomAnalyzer extends Analyzer {
    @Override
    protected Analyzer.TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
        final WhitespaceTokenizer src = new WhitespaceTokenizer(getVersion(), reader);
        TokenStream tok = new ShingleFilter(src, 2, 3);
        tok = new ClassicFilter(tok);
        tok = new LowerCaseFilter(tok);
//        tok = new SynonymFilter(tok,SynonymDictionary.getSynonymMap(),true);
        return new Analyzer.TokenStreamComponents(src, tok);
    }
}

public class Test {
    public static void main(String[] args) throws Exception {
        CustomSynonymAnalyzer analyzer = new CustomSynonymAnalyzer();
        String queryStr = "cup board";
        TokenStream ts = new CustomAnalyzer().tokenStream("n", new StringReader(queryStr));
        ts.reset();
        System.out.println("Tokens are :");
        while (ts.incrementToken()) {
            System.out.print(ts.getAttribute(CharTermAttribute.class) + ", ");
        }
        QueryParser parser = new QueryParser("n", analyzer);
        Query query = null;
        query = parser.parse(queryStr);
        System.out.println("\nQuery is");
        System.out.print(query.toString());
    }
}



> Output:
> Tokens are :
> cup, cup board, board
> Query is n
> n:cup n:board
>

The tokens are printed as expected, and I expect the resulting query to be
*n:cup n:board n:cup board*. However, the tokens formed by the shingle filter
are not added to the query; I get only *n:cup n:board*. Where is my mistake?

Thanks.

Re: Tokens produced by Shingle filter are not added in the query

Posted by Steve Rowe <sa...@gmail.com>.

hariram,

Until Lucene 6.2, there was no way for the classic query parser to *not* first split on whitespace before sending text to the analyzer.  As a result, filters like ShingleFilter that operate on multiple tokens see only one token at a time.  In your example, the analyzer first receives “cup” as the full text to analyze and then, separately, “board”, so ShingleFilter cannot form any multi-token shingles under those conditions.

For more details see <https://issues.apache.org/jira/browse/LUCENE-2605>.
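
The effect described above can be reproduced without Lucene. The following is a minimal plain-Java sketch, not Lucene code: the class ShingleSplitDemo and its methods analyze and splitThenAnalyze are hypothetical names made up for this illustration. analyze is a simplified stand-in for the whitespace tokenizer + ShingleFilter(2, 3) + LowerCaseFilter chain (ClassicFilter is left out), and splitThenAnalyze imitates a parser that splits the query on whitespace before handing each piece to the analyzer. In Lucene 6.2 and later, the classic QueryParser has a setSplitOnWhitespace(boolean) setter that turns this pre-splitting off; that option is not available in 4.10.4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Plain-Java simulation of the behaviour discussed in this thread; it uses no Lucene classes. */
final class ShingleSplitDemo {

    /**
     * Simplified stand-in for the analyzer: lowercases the text, splits it on
     * whitespace, and emits each token followed by the 2- and 3-token shingles
     * that start at that token.
     */
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        for (int i = 0; i < words.length; i++) {
            StringBuilder shingle = new StringBuilder(words[i]);
            out.add(shingle.toString()); // the single token itself
            for (int n = 2; n <= 3 && i + n <= words.length; n++) {
                shingle.append(' ').append(words[i + n - 1]);
                out.add(shingle.toString()); // shingle of n tokens starting at i
            }
        }
        return out;
    }

    /**
     * Imitates a query parser that splits on whitespace first: every chunk is
     * analyzed on its own, so the shingle step never sees two tokens together.
     */
    static List<String> splitThenAnalyze(String query) {
        List<String> out = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            out.addAll(analyze(chunk));
        }
        return out;
    }

    public static void main(String[] args) {
        String query = "cup board";
        System.out.println("analyze whole text: " + analyze(query));
        System.out.println("split, then analyze: " + splitThenAnalyze(query));
    }
}
```

Run as is, the first line lists cup, cup board, board (what the direct tokenStream() call printed), while the second lists only cup, board (what ends up in the parsed query).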

--
Steve
www.lucidworks.com



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Tokens produced by Shingle filter are not added in the query

Posted by hariram ravichandran <ha...@gmail.com>.

Hi Steve,
    I'm sorry, that was my mistake. The analyzer passed to the QueryParser is also CustomAnalyzer.

public class CustomAnalyzer extends Analyzer {
    @Override
    protected Analyzer.TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
        final WhitespaceTokenizer src = new WhitespaceTokenizer(getVersion(), reader);
        TokenStream tok = new ShingleFilter(src, 2, 3);
        tok = new ClassicFilter(tok);
        tok = new LowerCaseFilter(tok);
//        tok = new SynonymFilter(tok,SynonymDictionary.getSynonymMap(),true);
        return new Analyzer.TokenStreamComponents(src, tok);
    }
}

public class Test {
    public static void main(String[] args) throws Exception {
        CustomAnalyzer analyzer = new CustomAnalyzer();
        String queryStr = "cup board";
        TokenStream ts = new CustomAnalyzer().tokenStream("n", new StringReader(queryStr));
        ts.reset();
        System.out.println("Tokens are :");
        while (ts.incrementToken()) {
            System.out.print(ts.getAttribute(CharTermAttribute.class) + ", ");
        }
        QueryParser parser = new QueryParser("n", analyzer);
        Query query = null;
        query = parser.parse(queryStr);
        System.out.println("\nQuery is");
        System.out.print(query.toString());
    }
}


Output:
> Tokens are :
> cup, cup board, board
> Query is n
> n:cup n:board
>



Re: Tokens produced by Shingle filter are not added in the query

Posted by Steve Rowe <sa...@gmail.com>.

Hi hariram,

There may be other problems, but at a minimum you have two different analysis classes here.  You’re printing the token stream from one (CustomAnalyzer), but constructing the query with a different one (CustomSynonymAnalyzer, the source of which is not shown in your email).

--
Steve
www.lucidworks.com
