Posted to java-user@lucene.apache.org by Peyman Faratin <pe...@robustlinks.com> on 2011/10/11 16:25:29 UTC

Shingles Filter problems

Hi

I have the following ShingleFilter chain (Lucene 3.2):

	public TokenStream tokenStream(String fieldName, Reader reader) {
		StandardTokenizer first = new StandardTokenizer(Version.LUCENE_32, reader);
		StandardFilter second = new StandardFilter(Version.LUCENE_32, first);
		LowerCaseFilter third = new LowerCaseFilter(Version.LUCENE_32, second);
		StopFilter fourth = new StopFilter(Version.LUCENE_32, third, Stopwords);
		PositionFilter fifth = new PositionFilter(fourth);
		ShingleFilter filter = new ShingleFilter(fifth, shingleSize);
		return filter;
	}

which produces the following token stream for the sentence

"please parse this sentence into a shingle of size 2. I'll pay $2 for it"

1: [_ parse:7->12:shingle] 
2: [parse:7->12:<ALPHANUM>] [parse sentence:7->26:shingle] 
3: [sentence:18->26:<ALPHANUM>] [sentence shingle:18->41:shingle] 
4: [shingle:34->41:<ALPHANUM>] [shingle size:34->49:shingle] 
5: [size:45->49:<ALPHANUM>] [size 2:45->51:shingle] 
6: [2:50->51:<NUM>] [2 pay:50->61:shingle] 
7: [pay:58->61:<ALPHANUM>] [pay 2:58->64:shingle] 
8: [2:63->64:<NUM>] 
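
For reference, here is a minimal plain-Java sketch of what a bigram shingle stage emits for a list of already-filtered tokens. It is not Lucene's ShingleFilter: the class and method names are hypothetical, and it ignores offsets, position increments and the "_" filler token shown in the listing above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; hypothetical names, no Lucene classes.
final class ShingleSketch {

    // Returns each token followed by the bigram it starts, or only the
    // bigrams when outputUnigrams is false.
    static List<String> shingles(List<String> tokens, boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) {
                out.add(tokens.get(i));
            }
            // A bigram needs a following token; the last token starts none.
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens =
            Arrays.asList("parse", "sentence", "shingle", "size", "2", "pay", "2");
        System.out.println(shingles(tokens, true));   // unigrams and bigrams
        System.out.println(shingles(tokens, false));  // bigrams only
    }
}
```

Note that a single token on its own produces no bigram, so with unigram output disabled the result for a one-token input is empty.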

The query analyzer produces the following analyzed query on the field "titleShingled" for the above sentence:

...... analyzed query:titleShingled:parse titleShingled:sentence titleShingled:shingle titleShingled:size titleShingled:2 titleShingled:pay titleShingled:2

As you can see, there are no bigram shingles in the query. I tried removing the unigrams from the token stream (using filter.setOutputUnigrams(false) in the above shingle filter), but even though the shingles seem to be fine, the query is empty:


1: [_ parse:7->12:shingle] 
2: [parse sentence:7->26:shingle] 
3: [sentence shingle:18->41:shingle] 
4: [shingle size:34->49:shingle] 
5: [size 2:45->51:shingle] 
6: [2 pay:50->61:shingle] 
7: [pay 2:58->64:shingle] 

...... analyzed query: 

My goal is to index both unigrams and bigrams, but to search on bigrams first. I think the QueryParser is handling the shingles in a way that I do not understand:

		  QueryParser parser = new QueryParser(Version.LUCENE_32,"titleShingled",new ShinglesAnalyzer(2,Stopwords));

Any help would be very much appreciated

Peyman


Re: Shingles Filter problems

Posted by Peyman Faratin <pe...@robustlinks.com>.
Hi Ian

I think I found the problem (from the tests here: http://www.devdaily.com/java/jwarehouse/lucene/contrib/analyzers/common/src/test/org/apache/lucene/analysis/shingle/ShingleAnalyzerWrapperTest.java.shtml).

If you generate the query as a BooleanQuery, it seems to work. The following works:

	BooleanQuery query = getShingleBooleanQuery(analyzer,title,fieldToSearch);
	TopDocs hits = searcher.search(query, 10);

where
	private static BooleanQuery getShingleBooleanQuery(Analyzer analyzer, String qs, String fieldToSearch) throws Exception {
		BooleanQuery q = new BooleanQuery();
		// Analyze the whole query string in one pass.
		TokenStream ts = analyzer.tokenStream(fieldToSearch, new StringReader(qs));
		CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
		ts.reset();
		while (ts.incrementToken()) {
			// Each emitted token (unigram or shingle) becomes an optional term clause.
			q.add(new TermQuery(new Term(fieldToSearch, termAtt.toString())), BooleanClause.Occur.SHOULD);
		}
		// Finish and release the stream once all tokens are consumed.
		ts.end();
		ts.close();
		System.out.println("... parsed query: " + q);
		return q;
	}
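
A possible explanation for why this works, offered as an assumption rather than something confirmed in this thread: the classic QueryParser splits the query string on whitespace before handing each piece to the analyzer, so a shingle filter only ever sees one token at a time and cannot build bigrams. With unigram output disabled nothing is left, which would match the empty query above. Building the BooleanQuery by hand passes the whole string through the analyzer in one go. The sketch below imitates both paths in plain Java, with no Lucene classes; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; hypothetical names, no Lucene classes.
final class WhitespaceSplitSketch {

    // Stand-in for the analyzer: whitespace tokenization plus bigram shingles.
    static List<String> analyze(String text, boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (outputUnigrams) {
                out.add(tokens[i]);
            }
            if (i + 1 < tokens.length) {
                out.add(tokens[i] + " " + tokens[i + 1]);
            }
        }
        return out;
    }

    // Stand-in for a parser that splits on whitespace first and then
    // analyzes each chunk separately (the assumed QueryParser behaviour).
    static List<String> parseChunkByChunk(String query, boolean outputUnigrams) {
        List<String> terms = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            terms.addAll(analyze(chunk, outputUnigrams));
        }
        return terms;
    }

    public static void main(String[] args) {
        String s = "simple sentences rule";
        System.out.println(parseChunkByChunk(s, true));   // no bigrams can form
        System.out.println(parseChunkByChunk(s, false));  // nothing survives
        System.out.println(analyze(s, false));            // whole string: bigrams
        System.out.println(Arrays.asList(s.split("\\s+")).size() + " chunks");
    }
}
```

If this assumption holds, any approach that feeds the full text to the analyzer at once (such as the hand-built BooleanQuery above) should produce shingles, while chunk-by-chunk parsing cannot.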

Thank you (again) for your help

Peyman

On Oct 11, 2011, at 3:51 PM, Ian Lea wrote:

> Something does appear dodgy here.  Using 3.4.0 the following very
> simple code, with no custom classes
> 
> 	ShingleAnalyzerWrapper saw = new ShingleAnalyzerWrapper(LUCENE_34);
> 	QueryParser qp = new QueryParser(LUCENE_34, "t", saw);
> 	String s = "simple sentences rule";
> 	Query q = qp.parse(s);
> 	System.out.printf("%s parsed to %s\n", s, q);
> 
> produces
> 
> simple sentences rule parsed to t:simple t:sentences t:rule
> 
> Like you, I would have expected there to be some shingles in there.
> Are we both missing something?
> 
> 
> --
> Ian.
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 


Re: Shingles Filter problems

Posted by Ian Lea <ia...@gmail.com>.
Something does appear dodgy here.  Using 3.4.0 the following very
simple code, with no custom classes

	ShingleAnalyzerWrapper saw = new ShingleAnalyzerWrapper(LUCENE_34);
	QueryParser qp = new QueryParser(LUCENE_34, "t", saw);
	String s = "simple sentences rule";
	Query q = qp.parse(s);
	System.out.printf("%s parsed to %s\n", s, q);

produces

simple sentences rule parsed to t:simple t:sentences t:rule

Like you, I would have expected there to be some shingles in there.
Are we both missing something?


--
Ian.

