Posted to java-user@lucene.apache.org by Peyman Faratin <pe...@robustlinks.com> on 2011/10/09 18:11:58 UTC

ShinglesAnalyzer Question

Hi

I am trying to understand why I cannot retrieve docs that I indexed with a custom ShinglesAnalyzer. The setup is as follows:


During indexing I do the following:

		PerFieldAnalyzerWrapper wrapper = DocFieldAnalyzerWrapper.getDocFieldAnalyzerWrapper(Stopwords);	
		writer = new IndexWriter(_lucenedir,
				new IndexWriterConfig(Version.LUCENE_32,wrapper));

where DocFieldAnalyzerWrapper.getDocFieldAnalyzerWrapper returns an instance of PerFieldAnalyzerWrapper:

		public static PerFieldAnalyzerWrapper getDocFieldAnalyzerWrapper(HashSet<String> Stopwords){
			PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new KeywordAnalyzer());
			wrapper.addAnalyzer("title",new KeywordAnalyzer());
			wrapper.addAnalyzer("titleSynonyms",new KeywordAnalyzer());
			wrapper.addAnalyzer("date",new KeywordAnalyzer());
			wrapper.addAnalyzer("about",new KeywordAnalyzer());

			wrapper.addAnalyzer("titleAnalyzed",new StandardAnalyzer(Version.LUCENE_32,Stopwords));
			wrapper.addAnalyzer("content",new LimitTokenCountAnalyzer(
										new StandardAnalyzer(Version.LUCENE_32,Stopwords),
											Integer.MAX_VALUE));
			wrapper.addAnalyzer("contentForSpelling",new ShinglesAnalyzer(2,Stopwords));
			return wrapper;
		}

where the custom ShinglesAnalyzer is defined as follows: 

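	public class ShinglesAnalyzer extends Analyzer {
		private HashSet<String> Stopwords;
		private Integer shingleSize;

		// Constructor implied by the calls to new ShinglesAnalyzer(2, Stopwords)
		public ShinglesAnalyzer(Integer shingleSize, HashSet<String> Stopwords) {
			this.shingleSize = shingleSize;
			this.Stopwords = Stopwords;
		}

		public TokenStream tokenStream(String fieldName, Reader reader) {
			// Chain: StandardTokenizer -> StandardFilter -> LowerCaseFilter -> StopFilter -> ShingleFilter
			TokenStream filter = new ShingleFilter(
					new StopFilter(Version.LUCENE_32,
							new LowerCaseFilter(Version.LUCENE_32,
									new StandardFilter(Version.LUCENE_32,
											new StandardTokenizer(Version.LUCENE_32, reader))),
							Stopwords),
					shingleSize);
			return filter;
		}
	}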

I then index as follows (note that every field is set to ANALYZED; the fields that should not be tokenized are mapped to KeywordAnalyzer instead):

				doc.add(new Field("title",title,Field.Store.YES, Field.Index.ANALYZED));
				doc.add(new Field("titleAnalyzed",title,Field.Store.YES,Field.Index.ANALYZED,Field.TermVector.WITH_POSITIONS_OFFSETS));
				doc.add(new Field("titleSynonyms",pageSynonmy.toString(),Field.Store.YES, Field.Index.ANALYZED));
				doc.add(new Field("about",article.getAbout().toString(),Field.Store.YES, Field.Index.ANALYZED));
				doc.add(new Field("date", article.getDateCreated(),Field.Store.NO, Field.Index.ANALYZED));
				
				String content = article.getCleanContent();
				Field contentField = new Field("content",
						content, Field.Store.NO,
						Field.Index.ANALYZED,
						Field.TermVector.WITH_POSITIONS_OFFSETS);
				doc.add(contentField);
				
				Field contentSpellingField = new Field("contentForSpelling",
						content, Field.Store.YES,
						Field.Index.ANALYZED,
						Field.TermVector.WITH_POSITIONS_OFFSETS);
				doc.add(contentSpellingField);

Looking at the index with Luke, the field "contentForSpelling" is indexed with both unigrams and bigrams (the shingle size is set to 2).
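
For readers unfamiliar with shingles, the following is a minimal plain-Java sketch of what a shingle size of 2 produces: every unigram plus every pair of adjacent tokens. It does not call Lucene. The class name ShingleSketch and the method shingles are hypothetical names chosen for illustration, and the sketch ignores the "_" filler tokens that Lucene's ShingleFilter emits where stopwords were removed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only, not Lucene code. For each token, emits the unigram
// followed by the shingles (up to maxSize tokens long) that start at it.
class ShingleSketch {
    static List<String> shingles(List<String> tokens, int maxSize) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            StringBuilder sb = new StringBuilder(tokens.get(i));
            out.add(sb.toString()); // the unigram itself
            for (int n = 1; n < maxSize && i + n < tokens.size(); n++) {
                sb.append(' ').append(tokens.get(i + n));
                out.add(sb.toString()); // shingle of n + 1 tokens
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [season, season package, package, package dallas, dallas]
        System.out.println(shingles(Arrays.asList("season", "package", "dallas"), 2));
    }
}
```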

Then, at search time, given a query q (a sentence provided by the user), I do the following:

		  Analyzer analyzer = new ShinglesAnalyzer(2,Stopwords);
		  QueryParser parser = new QueryParser(Version.LUCENE_32, "contentForSpelling",analyzer);
		  Query query = parser.parse(q);
		  TopDocs hits = searcher.search(query);


This is the output

query: $13 for any of season package at Dallas

ShinglesAnalyzer:
    
1: [13:1->3:<NUM>] [13 _:1->15:shingle] 
2: [_ season:15->21:shingle] 
3: [season:15->21:<ALPHANUM>] [season package:15->29:shingle] 
4: [package:22->29:<ALPHANUM>] [package _:22->33:shingle] 
5: [_ dallas:33->39:shingle] 
6: [dallas:33->39:<ALPHANUM>] 

but when I print the query (query.toString()) it looks like this 

analyzed query: contentForSpelling:13 contentForSpelling:season contentForSpelling:package contentForSpelling:dallas

But the query looks wrong to me: it contains only the unigrams and none of the shingles.
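
A possible explanation, stated here as an assumption because nothing in this thread confirms it: the classic QueryParser in Lucene 3.x splits the query string on whitespace first and hands each piece to the analyzer separately, so the ShingleFilter never sees two adjacent words together. The plain-Java sketch below uses hypothetical names and makes no Lucene calls. It contrasts analyzing the whole sentence with analyzing each whitespace-separated piece on its own, using a simplified stand-in for the analyzer (lowercasing plus unigrams and bigrams, no stopword handling).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only, not Lucene code.
class PerTermAnalysisSketch {
    // Stand-in analyzer: lowercases, splits on whitespace, emits unigrams and bigrams.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<String>();
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            out.add(tokens[i]);
            if (i + 1 < tokens.length) {
                out.add(tokens[i] + " " + tokens[i + 1]);
            }
        }
        return out;
    }

    // Mimics the assumed parser behavior: split first, then analyze each piece alone.
    static List<String> analyzePerTerm(String query) {
        List<String> out = new ArrayList<String>();
        for (String piece : query.trim().split("\\s+")) {
            out.addAll(analyze(piece));
        }
        return out;
    }

    public static void main(String[] args) {
        String q = "season package Dallas";
        System.out.println(analyze(q));        // unigrams and bigrams
        System.out.println(analyzePerTerm(q)); // unigrams only
    }
}
```

If this is indeed the cause, building the query directly from the analyzer's token stream instead of going through QueryParser may be worth trying.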

thank you 

Peyman


Re: RE: merge index

Posted by janwen <to...@163.com>.
Thanks, Suman.
I tested my code again in my project. It works: users can search the new index docs immediately.
On our website, we merge index files from one server to another every 15 minutes, and we cannot stop the search service, so we cannot reopen the index directory. Thanks.
2011-10-10



janwen | China 
website : http://www.qianpin.com/




From:suman.holani
Date:2011-10-10 13:00
Subject:RE: merge index
To:java-user
Cc:

Hi janwen, 

Try reopening the index reader and creating a new instance of the searcher.


Regards, 
Suman 

-----Original Message----- 
From: janwen [mailto:tom.grade1986@163.com]  
Sent: Monday, October 10, 2011 7:34 AM 
To: java-user 
Subject: merge index 

Hi:
  After I add a new index directory to an existing directory, I cannot search the new index docs, but when I restart my Tomcat server, I can. Do I need to restart Tomcat to search the new index docs, or is there another way to make the new index docs searchable immediately?
Thanks

2011-10-10 



janwen | China  
website : http://www.qianpin.com/ 



--------------------------------------------------------------------- 
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org 
For additional commands, e-mail: java-user-help@lucene.apache.org 

RE: merge index

Posted by "suman.holani" <su...@zapak.co.in>.
Hi janwen,

Try reopening the index reader and creating a new instance of the searcher.


Regards,
Suman

-----Original Message-----
From: janwen [mailto:tom.grade1986@163.com] 
Sent: Monday, October 10, 2011 7:34 AM
To: java-user
Subject: merge index

Hi:
  After I add a new index directory to an existing directory, I cannot search the new index docs, but when I restart my Tomcat server, I can. Do I need to restart Tomcat to search the new index docs, or is there another way to make the new index docs searchable immediately?
Thanks

2011-10-10



janwen | China 
website : http://www.qianpin.com/





merge index

Posted by janwen <to...@163.com>.
Hi:
  After I add a new index directory to an existing directory, I cannot search the new index docs, but when I restart my Tomcat server, I can. Do I need to restart Tomcat to search the new index docs, or is there another way to make the new index docs searchable immediately?
Thanks

2011-10-10



janwen | China 
website : http://www.qianpin.com/