Posted to java-user@lucene.apache.org by Peyman Faratin <pe...@robustlinks.com> on 2011/10/09 18:11:58 UTC
ShinglesAnalyzer Question
Hi
I am trying to understand why I cannot retrieve docs that I indexed with a custom ShinglesAnalyzer. The setup is as follows.
During indexing I do the following:
PerFieldAnalyzerWrapper wrapper = DocFieldAnalyzerWrapper.getDocFieldAnalyzerWrapper(Stopwords);
writer = new IndexWriter(_lucenedir,
        new IndexWriterConfig(Version.LUCENE_32, wrapper));
where DocFieldAnalyzerWrapper.getDocFieldAnalyzerWrapper returns an instance of PerFieldAnalyzerWrapper:
public static PerFieldAnalyzerWrapper getDocFieldAnalyzerWrapper(HashSet<String> Stopwords) {
    PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new KeywordAnalyzer());
    wrapper.addAnalyzer("title", new KeywordAnalyzer());
    wrapper.addAnalyzer("titleSynonyms", new KeywordAnalyzer());
    wrapper.addAnalyzer("date", new KeywordAnalyzer());
    wrapper.addAnalyzer("about", new KeywordAnalyzer());
    wrapper.addAnalyzer("titleAnalyzed", new StandardAnalyzer(Version.LUCENE_32, Stopwords));
    wrapper.addAnalyzer("content", new LimitTokenCountAnalyzer(
            new StandardAnalyzer(Version.LUCENE_32, Stopwords),
            Integer.MAX_VALUE));
    wrapper.addAnalyzer("contentForSpelling", new ShinglesAnalyzer(2, Stopwords));
    return wrapper;
}
where the custom ShinglesAnalyzer is defined as follows:
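public class ShinglesAnalyzer extends Analyzer {
    private HashSet<String> Stopwords;
    private Integer shingleSize;

    // Constructor implied by the calls new ShinglesAnalyzer(2, Stopwords);
    // it was not shown in the original post, so treat it as an assumption.
    public ShinglesAnalyzer(Integer shingleSize, HashSet<String> Stopwords) {
        this.shingleSize = shingleSize;
        this.Stopwords = Stopwords;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // tokenize -> standard filter -> lowercase -> remove stopwords -> build shingles
        TokenStream filter = new ShingleFilter(
                new StopFilter(Version.LUCENE_32,
                        new LowerCaseFilter(Version.LUCENE_32,
                                new StandardFilter(Version.LUCENE_32,
                                        new StandardTokenizer(Version.LUCENE_32, reader))),
                        Stopwords),
                shingleSize);
        return filter;
    }
}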
I then index as follows (note that all fields are set to ANALYZED, because the fields that should not be tokenized are assigned a KeywordAnalyzer):
doc.add(new Field("title",title,Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("titleAnalyzed",title,Field.Store.YES,Field.Index.ANALYZED,Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.add(new Field("titleSynonyms",pageSynonmy.toString(),Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("about",article.getAbout().toString(),Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("date", article.getDateCreated(),Field.Store.NO, Field.Index.ANALYZED));
String content = article.getCleanContent();
Field contentField = new Field("content",
        content, Field.Store.NO,
        Field.Index.ANALYZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS);
doc.add(contentField);
Field contentSpellingField = new Field("contentForSpelling",
        content, Field.Store.YES,
        Field.Index.ANALYZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS);
doc.add(contentSpellingField);
Looking at the index with Luke, the field "contentForSpelling" contains both unigrams and bigrams (the shingle size is set to 2).
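To make "unigrams and bigrams" concrete, here is a minimal plain-Java sketch of what such a field ends up containing. It is a simplified model, not Lucene's ShingleFilter: it assumes a run of removed stopwords collapses into a single "_" filler, and the stopword list (for, any, of, at) is a guess inferred from the analyzer output quoted later in this message. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ShingleSketch {
    static final String FILLER = "_";

    // Returns unigrams plus size-2 shingles, with "_" standing in for removed stopwords.
    static List<String> shingles(String text, Set<String> stopwords) {
        // Lowercase and split on non-alphanumerics (a crude stand-in for StandardTokenizer).
        List<String> slots = new ArrayList<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (tok.isEmpty()) {
                continue; // split() yields an empty first element when the text starts with punctuation
            }
            if (!stopwords.contains(tok)) {
                slots.add(tok);
            } else if (slots.isEmpty() || !slots.get(slots.size() - 1).equals(FILLER)) {
                slots.add(FILLER); // assumption: a run of stopwords becomes one filler
            }
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < slots.size(); i++) {
            String cur = slots.get(i);
            if (!cur.equals(FILLER)) {
                out.add(cur); // unigram (fillers are never emitted on their own)
            }
            if (i + 1 < slots.size()) {
                out.add(cur + " " + slots.get(i + 1)); // bigram shingle
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("for", "any", "of", "at"));
        System.out.println(shingles("$13 for any of season package at Dallas", stop));
    }
}
```

Under these assumptions the sketch yields the same nine terms that the analyzer output in this post lists, from "13" and "13 _" through "_ dallas" and "dallas".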
Then at search time, given a query q (a sentence provided by the user), I do the following:
Analyzer analyzer = new ShinglesAnalyzer(2,Stopwords);
QueryParser parser = new QueryParser(Version.LUCENE_32, "contentForSpelling",analyzer);
Query query = parser.parse(q);
TopDocs hits = searcher.search(query);
This is the output
query: $13 for any of season package at Dallas
ShinglesAnalyzer:
1: [13:1->3:<NUM>] [13 _:1->15:shingle]
2: [_ season:15->21:shingle]
3: [season:15->21:<ALPHANUM>] [season package:15->29:shingle]
4: [package:22->29:<ALPHANUM>] [package _:22->33:shingle]
5: [_ dallas:33->39:shingle]
6: [dallas:33->39:<ALPHANUM>]
But when I print the parsed query (query.toString()), it looks like this:
analyzed query: contentForSpelling:13 contentForSpelling:season contentForSpelling:package contentForSpelling:dallas
The query looks wrong to me: it contains only the unigrams and none of the shingles.
thank you
Peyman
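A likely explanation, which the thread itself does not state, is that the classic QueryParser splits the query text on whitespace before it calls the analyzer, so the shingle filter only ever sees one word at a time and has nothing to pair it with. The plain-Java sketch below models that effect without Lucene; the class and method names are hypothetical, and analyze() is only a stand-in for a unigram-plus-bigram analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class QueryShingleDemo {
    // Stand-in analyzer: lowercased unigrams plus bigrams of adjacent tokens.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(tokens.get(i));
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    // Models a parser that splits on whitespace first and analyzes each chunk separately.
    static List<String> analyzePerChunk(String query) {
        List<String> out = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            out.addAll(analyze(chunk)); // each chunk is a single word, so no bigram can form
        }
        return out;
    }

    public static void main(String[] args) {
        String q = "season package dallas";
        System.out.println("whole string: " + analyze(q));
        System.out.println("per chunk:    " + analyzePerChunk(q));
    }
}
```

If this is indeed the cause, one option worth trying is to run the analyzer over the full query string yourself and build the query from the resulting tokens, rather than passing the raw sentence to the parser.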
Re: RE: merge index
Posted by janwen <to...@163.com>.
Thanks, Suman.
I tested my code again in my project and it works: users can search the new index docs immediately.
On our website we merge index files from one server to another every 15 minutes, and we cannot stop the search service, so we cannot reopen the index directory. Thanks.
2011-10-10
janwen | China
website : http://www.qianpin.com/
From:suman.holani
Date:2011-10-10 13:00
Subject:RE: merge index
To:java-user
Cc:
Hi janwen,
Try reopening the index reader and creating a new instance of the searcher.
Regards,
Suman
-----Original Message-----
From: janwen [mailto:tom.grade1986@163.com]
Sent: Monday, October 10, 2011 7:34 AM
To: java-user
Subject: merge index
Hi,
After I add a new index directory to the existing directory, I cannot search the new index docs, but after I restart my Tomcat server I can. Do I need to restart Tomcat to search the new index docs, or is there another way to make the new index docs searchable immediately?
thanks
2011-10-10
janwen | China
website : http://www.qianpin.com/
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
RE: merge index
Posted by "suman.holani" <su...@zapak.co.in>.
Hi janwen,
Try reopening the index reader and creating a new instance of the searcher.
Regards,
Suman
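The suggestion above does not require stopping the search service: the running application can swap in a fresh searcher while requests continue. Below is a minimal sketch of that swap pattern in plain Java. SearcherHolder is a hypothetical helper, not a Lucene class, and it is generic so that it runs without Lucene. In a Lucene 3.x application the type parameter would presumably be IndexSearcher, and the supplier would wrap a reopened reader (for example, one obtained from IndexReader.reopen()).

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical holder that lets request threads keep searching while the
// searcher is replaced after an index merge.
final class SearcherHolder<S> {
    private final AtomicReference<S> current;

    SearcherHolder(S initial) {
        this.current = new AtomicReference<>(initial);
    }

    // Called by request threads for every search.
    S get() {
        return current.get();
    }

    // Called after the index changes, e.g. from the job that merges every 15 minutes.
    // Returns the previous searcher so the caller can close it once in-flight
    // searches have finished (that bookkeeping is not handled in this sketch).
    S refresh(Supplier<S> openFresh) {
        S fresh = openFresh.get();
        return current.getAndSet(fresh);
    }

    public static void main(String[] args) {
        SearcherHolder<String> holder = new SearcherHolder<>("searcher-v1");
        String old = holder.refresh(() -> "searcher-v2");
        System.out.println("old=" + old + ", current=" + holder.get());
    }
}
```

The design choice is that readers never block: get() is a single atomic read, and the new searcher becomes visible to all threads as soon as refresh() returns.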
merge index
Posted by janwen <to...@163.com>.
Hi,
After I add a new index directory to the existing directory, I cannot search the new index docs, but after I restart my Tomcat server I can. Do I need to restart Tomcat to search the new index docs, or is there another way to make the new index docs searchable immediately?
thanks
2011-10-10
janwen | China
website : http://www.qianpin.com/