You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Maciej Bednarz <be...@googlemail.com> on 2010/07/24 08:55:42 UTC
Searching for user agents
Hi,
I am using apache lucene 3.0.2 and searching for an optimal analyzer to search for best matching http user agents. Imagine, that we store following http user agents in a field:
Lynx/2.8.4rel.1 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.6c
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
Mozilla/4.77 [en] (X11; I; IRIX;64 6.5 IP30)
Now as search query a best matching agent for the following input should be returned:
Mozilla/4.1 (compatible; MSIE 6.0; Windows NT 5.0)
From my natural view the Mozilla/4.0 is the best fit result. What analyzer do I need to use to store and find it? The text not natural, so I need some kind of n gram search (I guess). My initial setup does not return it at all:
String agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)";
final static Analyzer analyzer = new NGramAnalyzer(2, 4);
final Document doc = new Document();
doc.add(new Field("agent", agent, Field.Store.YES, Field.Index.ANALYZED));
...
final QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
final Query query = parser.parse("Mozilla/4.1 (compatible; MSIE 6.0; Windows NT 5.0)");
final TopScoreDocCollector collector = TopScoreDocCollector.create(50, true);
searcher.search(query, collector);
NGramAnalyzer is defined as:
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
public class NGramAnalyzer extends Analyzer {
private final int minGram;
private final int maxGram;
public NGramAnalyzer(final int minGram, final int maxGram) {
this.minGram = minGram;
this.maxGram = maxGram;
}
@Override
public TokenStream tokenStream(final String fieldName, final Reader reader) {
return new NGramTokenizer(reader, minGram, maxGram);
}
}
Thank you very much for a solution or any other approach.
Maciej
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org