Posted to by Maciej Bednarz <> on 2010/07/24 08:55:42 UTC

Searching for user agents


I am using Apache Lucene 3.0.2 and am looking for a suitable analyzer for finding the best-matching HTTP user agent. Suppose the following HTTP user agents are stored in a field:

Lynx/2.8.4rel.1 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.6c
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
Mozilla/4.77 [en] (X11; I; IRIX;64 6.5 IP30)

Now, for the following input used as the search query, the best-matching stored agent should be returned:

Mozilla/4.1 (compatible; MSIE 6.0; Windows NT 5.0)

To my eye, the Mozilla/4.0 entry is the best fit. Which analyzer do I need in order to index and find it? The text is not natural language, so I guess I need some kind of n-gram search. My initial setup does not return it at all:

String agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)";
final static Analyzer analyzer = new NGramAnalyzer(2, 4);
final Document doc = new Document();
doc.add(new Field("agent", agent, Field.Store.YES, Field.Index.ANALYZED));
final QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
final Query query = parser.parse("Mozilla/4.1 (compatible; MSIE 6.0; Windows NT 5.0)");
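final TopScoreDocCollector collector = TopScoreDocCollector.create(50, true);
// 'searcher' is assumed to be an IndexSearcher over the index that holds doc
searcher.search(query, collector);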

NGramAnalyzer is defined as:


import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;

public class NGramAnalyzer extends Analyzer {

	// Smallest and largest n-gram sizes to emit
	private final int minGram;
	private final int maxGram;

	public NGramAnalyzer(final int minGram, final int maxGram) {
		this.minGram = minGram;
		this.maxGram = maxGram;
	}

	// Splits the whole field value into character n-grams
	@Override
	public TokenStream tokenStream(final String fieldName, final Reader reader) {
		return new NGramTokenizer(reader, minGram, maxGram);
	}
}
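
To make the n-gram idea concrete, here is a minimal plain-Java sketch, independent of Lucene, of what "best match" could mean for the agents above. It collects character n-grams of length 2 to 4 and ranks the stored agents by how many n-grams they share with the query. The class UserAgentNGramMatch and its methods are hypothetical names, and Jaccard similarity is an assumed scoring choice for illustration; it is not what Lucene computes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: ranks stored user agents by
// character n-gram overlap with a query string.
class UserAgentNGramMatch {

    // Collects every distinct substring of length minGram..maxGram (inclusive).
    static Set<String> nGrams(String text, int minGram, int maxGram) {
        Set<String> grams = new HashSet<String>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                grams.add(text.substring(i, i + n));
            }
        }
        return grams;
    }

    // Jaccard similarity (assumed scoring): shared n-grams divided by all
    // distinct n-grams of both strings, giving a value between 0 and 1.
    static double similarity(String a, String b, int minGram, int maxGram) {
        Set<String> gramsA = nGrams(a, minGram, maxGram);
        Set<String> gramsB = nGrams(b, minGram, maxGram);
        Set<String> union = new HashSet<String>(gramsA);
        union.addAll(gramsB);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<String>(gramsA);
        common.retainAll(gramsB);
        return (double) common.size() / union.size();
    }

    // Returns the candidate with the highest similarity to the query,
    // or null if the candidate list is empty.
    static String bestMatch(String query, List<String> candidates,
                            int minGram, int maxGram) {
        String best = null;
        double bestScore = -1.0;
        for (String candidate : candidates) {
            double score = similarity(query, candidate, minGram, maxGram);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> agents = Arrays.asList(
            "Lynx/2.8.4rel.1 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.6c",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)",
            "Mozilla/4.77 [en] (X11; I; IRIX;64 6.5 IP30)");
        String query = "Mozilla/4.1 (compatible; MSIE 6.0; Windows NT 5.0)";
        System.out.println(bestMatch(query, agents, 2, 4));
    }
}
```

With the three agents listed earlier, this sketch picks the Mozilla/4.0 entry, since it differs from the query in a single character and so shares almost all of its n-grams.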

Thank you very much for a solution or any other approach.
