Posted to java-user@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2010/01/04 06:21:56 UTC
Re: NGramTokenizer stops working after about 1000 terms
This actually rings a bell for me... have a look at Lucene's JIRA, I think this was reported as a bug once and perhaps has been fixed.
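If memory serves, the 2.x/3.x NGramTokenizer reads its input with a single read() into a fixed-size char buffer (1024 chars, I believe), which would explain the cutoff you see. Here is a minimal, Lucene-free sketch of that suspected behavior; the buffer size and the class/method names are my own assumptions, so check the tokenizer source to confirm:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;

public class SingleReadLimit {
    // One read() into a fixed-size buffer: everything past the buffer is
    // never seen. This mirrors what I believe the old NGramTokenizer does
    // with a 1024-char buffer (an assumption -- verify against the source).
    public static int firstRead(Reader in, int bufSize) throws IOException {
        char[] buf = new char[bufSize];
        return in.read(buf);
    }

    public static void main(String[] args) throws IOException {
        char[] big = new char[5000];
        Arrays.fill(big, 'x');
        // Only 1024 of the 5000 chars come back from the single read;
        // the rest are silently dropped.
        System.out.println(firstRead(new StringReader(new String(big)), 1024));
    }
}
```

That single truncated read would leave at most (1024 - ngram_length + 1) grams, which lines up with the numbers you report.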
Note that Lucene in Action 2 has a case study that talks about searching source code. You may find that study interesting.
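Until a fix lands (or as a local patch), a tokenizer that pulls characters off the Reader one at a time with a sliding window has no such cap. A rough, Lucene-free sketch of the idea -- class and method names are mine, and a real fix would of course go into the Tokenizer itself:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class StreamingNGrams {
    // Emits every n-gram by sliding a window over the Reader one char at
    // a time, so input length is unbounded -- no fixed buffer involved.
    public static List<String> ngrams(Reader in, int n) throws IOException {
        List<String> out = new ArrayList<String>();
        StringBuilder window = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            window.append((char) c);
            if (window.length() == n) {
                out.add(window.toString());
                window.deleteCharAt(0); // slide forward by one char
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5000; i++) sb.append((char) ('a' + i % 26));
        // 5000 chars -> 4996 5-grams; well past any 1024-term cutoff.
        System.out.println(ngrams(new StringReader(sb.toString()), 5).size());
    }
}
```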
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
----- Original Message ----
> From: Stefan Trcek <wz...@abas.de>
> To: java-user@lucene.apache.org
> Sent: Mon, December 14, 2009 9:39:34 AM
> Subject: NGramTokenizer stops working after about 1000 terms
>
> Hello
>
> For a source code (git repo) search engine I chose to use an ngram
> analyzer for substring search (something like "git blame").
>
> This worked fine, except it didn't find some strings. I tracked it
> down to the analyzer: once the ngram analyzer had yielded about 1000
> terms, it stopped yielding more -- the limit seems to be at most
> (1024 - ngram_length) terms. When I use StandardAnalyzer it works as
> expected. Is this a bug, or did I miss a limit?
>
> Tested with lucene-2.9.1 and 3.0; this is the core routine I use:
>
> public static class NGramAnalyzer5 extends Analyzer {
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>         return new NGramTokenizer(reader, 5, 5);
>     }
> }
>
> public static String[] analyzeString(Analyzer analyzer,
>         String fieldName, String string) throws IOException {
>     List<String> output = new ArrayList<String>();
>     TokenStream tokenStream = analyzer.tokenStream(fieldName,
>             new StringReader(string));
>     TermAttribute termAtt = (TermAttribute) tokenStream.addAttribute(
>             TermAttribute.class);
>     tokenStream.reset();
>     while (tokenStream.incrementToken()) {
>         output.add(termAtt.term());
>     }
>     tokenStream.end();
>     tokenStream.close();
>     return output.toArray(new String[0]);
> }
>
> The complete example is attached. "in.txt" must be in "." and is plain
> ASCII.
>
> Stefan
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org