You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by gaz77 <ga...@bit10.net> on 2008/08/28 16:52:59 UTC

Confused with NGRAM results

Hi,

I'd appreciate if someone could explain the results I'm getting.

I've written a simple custom analyzer that applies the NGramTokenFilter to
the token stream during indexing. It's never applied during searching. The
purpose of this is to match sub-words.

Without the ngram filter, if I searched on the word 'postcode' it returns 2
documents. If I searched on 'code' it returns 6 documents (with no overlap
on the postcode results).

If I apply the ngram filter with min 1 and max 10, searching 'postcode'
returns the same 2 docs, while searching 'code' returns 9 docs. This sort-of
feels right. 

The problem comes when I set the min ngram size to 3, and the max to 5.
Searching 'postcode' returns no results (as expected), but searching 'code'
only returns 2 docs (the 2 normally returned by a 'postcode' search).

This last result for 'code' just doesn't seem correct - it should be
returning at least the 6 docs from the original search.

I'd really appreciate some advice on what is going on with the ngram filter.

Thanks
-- 
View this message in context: http://www.nabble.com/Confused-with-NGRAM-results-tp19202310p19202310.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Confused with NGRAM results

Posted by gaz77 <ga...@bit10.net>.

Thanks for the pointer.

I've gone into this in some depth, using the AnalyzerUtils class from the
lucene in action book.

It seems that the NGramTokenFilter is only processing part of the string
that goes in. It stops tokenising the words part way through. That's why the
documents weren't found in results.

I've had a look at the source code, and I think it's because the next()
function returns null when it hits a token smaller than the min ngram size.
For example, if I set the minimum to 3, then a 2-character token will cause
it to return null.

I'm not sure if this is by design or a bug. either way, at least I know
what's causing it now.

Cheers



-- 
View this message in context: http://www.nabble.com/Confused-with-NGRAM-results-tp19202310p19210665.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Confused with NGRAM results

Posted by gaz77 <ga...@bit10.net>.

Hi Otis,

The original message text is:



> Hi,
> 
> I'd appreciate if someone could explain the results I'm getting.
> 
> I've written a simple custom analyzer that applies the NGramTokenFilter to
> the token stream during indexing. It's never applied during searching. The
> purpose of this is to match sub-words.
> 
> Without the ngram filter, if I searched on the word 'postcode' it returns
> 2 documents. If I searched on 'code' it returns 6 documents (with no
> overlap on the postcode results).
> 
> If I apply the ngram filter with min 1 and max 10, searching 'postcode'
> returns the same 2 docs, while searching 'code' returns 9 docs. This
> sort-of feels right.
> 
> The problem comes when I set the min ngram size to 3, and the max to 5.
> Searching 'postcode' returns no results (as expected), but searching
> 'code' only returns 2 docs (the 2 normally returned by a 'postcode'
> search).
> 
> This last result for 'code' just doesn't seem correct - it should be
> returning at least the 6 docs from the original search.
> 
> I'd really appreciate some advice on what is going on with the ngram
> filter.
> 
> Thanks
> 



Otis Gospodnetic wrote:
> 
> This actually sounds bugish to me, but you removed the text from your
> original email, so I don't know what context this was in.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> ----- Original Message ----
>> From: gaz77 <ga...@bit10.net>
>> To: java-user@lucene.apache.org
>> Sent: Friday, August 29, 2008 12:50:46 AM
>> Subject: Re: Confused with NGRAM results
>> 
>> 
>> Thanks for the pointer.
>> 
>> I've gone into this in some depth, using the AnalyzerUtils class from the
>> lucene in action book.
>> 
>> It seems that the NGramTokenFilter is only processing part of the string
>> that goes in. It stops tokenising the words part way through. That's why
>> the
>> documents weren't found in results.
>> 
>> I've had a look at the source code, and I think it's because the next()
>> function returns null when it hits a token smaller than the min ngram
>> size.
>> For example, if I set the minimum to 3, then a 2-character token will
>> cause
>> it to return null.
>> 
>> I'm not sure if this is by design or a bug. either way, at least I know
>> what's causing it now.
>> 
>> Cheers
>> 
>> 
>> 
>> -- 
>> View this message in context: 
>> http://www.nabble.com/Confused-with-NGRAM-results-tp19202310p19210665.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Confused-with-NGRAM-results-tp19202310p19253386.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Confused with NGRAM results

Posted by Steven A Rowe <sa...@syr.edu>.

Hi gaz77,

Here's a good place to start:

<http://wiki.apache.org/jakarta-lucene/AnalysisParalysis>

Steve

On 08/28/2008 at 10:52 AM, gaz77 wrote:
> 
> Hi,
> 
> I'd appreciate if someone could explain the results I'm getting.
> 
> I've written a simple custom analyzer that applies the
> NGramTokenFilter to the token stream during indexing. It's never
> applied during searching. The purpose of this is to match sub-words.
> 
> Without the ngram filter, if I searched on the word 'postcode' it
> returns 2 documents. If I searched on 'code' it returns 6 documents
> (with no overlap on the postcode results).
> 
> If I apply the ngram filter with min 1 and max 10, searching 'postcode'
> returns the same 2 docs, while searching 'code' returns 9 docs. This
> sort-of feels right.
> 
> The problem comes when I set the min ngram size to 3, and the
> max to 5. Searching 'postcode' returns no results (as expected), but
> searching 'code' only returns 2 docs (the 2 normally returned by a
> 'postcode' search).
> 
> This last result for 'code' just doesn't seem correct - it should be
> returning at least the 6 docs from the original search.
> 
> I'd really appreciate some advice on what is going on with
> the ngram filter.
> 
> Thanks
> -- View this message in context: http://www.nabble.com/Confused-with-NGRAM-results-tp19202310p19202310.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org