Posted to java-user@lucene.apache.org by Christian Reuschling <ch...@gmail.com> on 2013/11/13 18:04:30 UTC

FuzzySuggester EXACT_FIRST criteria

We started to implement named entity recognition on top of AnalyzingSuggester, which offers
great support for synonyms, stopwords, etc.
For this, we slightly modified AnalyzingSuggester.lookup() to return only the exactFirst hits
(running the exactFirst code block only, and skipping the 'sameSurfaceForm' check and break, so
that we get the synonym hits too).

This works pretty well, and our next step would be to add some fuzziness to tolerate spelling
mistakes. For this, the idea was to do exactly the same, but with FuzzySuggester instead.

Now we have the problem that 'EXACT_FIRST' in FuzzySuggester does not take the fuzziness into
account: terms that match the query only within the edit distance are considered 'not exact',
which means we get the same exact hits as with AnalyzingSuggester.

query: "screen"
misspelled query: "screan"
dictionary: "screen", "screensaver"

AnalyzingSuggester hits: screen, screensaver
AnalyzingSuggester hits on misspelled query: <empty>
AnalyzingSuggester EXACT_FIRST hits: screen
AnalyzingSuggester EXACT_FIRST hits on misspelled query: <empty>

FuzzySuggester hits: screen, screensaver
FuzzySuggester hits on misspelled query: screen, screensaver
FuzzySuggester EXACT_FIRST hits: screen
FuzzySuggester EXACT_FIRST hits on misspelled query: <empty> => TARGET: screen

Is there a way to make this distinction? I see that the 'exact' criterion relies on an FST
property, 'END_BYTE arc leaving'. Maybe these arcs can be set differently when building the
Levenshtein automata? I have no clue.
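
To make the TARGET row above concrete, here is a minimal plain-Java sketch of the two notions
being distinguished. It does not use Lucene at all: the class and method names (FuzzyExactSketch,
fuzzyExact, fuzzyPrefix) are made up for illustration, and a simple Levenshtein distance over
characters stands in for the suggester's automaton over analyzed bytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FuzzyExactSketch {

    /** Classic dynamic-programming Levenshtein distance (insert, delete, substitute). */
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Hypothetical "fuzzy exact": the whole term is within maxEdits of the query. */
    public static List<String> fuzzyExact(List<String> dictionary, String query, int maxEdits) {
        List<String> hits = new ArrayList<>();
        for (String term : dictionary) {
            if (editDistance(query, term) <= maxEdits) {
                hits.add(term);
            }
        }
        return hits;
    }

    /** Hypothetical "fuzzy prefix": some prefix of the term is within maxEdits of the query. */
    public static List<String> fuzzyPrefix(List<String> dictionary, String query, int maxEdits) {
        List<String> hits = new ArrayList<>();
        for (String term : dictionary) {
            for (int len = 0; len <= term.length(); len++) {
                if (editDistance(query, term.substring(0, len)) <= maxEdits) {
                    hits.add(term);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("screen", "screensaver");
        System.out.println(fuzzyExact(dictionary, "screan", 1));  // prints [screen]
        System.out.println(fuzzyPrefix(dictionary, "screan", 1)); // prints [screen, screensaver]
    }
}
```

In this model, 'fuzzyExact' corresponds to the wanted "EXACT_FIRST hits on misspelled query"
and 'fuzzyPrefix' to the ordinary FuzzySuggester hits.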

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: FuzzySuggester EXACT_FIRST criteria

Posted by Christian Reuschling <ch...@gmail.com>.

I created a test class for rapid testing that should be runnable out of the box with the
LUCENE-SUGGEST-5.0-SNAPSHOT (maven) dependency (see attachment).

Because I can't subclass the final FuzzySuggester, I subclassed AnalyzingSuggester and delegated
the three method calls 'convertAutomaton', 'getFullPrefixPaths' and 'getTokenStreamToAutomaton',
plus the constructor, to an internal FuzzySuggester member.

Then I overrode AnalyzingSuggester.lookup(..) by copying it from AnalyzingSuggester, sadly invoking
two methods and reading some fields via the reflection API because they are declared private
(the alternative would be to copy everything). Everything worked as expected so far.

I added our slight modification: moving the getFullPrefixPaths invocation up to the first prefix
path creation.

The main class checks a simple scenario with KeywordAnalyzer, a three-term dictionary and some
query term variations.
Here is the output. Sadly, some of the results are unexpected (for me):

Dictionary: [screen, screensaver, mouse]

query: 'screan' - exact result as expected (correct), but not in every case! This is the case
where one letter is changed that is neither the first nor the last one.
Exact results:
  screen/1
All results: - double entry of 'screen'?
  screen/1
  screen/1
  screensaver/1

query: 'screew' - last letter changed: exact result empty (incorrect).
Exact results:
All results:
  screen/1
  screensaver/1

query: 'wcreen' - first letter changed: nothing found at all.
Exact results:
All results:

query: 'scree' - last letter removed.
Exact results:
All results:
  screen/1
  screensaver/1

query: 'scren' - 5th letter removed. Same as with last removed letter.
Exact results:
All results:
  screen/1
  screensaver/1

query: 'sreen' - 2nd letter removed. Why different?
Exact results:
  screen/1
All results: - double entry of 'screen'?
  screen/1
  screen/1
  screensaver/1

query: 'screen' - correct query: screen not found at all?
Exact results:
All results:
  screensaver/1
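
One possible explanation for the empty 'wcreen' result, assuming the delegate FuzzySuggester was
created with its default settings: FuzzySuggester keeps a short non-fuzzy prefix
(DEFAULT_NON_FUZZY_PREFIX is 1), so the first character of the query has to match literally
and only the rest may differ within the edit distance. The plain-Java sketch below models just
that rule. The class NonFuzzyPrefixSketch and its methods are hypothetical, it works on
characters rather than analyzed bytes, and it ignores the minimum fuzzy length and transpositions.

```java
public class NonFuzzyPrefixSketch {

    /** Classic dynamic-programming Levenshtein distance (insert, delete, substitute). */
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * True if the term is a fuzzy prefix hit for the query: the first nonFuzzyPrefix
     * characters must match literally, the remainder of the query must be within
     * maxEdits of some prefix of the remainder of the term.
     */
    public static boolean matches(String term, String query, int nonFuzzyPrefix, int maxEdits) {
        if (query.length() <= nonFuzzyPrefix) {
            return term.startsWith(query); // too short to be fuzzy at all
        }
        String head = query.substring(0, nonFuzzyPrefix);
        if (!term.startsWith(head)) {
            return false; // an edit inside the non-fuzzy prefix is never tolerated
        }
        String queryTail = query.substring(nonFuzzyPrefix);
        String termTail = term.substring(nonFuzzyPrefix);
        for (int len = 0; len <= termTail.length(); len++) {
            if (editDistance(queryTail, termTail.substring(0, len)) <= maxEdits) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matches("screen", "wcreen", 1, 1)); // prints false
        System.out.println(matches("screen", "wcreen", 0, 1)); // prints true
        System.out.println(matches("screen", "screan", 1, 1)); // prints true
    }
}
```

If this is indeed the cause, constructing the FuzzySuggester with a non-fuzzy prefix of 0 should
make 'wcreen' match; the sketch says nothing about the other unexpected results.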

Now my Latin is at an end (as we say in Germany ;) ). I don't know how to proceed further, as
the deeper code becomes very complex.

Thanks a lot!

Christian Reuschling

On 15.11.2013 18:49, Michael McCandless wrote:
> Hmm, I'm not sure offhand why that change gives you no results.
> 
> The fullPrefixPaths should have been a super-set of the original
> prefix paths, since the LevA just adds further paths.
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com

Re: FuzzySuggester EXACT_FIRST criteria

Posted by Michael McCandless <lu...@mikemccandless.com>.

Hmm, I'm not sure offhand why that change gives you no results.

The fullPrefixPaths should have been a super-set of the original
prefix paths, since the LevA just adds further paths.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Nov 14, 2013 at 2:43 PM, Christian Reuschling
<ch...@gmail.com> wrote:
> I tried it by changing the first prefixPath initialization to
>
> List<FSTUtil.Path<Pair<Long,BytesRef>>> prefixPaths =
>     FSTUtil.intersectPrefixPaths(convertAutomaton(lookupAutomaton), fst);
> prefixPaths = getFullPrefixPaths(prefixPaths, lookupAutomaton, fst);
>
> inside AnalyzingSuggester.lookup(..). (simply copied the line from below)
>
> Sadly, FuzzySuggester now gives no hits at all, even with a correct spelled query.
>
> Correct spelled query:
> prefixPaths size == 1
> returns null: fst.findTargetArc(END_BYTE, path.fstNode, scratchArc, bytesReader)
>   (without getFullPrefixPath: non-null)
>
> Query within edit distance - the same:
> prefixPaths size == 1   (without getFullPrefixPath: 0)
> returns null: fst.findTargetArc(END_BYTE, path.fstNode, scratchArc, bytesReader)
>
> Query outside of edit distance:
> prefixPaths size = 0
>
> Seems like the fuzziness is there, but getFullPrefixPaths kicks all END_BYTEs ?

Re: FuzzySuggester EXACT_FIRST criteria

Posted by Christian Reuschling <ch...@gmail.com>.

I tried it by changing the first prefixPaths initialization to

List<FSTUtil.Path<Pair<Long,BytesRef>>> prefixPaths =
    FSTUtil.intersectPrefixPaths(convertAutomaton(lookupAutomaton), fst);
prefixPaths = getFullPrefixPaths(prefixPaths, lookupAutomaton, fst);

inside AnalyzingSuggester.lookup(..) (I simply copied the line from further below).

Sadly, FuzzySuggester now gives no hits at all, even with a correctly spelled query.

Correctly spelled query:
prefixPaths size == 1
returns null: fst.findTargetArc(END_BYTE, path.fstNode, scratchArc, bytesReader)
  (without getFullPrefixPaths: non-null)

Query within edit distance - the same:
prefixPaths size == 1   (without getFullPrefixPaths: 0)
returns null: fst.findTargetArc(END_BYTE, path.fstNode, scratchArc, bytesReader)

Query outside of edit distance:
prefixPaths size == 0

It seems the fuzziness is there, but getFullPrefixPaths drops all END_BYTE arcs?
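
As a rough mental model of the 'END_BYTE arc leaving' check, the sketch below uses a plain
character trie in place of the FST. Everything in it (EndMarkerSketch, Node, walk, isExact, and
the END marker value) is hypothetical and only illustrates the idea: a query counts as exact
when the node reached by following it has an end-marker arc. It does not reproduce what
getFullPrefixPaths actually returns.

```java
import java.util.HashMap;
import java.util.Map;

public class EndMarkerSketch {

    // Hypothetical stand-in for the suggester's END_BYTE label.
    public static final char END = '\0';

    /** A trie node; each outgoing arc is labelled with one character. */
    public static final class Node {
        final Map<Character, Node> arcs = new HashMap<>();
    }

    private final Node root = new Node();

    /** Adds a term followed by the end marker, so complete terms are distinguishable. */
    public void add(String term) {
        Node node = root;
        for (char c : (term + END).toCharArray()) {
            node = node.arcs.computeIfAbsent(c, k -> new Node());
        }
    }

    /** Follows the query from the root; returns null if no such path exists. */
    public Node walk(String query) {
        Node node = root;
        for (char c : query.toCharArray()) {
            node = node.arcs.get(c);
            if (node == null) {
                return null;
            }
        }
        return node;
    }

    /** "Exact" means: the node reached by the query has an END arc leaving it. */
    public boolean isExact(String query) {
        Node node = walk(query);
        return node != null && node.arcs.containsKey(END);
    }

    public static void main(String[] args) {
        EndMarkerSketch trie = new EndMarkerSketch();
        trie.add("screen");
        trie.add("screensaver");
        System.out.println(trie.isExact("screen")); // prints true
        System.out.println(trie.isExact("scree"));  // prints false
        System.out.println(trie.isExact("screan")); // prints false
    }
}
```

In this model the end-marker lookup only succeeds on the node where a stored term ends. If a
path's node points anywhere else, the check fails even though the term is in the dictionary,
which would look like the null results above.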



On 14.11.2013 17:05, Michael McCandless wrote:
> It seems like the problem is that AnalyzingSuggester checks for exactFirst before calling
> .getFullPrefixPaths (which, in FuzzySuggester subclass, applies the fuzziness)?
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com

Re: FuzzySuggester EXACT_FIRST criteria

Posted by Michael McCandless <lu...@mikemccandless.com>.

It seems like the problem is that AnalyzingSuggester checks for
exactFirst before calling .getFullPrefixPaths (which, in
FuzzySuggester subclass, applies the fuzziness)?

Mike McCandless

http://blog.mikemccandless.com
