Posted to java-user@lucene.apache.org by Cam Bazz <ca...@gmail.com> on 2008/02/13 17:54:02 UTC

matching products with suggest feature

Hello;

I am trying to make a product matcher based on Lucene's n-gram based suggest
feature. I made some changes so that, instead of giving the speller a
dictionary, I feed it a List<String>.

For example, let's say I have
"HP NC4400 EY605EA CORE 2 DUO T5600 1.83GHz/512MB/80GB/12.1'' NOTEBOOK"
and I index it with the speller using an n-gram approach.

It works quite well when the user submits something similar through the
suggest feature. By similar I mean that the string length is roughly equal
and a word or two might be mistyped or even missing. In those cases Lucene
finds it.
However, when the user submits the same product with a much shorter or much
longer string, for example "HP NC4400 EY605EA" or "HP NC4400 EY605EA CORE 2
DUO T5600 1.83GHz/512MB/80GB/12.1'' NOTEBOOK WITH WINDOWS XP AND GIFT
MOUSE", the suggester won't work.

I am not sure about the n-gram approach any more.

Any ideas, recommendations, or help would be greatly appreciated.

Best Regards,
C.B.

Re: matching products with suggest feature

Posted by Shai Erera <se...@gmail.com>.
If it adds the clauses as Occur.SHOULD, it means they should appear, but
do not have to appear.
Looking at suggestSimilar, it looks like it computes the edit distance
between the requested word and each suggestion. If the score is lower than
the minimum score, it may skip the word.
Could you try indexing a document with the word "abcde" and requesting
suggestions for "abce"? It may be, as in your example, that if the
requested word is too different from the actual word (for example, if their
lengths are very different), it will fail to achieve what you need.
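
To make the length effect concrete, here is a minimal sketch in plain Java. It assumes the score is a Levenshtein distance normalized by the length of the longer string, and that candidates scoring below a cutoff are skipped. The class name EditDistanceCutoff, the 0.5 cutoff, and the normalization are illustrative assumptions, not taken from the spellchecker source.

```java
import java.util.List;

/** Sketch: normalized edit-distance similarity with a minimum-score cutoff. */
final class EditDistanceCutoff {

    /** Classic Levenshtein distance, computed with two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]: 1 minus distance divided by the longer length (assumed normalization). */
    static double similarity(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(a, b) / longer;
    }

    public static void main(String[] args) {
        double minScore = 0.5; // hypothetical cutoff, for illustration only
        String indexed =
            "HP NC4400 EY605EA CORE 2 DUO T5600 1.83GHz/512MB/80GB/12.1'' NOTEBOOK";
        List<String> queries = List.of(
            "HP NC4400 EY605EA",                                   // much shorter
            "HP NC4400 EY605EA CORE 2 DUO T5600 1.83GHz NOTEBOOK"  // a few parts missing
        );
        for (String q : queries) {
            double s = similarity(q, indexed);
            System.out.printf("score=%.2f %s : %s%n",
                s, s >= minScore ? "kept" : "skipped", q);
        }
    }
}
```

Under this assumed normalization, a query that is a short prefix of the indexed string scores low even though every character of it is correct, which is consistent with the behaviour described in the original question.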

On Thu, Feb 14, 2008 at 12:25 PM, Cam Bazz <ca...@gmail.com> wrote:

> Hello Shai,
>
> That's right, Speller is in contrib. It is named spellchecker. Basically,
> it is a special index that stores the words as n-grams.
> I looked at the code to see how it queries the index, and basically it
> makes n-grams and adds each n-gram to a boolean query.
>
> Here is how it adds to the boolean query. I could not find out whether it
> is AND or OR.
>
> Best.
>
>   // Adds a term clause on field `name` with the given boost.
>   private static void add(BooleanQuery q, String name, String value,
>       float boost) {
>     Query tq = new TermQuery(new Term(name, value));
>     tq.setBoost(boost);
>     q.add(new BooleanClause(tq, BooleanClause.Occur.SHOULD));
>   }
>
>   // Adds a term clause on field `name` without a boost.
>   private static void add(BooleanQuery q, String name, String value) {
>     q.add(new BooleanClause(new TermQuery(new Term(name, value)),
>         BooleanClause.Occur.SHOULD));
>   }
>



-- 
Regards,

Shai Erera

Re: matching products with suggest feature

Posted by Cam Bazz <ca...@gmail.com>.
Hello Shai,

That's right, Speller is in contrib. It is named spellchecker. Basically,
it is a special index that stores the words as n-grams.
I looked at the code to see how it queries the index, and basically it
makes n-grams and adds each n-gram to a boolean query.

Here is how it adds to the boolean query. I could not find out whether it is
AND or OR.

Best.

  // Adds a term clause on field `name` with the given boost.
  private static void add(BooleanQuery q, String name, String value,
      float boost) {
    Query tq = new TermQuery(new Term(name, value));
    tq.setBoost(boost);
    q.add(new BooleanClause(tq, BooleanClause.Occur.SHOULD));
  }

  // Adds a term clause on field `name` without a boost.
  private static void add(BooleanQuery q, String name, String value) {
    q.add(new BooleanClause(new TermQuery(new Term(name, value)),
        BooleanClause.Occur.SHOULD));
  }

On Thu, Feb 14, 2008 at 8:44 AM, Shai Erera <se...@gmail.com> wrote:

> Is this Speller class a Lucene class? I didn't find it in the main code
> stream; maybe it's part of contrib?
>
> Anyway, it still depends on how it is implemented (OR or AND). For example,
> say someone indexed a document with the word "abcde" and the index keeps
> the n-grams "abc", "bcd" and "cde". Then somebody types in "abc". What
> would the speller suggest? What would it suggest for "abce"?
> If it works in OR mode, I assume it would suggest "abcde" for both, as
> "abc" appears in both. But if it works in AND mode, then for the first it
> will suggest "abcde", but for the second it won't, because the n-grams
> produced are "abc" and "bce", and "bce" does not appear in "abcde".
>
> Am I right? If not, can you elaborate more on the Speller class you use?
>

Re: matching products with suggest feature

Posted by Shai Erera <se...@gmail.com>.
Is this Speller class a Lucene class? I didn't find it in the main code
stream; maybe it's part of contrib?

Anyway, it still depends on how it is implemented (OR or AND). For example,
say someone indexed a document with the word "abcde" and the index keeps the
n-grams "abc", "bcd" and "cde". Then somebody types in "abc". What would the
speller suggest? What would it suggest for "abce"?
If it works in OR mode, I assume it would suggest "abcde" for both, as
"abc" appears in both. But if it works in AND mode, then for the first it
will suggest "abcde", but for the second it won't, because the n-grams
produced are "abc" and "bce", and "bce" does not appear in "abcde".
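
The OR/AND distinction above can be sketched in plain Java without any Lucene classes. The class GramMatchModes and its methods are hypothetical names for illustration; the sketch only models "at least one gram matches" versus "every gram matches" on 3-grams and says nothing about how the spellchecker scores results.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Sketch: OR-style versus AND-style matching on character n-grams. */
final class GramMatchModes {

    /** Distinct character n-grams of a word, in order of first appearance. */
    static Set<String> grams(String word, int n) {
        Set<String> out = new LinkedHashSet<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    /** OR mode: at least one query gram occurs in the indexed word. */
    static boolean matchesAny(String query, String indexed, int n) {
        Set<String> indexedGrams = grams(indexed, n);
        return grams(query, n).stream().anyMatch(indexedGrams::contains);
    }

    /** AND mode: the query has grams and every one occurs in the indexed word. */
    static boolean matchesAll(String query, String indexed, int n) {
        Set<String> queryGrams = grams(query, n);
        return !queryGrams.isEmpty() && grams(indexed, n).containsAll(queryGrams);
    }

    public static void main(String[] args) {
        String indexed = "abcde";
        for (String query : new String[] {"abc", "abce"}) {
            System.out.println(query
                + " OR=" + matchesAny(query, indexed, 3)
                + " AND=" + matchesAll(query, indexed, 3));
        }
    }
}
```

For "abc" both modes match, while for "abce" only the OR mode matches, because the gram "bce" is not among the grams of "abcde".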

Am I right? If not, can you elaborate more on the Speller class you use?

On Wed, Feb 13, 2008 at 8:19 PM, Cam Bazz <ca...@gmail.com> wrote:

> Hello Shai,
>
> The class that does the matching is Speller.
> It is not query based; rather, there is a method called
> suggestSimilar(String word, int numSug), where numSug is the number of
> suggestions. The words are kept in the index as n-grams. For example,
> abcde is kept as abc bcd cde.
> So this is not a normal query like the ones we all know.
>
> Best regards,
> C.B.
>
>



-- 
Regards,

Shai Erera

Re: matching products with suggest feature

Posted by Cam Bazz <ca...@gmail.com>.
Hello Shai,

The class that does the matching is Speller.
It is not query based; rather, there is a method called
suggestSimilar(String word, int numSug), where numSug is the number of
suggestions. The words are kept in the index as n-grams. For example, abcde
is kept as abc bcd cde.
So this is not a normal query like the ones we all know.
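
The n-gram splitting described above can be sketched as follows. This is plain Java with a hypothetical NGramSplitter class and a fixed gram size of 3 to mirror the abcde example; the actual spellchecker may choose gram sizes differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: split a word into overlapping character n-grams. */
final class NGramSplitter {

    /** Returns every substring of length n, sliding one character at a time. */
    static List<String> split(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints [abc, bcd, cde]
        System.out.println(split("abcde", 3));
    }
}
```

Note that a word shorter than the gram size produces no grams at all in this sketch.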

Best regards,
C.B.


On Feb 13, 2008 7:00 PM, Shai Erera <se...@gmail.com> wrote:

> What is the default operator of your QueryParser? Is it AND_OPERATOR or
> OR_OPERATOR? If it's OR, then it's strange. If it's AND, then once you
> add more terms than the document contains, it won't find anything.
>

Re: matching products with suggest feature

Posted by Shai Erera <se...@gmail.com>.
What is the default operator of your QueryParser? Is it AND_OPERATOR or
OR_OPERATOR? If it's OR, then it's strange. If it's AND, then once you
add more terms than the document contains, it won't find anything.

On Feb 13, 2008 6:54 PM, Cam Bazz <ca...@gmail.com> wrote:

> Hello;
>
> I am trying to make a product matcher based on Lucene's n-gram based
> suggest feature. I made some changes so that, instead of giving the
> speller a dictionary, I feed it a List<String>.
>
> For example, let's say I have
> "HP NC4400 EY605EA CORE 2 DUO T5600 1.83GHz/512MB/80GB/12.1'' NOTEBOOK"
> and I index it with the speller using an n-gram approach.
>
> It works quite well when the user submits something similar through the
> suggest feature. By similar I mean that the string length is roughly equal
> and a word or two might be mistyped or even missing. In those cases Lucene
> finds it.
> However, when the user submits the same product with a much shorter or
> much longer string, for example "HP NC4400 EY605EA" or "HP NC4400 EY605EA
> CORE 2 DUO T5600 1.83GHz/512MB/80GB/12.1'' NOTEBOOK WITH WINDOWS XP AND
> GIFT MOUSE", the suggester won't work.
>
> I am not sure about the n-gram approach any more.
>
> Any ideas, recommendations, or help would be greatly appreciated.
>
> Best Regards,
> C.B.
>



-- 
Regards,

Shai Erera

Re: matching products with suggest feature

Posted by eddiec <tw...@hotmail.com>.
I haven't used Lucene or read the Lucene book in quite a while, since I
handed in my university thesis quite a few years ago. However, I'm
currently building an ecommerce site from an ASP skeleton. The current
search and recommendation algorithms are built on limited SQL searches, but
I'd like to migrate these to an index-based search that produces much more
relevant results. I'd prefer an out-of-the-box product, but funds are
limited. I've already played with the usual toys like Sphider and found
them much too limited, and the ecommerce search solutions are very
expensive. Since my days of working with Lucene, have any tools appeared
around the net that accomplish this goal and integrate with the Lucene
library?

-----
Eddie Crozier AgriEngineering.co.uk -  http://www.agriengineering.co.uk
Vintage Tractor Parts 