
Posted to java-user@lucene.apache.org by Clemens Wyss <cl...@mysign.ch> on 2011/02/21 17:05:22 UTC

Suggest search terms

I'd like to suggest search terms to my users. My naïve approach would have been:
After at least n characters have been typed, (asynchronously) find the terms in IndexReader.terms() which "match".

Is there an (even) more straightforward (and possibly faster) approach to get "search term suggestions"?
Could/should the terms "per se" be indexed in an index of their own?
Isn't this a common desire, hence shouldn't/doesn't Lucene support this out of the box? --> Collection<String> IndexReader.termsMatching(String term)

Hope to get some real-life feedback

Thx in advance
Clemens

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Suggest search terms

Posted by Fernando Wasylyszyn <fe...@yahoo.com.ar>.

Well, actually it depends...
If your suggestion terms correspond to the terms in your "main" index, then
you can use TermEnum#docFreq().
Otherwise, if you build a separate index for the suggestions (which do not
correspond to the terms in your main index), then you can just add a
calculated field with the number of documents that contain the suggestion,
something like:

Document:
    "suggestion" field: "Ferrari 455 GT"
    "docCount" field: 20

Since you control the updates to the suggestions index, this count can be
recalculated on each update of the main index.
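
A minimal plain-Java sketch of that bookkeeping, for illustration only: it does
not use Lucene, and the class name SuggestionCounts, the method recompute and
the simple substring match are assumptions of this sketch rather than anything
from the thread. It recomputes a docCount for each suggestion from the main
documents, which is the step that would run on each update of the main index.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a separate suggestions index: each entry pairs a
// "suggestion" with a precomputed "docCount", mirroring the two fields above.
class SuggestionCounts {

    // Recomputes docCount for every suggestion by scanning the main documents.
    // Matching is a case-insensitive substring test (an assumption of this sketch).
    static Map<String, Integer> recompute(List<String> suggestions, List<String> mainDocuments) {
        Map<String, Integer> docCounts = new LinkedHashMap<>();
        for (String suggestion : suggestions) {
            String needle = suggestion.toLowerCase(Locale.ROOT);
            int count = 0;
            for (String doc : mainDocuments) {
                if (doc.toLowerCase(Locale.ROOT).contains(needle)) {
                    count++;
                }
            }
            docCounts.put(suggestion, count);
        }
        return docCounts;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "Red Ferrari 455 GT for sale",
                "ferrari 455 gt, low mileage",
                "Lamborghini Diablo");
        recompute(List.of("Ferrari 455 GT", "Lamborghini"), docs)
                .forEach((suggestion, count) ->
                        System.out.println(suggestion + " (" + count + ")"));
    }
}
```

With a real suggestions index, the resulting count would be stored in the
"docCount" field next to the "suggestion" field.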

Regards.
Fernando.





Re: Suggest search terms

Posted by Simon Willnauer <si...@googlemail.com>.

On Tue, Feb 22, 2011 at 11:23 AM, Clemens Wyss <cl...@mysign.ch> wrote:
> Fernando, Uwe thanks for your suggestions.
> Is it possible to get the number of "hits" per term?
> ferrari (125)
> lamborghini (34)
> ...

I think you can just call TermEnum#docFreq(), no?
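
To illustrate what such a document frequency means, here is a small plain-Java
sketch, not Lucene code: it builds a sorted term-to-docFreq map from a handful
of made-up documents and renders the entries in the "term (count)" form asked
about. The class TermDocFreq, its methods and the simple tokenization are
assumptions for this example; with Lucene 3.x the count would come from
TermEnum#docFreq() instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper: counts, per term, the number of documents containing it.
class TermDocFreq {

    // A term that occurs several times in one document is counted once for
    // that document, which is what distinguishes docFreq from a raw term count.
    static SortedMap<String, Integer> docFreqs(List<String> documents) {
        SortedMap<String, Integer> freqs = new TreeMap<>();
        for (String doc : documents) {
            Set<String> distinctTerms =
                    new HashSet<>(Arrays.asList(doc.toLowerCase(Locale.ROOT).split("\\W+")));
            distinctTerms.remove(""); // split() may yield a leading empty token
            for (String term : distinctTerms) {
                freqs.merge(term, 1, Integer::sum);
            }
        }
        return freqs;
    }

    // Renders each entry as "term (docFreq)", in sorted term order.
    static List<String> labels(SortedMap<String, Integer> freqs) {
        List<String> labels = new ArrayList<>();
        freqs.forEach((term, docFreq) -> labels.add(term + " (" + docFreq + ")"));
        return labels;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("Ferrari F40", "ferrari ferrari 355", "Lamborghini Diablo");
        labels(docFreqs(docs)).forEach(System.out::println);
    }
}
```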

simon

Re: Suggest search terms

Posted by Clemens Wyss <cl...@mysign.ch>.

Fernando, Uwe, thanks for your suggestions.
Is it possible to get the number of "hits" per term?
ferrari (125)
lamborghini (34)
...


Re: Suggest search terms

Posted by Fernando Wasylyszyn <fe...@yahoo.com.ar>.

I think that the idea that Uwe mentions is completely valid, although it has a
few disadvantages.

For example, what if you want to offer multiword suggestions while your index
contains only single-word tokens?

Query: Ferrari
Ideal suggestions: Ferrari 354 BT, Ferrari 355 C, Ferrari 356
Index has the tokens: Ferrari, 354, 355, 356, BT, C
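
The gap can be made concrete with a short plain-Java sketch (no Lucene
involved; the class MultiwordSuggestions and its method startingWith are
hypothetical names for this example). The same prefix lookup is run once over
the single-word tokens and once over a set of whole phrases: only the second
can return the multiword suggestions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class MultiwordSuggestions {

    // Returns all entries that start with the given prefix. In a sorted set
    // these entries are contiguous, beginning at the first entry >= prefix.
    static List<String> startingWith(SortedSet<String> entries, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String entry : entries.tailSet(prefix)) {
            if (!entry.startsWith(prefix)) {
                break; // left the prefix range, nothing further can match
            }
            matches.add(entry);
        }
        return matches;
    }

    public static void main(String[] args) {
        SortedSet<String> tokens =
                new TreeSet<>(List.of("Ferrari", "354", "355", "356", "BT", "C"));
        SortedSet<String> phrases =
                new TreeSet<>(List.of("Ferrari 354 BT", "Ferrari 355 C", "Ferrari 356"));

        System.out.println(startingWith(tokens, "Ferrari"));   // single-word tokens only
        System.out.println(startingWith(phrases, "Ferrari"));  // whole phrases
    }
}
```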






RE: Suggest search terms

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi,

I just have a suggestion regarding your first idea of enumerating terms, which
is very fast if done right:

> I'd like to suggest search terms to my users. My naïve approach would have
> been:
> After at least n characters have been typed (asynchronously) find terms in
> IndexReader.terms()  which "match"

Much easier is to use IR.terms(), but wrap a PrefixTermEnum around it (it's
in the search package). Then you simply iterate (please don't forget that the
enum is already positioned on the first term! If no such term exists, the
enum's term() returns null). Just use "if (termEnum.term() != null) do { }
while (termEnum.next() && numberOfTermsCollected < max)". With Lucene trunk
this is much better now, but with 3.x you have to use this ugly iteration.

Uwe
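
The iteration pattern described above can be sketched in plain Java as follows.
This is not Lucene code: PrefixCursor is a hypothetical stand-in that only
mimics the behaviour described (already positioned on the first match after
construction, term() returning null when nothing matches, next() returning a
boolean), and collect is an assumed helper name.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for a prefix-restricted term enumeration over a
// sorted term dictionary.
class PrefixCursor {
    private final Iterator<String> remaining;
    private final String prefix;
    private String current;

    PrefixCursor(SortedSet<String> sortedTerms, String prefix) {
        this.prefix = prefix;
        this.remaining = sortedTerms.tailSet(prefix).iterator();
        advance(); // position on the first matching term right away
    }

    // Current term, or null if there is no (further) match.
    String term() {
        return current;
    }

    // Moves to the next matching term; false once the prefix range is exhausted.
    boolean next() {
        advance();
        return current != null;
    }

    private void advance() {
        current = null;
        if (remaining.hasNext()) {
            String candidate = remaining.next();
            if (candidate.startsWith(prefix)) {
                current = candidate;
            }
        }
    }

    // The "ugly" loop: check term() first, then do/while with a limit.
    static List<String> collect(SortedSet<String> sortedTerms, String prefix, int max) {
        List<String> collected = new ArrayList<>();
        PrefixCursor cursor = new PrefixCursor(sortedTerms, prefix);
        if (max > 0 && cursor.term() != null) {
            do {
                collected.add(cursor.term());
            } while (collected.size() < max && cursor.next());
        }
        return collected;
    }

    public static void main(String[] args) {
        SortedSet<String> terms =
                new TreeSet<>(List.of("ferrari", "ferret", "fern", "fiat", "lamborghini"));
        System.out.println(collect(terms, "fer", 2));
    }
}
```

Checking the limit before calling next() avoids advancing the cursor once
enough terms have been collected.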



Re: Suggest search terms

Posted by Fernando Wasylyszyn <fe...@yahoo.com.ar>.

Hello Clemens: a short time ago, I faced exactly the same problem. Using
Apache Solr, I built a "suggest" index as a completely separate index, which
indexes all the possible terms for suggestions (terms that come from the
documents to be indexed), using n-grams from a minimum to a maximum number of
characters.

For example: if "ferrari" is a valid term to suggest, then it will be indexed
as the following (each n-gram is a term in the index):

f
fe
fer
ferr
ferra
ferrar
ferrari

Of course, the minimum and maximum n-gram lengths should be tuned so that the
index does not grow too large. For example, you can start indexing at the
first three characters:

fer
ferr
ferra
ferrar
ferrari

The token filter that I used for this is:

org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter

Take a look at that class.
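
For illustration, the expansion shown above can be reproduced with a few lines
of plain Java. This is only a sketch of the front-edge n-gram idea, not the
filter itself: the class EdgeNGrams and the parameter names minGram and maxGram
are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

class EdgeNGrams {

    // Returns the prefixes of the term whose length lies between minGram and
    // maxGram (inclusive), shortest first. Terms shorter than minGram yield
    // an empty list.
    static List<String> of(String term, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("require 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, term.length());
        for (int length = minGram; length <= longest; length++) {
            grams.add(term.substring(0, length));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(of("ferrari", 1, 7)); // starts at "f"
        System.out.println(of("ferrari", 3, 7)); // starts at "fer"
    }
}
```

Raising minGram from 1 to 3 drops the two shortest entries per term, which is
the index-size trade-off mentioned above.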
Regards.
Fernando.




