Posted to java-user@lucene.apache.org by Clemens Wyss <cl...@mysign.ch> on 2011/02/21 17:05:22 UTC
Suggest search terms
I'd like to suggest search terms to my users. My naïve approach would have been:
After at least n characters have been typed (asynchronously) find terms in IndexReader.terms() which "match"
Is there an (even) more straightforward (and possibly faster) approach to get "search term suggestions"?
Could/Should the terms "per se" be indexed in their own index?
Isn't this a common desire, hence shouldn't/doesn't Lucene support this out-of-the-box? --> Collection<String> IndexReader.termsMatching(String term)
Hope to get some real-life feedback
Thx in advance
Clemens
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Suggest search terms
Posted by Fernando Wasylyszyn <fe...@yahoo.com.ar>.
Well, actually it depends....
If your suggestion terms correspond to the terms in your "main" index, then
you can use TermEnum#docFreq().
Otherwise, if you build a separate index for the suggestions (which do not
correspond to the terms in your main index), you can simply add a
calculated field holding the number of documents that contain the suggestion,
something like:
Document:
"suggestion" field: "Ferrari 455 GT"
"docCount" field: 20
As you control the updates in the suggestions index, this can be achieved in
each update of the main index.
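
The recalculation step could look roughly like the following plain-Java sketch. It does not use the Lucene API: the documents of the main index are stood in for by plain strings, a suggestion "matches" when it occurs as a case-insensitive substring, and the class and method names (SuggestionDocCounts, recount) are hypothetical. In a real setup the resulting count would be written to the "docCount" field of the suggestion document.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: recomputes the "docCount" value of every suggestion.
final class SuggestionDocCounts {

    // Returns suggestion -> number of main documents that contain it.
    // Assumption: "contains" means case-insensitive substring match.
    static Map<String, Integer> recount(List<String> mainDocs, List<String> suggestions) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String suggestion : suggestions) {
            String needle = suggestion.toLowerCase(Locale.ROOT);
            int docCount = 0;
            for (String doc : mainDocs) {
                if (doc.toLowerCase(Locale.ROOT).contains(needle)) {
                    docCount++;
                }
            }
            counts.put(suggestion, docCount);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative sample data, not taken from a real index.
        List<String> docs = Arrays.asList(
                "Ferrari 455 GT for sale", "Used ferrari 455 gt", "Lamborghini");
        List<String> suggestions = Arrays.asList("Ferrari 455 GT", "Lamborghini");
        System.out.println(recount(docs, suggestions));
    }
}
```

Running the recount on every update of the main index keeps the counts in step with it, at the cost of one pass over the documents per suggestion; a real implementation would more likely run one query per suggestion against the main index.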
Regards.
Fernando.
Re: Suggest search terms
Posted by Simon Willnauer <si...@googlemail.com>.
On Tue, Feb 22, 2011 at 11:23 AM, Clemens Wyss <cl...@mysign.ch> wrote:
> Fernando, Uwe thanks for your suggestions.
> Is it possible to get the number of "hits" per term?
> ferrari (125)
> lamborghini (34)
> ...
I think you can just call TermEnum#docFreq(), no?
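
As a rough illustration of the display side, the sketch below turns term/count pairs into labels such as "ferrari (125)", most frequent first. It is plain Java with no Lucene dependency: the map stands in for the values one would read from TermEnum#term() and TermEnum#docFreq() while enumerating, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: formats terms with their document frequency.
final class TermCountLabels {

    // Sorts by document frequency (descending) and renders "term (count)".
    static List<String> labelsByDocFreq(Map<String, Integer> docFreqByTerm) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(docFreqByTerm.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        List<String> labels = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            labels.add(e.getKey() + " (" + e.getValue() + ")");
        }
        return labels;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreqByTerm = new LinkedHashMap<>();
        docFreqByTerm.put("lamborghini", 34);
        docFreqByTerm.put("ferrari", 125);
        System.out.println(labelsByDocFreq(docFreqByTerm)); // prints [ferrari (125), lamborghini (34)]
    }
}
```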
simon
Re: Suggest search terms
Posted by Clemens Wyss <cl...@mysign.ch>.
Fernando, Uwe thanks for your suggestions.
Is it possible to get the number of "hits" per term?
ferrari (125)
lamborghini (34)
...
Re: Suggest search terms
Posted by Fernando Wasylyszyn <fe...@yahoo.com.ar>.
I think that the idea that Uwe mentions is completely valid, although it has a
few disadvantages.
For example, what if you want to offer multiword suggestions but your
index contains only single-word tokens?
Query: Ferrari
Ideal suggestions: Ferrari 354 BT, Ferrari 355 C, Ferrari 356
Tokens in the index: Ferrari, 354, 355, 356, BT, C
RE: Suggest search terms
Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,
I just have a suggestion for your first idea of enumerating terms, which is
very fast if done right:
> I'd like to suggest search terms to my users. My naïve approach would have
> been:
> After at least n characters have been typed (asynchronously) find terms in
> IndexReader.terms() which "match"
It is much easier to use IR.terms() but wrap a PrefixTermEnum around it (it's
in the search package). Then you simply iterate (please don't forget that the
enum is already positioned on the first term! If no such term exists, the
enum's term() returns null). Just use "if (termEnum.term() != null) do { }
while (termEnum.next() && numberOfTermsCollected <= max)". With Lucene
trunk this is much better now, but with 3.x you have to use this ugly
iteration.
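
The idea behind the prefix enumeration (jump to the first term at or after the prefix, then walk the sorted terms until one no longer matches or enough have been collected) can be sketched in plain Java. This is not the PrefixTermEnum API: a TreeSet stands in for the index's sorted term dictionary, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical stand-in for a prefix enumeration over a sorted term dictionary.
final class PrefixTermLookup {

    // Collects at most max terms that start with prefix, in sorted order.
    static List<String> termsWithPrefix(NavigableSet<String> sortedTerms, String prefix, int max) {
        List<String> collected = new ArrayList<>();
        // tailSet positions the iteration on the first term >= prefix,
        // so all matching terms follow contiguously from there.
        for (String term : sortedTerms.tailSet(prefix, true)) {
            if (collected.size() >= max || !term.startsWith(prefix)) {
                break; // enough terms, or we have left the prefix range
            }
            collected.add(term);
        }
        return collected;
    }

    public static void main(String[] args) {
        // Illustrative sample terms.
        NavigableSet<String> terms = new TreeSet<>(
                Arrays.asList("fiat", "ferrari", "ferry", "ford", "lamborghini"));
        System.out.println(termsWithPrefix(terms, "fer", 10)); // prints [ferrari, ferry]
    }
}
```

Because the terms are sorted, the cost depends on the number of matches rather than on the size of the dictionary, which is why enumerating terms this way is fast.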
Uwe
Re: Suggest search terms
Posted by Fernando Wasylyszyn <fe...@yahoo.com.ar>.
Hello Clemens: a short time ago, I faced exactly the same problem. Using
Apache Solr, I built a "suggest" index as a completely separate index, which
indexes all the possible suggestion terms (terms that come from the documents
to be indexed), using n-grams from a minimum to a maximum number of characters.
For example: if "ferrari" is a valid suggestion term, then it will be indexed
as the following (each n-gram is a term in the index):
f
fe
fer
ferr
ferra
ferrar
ferrari
Of course, the minimum and maximum n-gram lengths should be tuned so that
the index does not grow too large. For example, you can start indexing at
the first three characters:
fer
ferr
ferra
ferrar
ferrari
The token filter that I used for this is:
org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter
Take a look at that class.
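
To make the n-gram expansion concrete, here is a small plain-Java sketch that produces the same front-edge n-grams as the lists above. It only mimics the output for a single term and is not the EdgeNGramTokenFilter itself; the class name, method name and parameter names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: front-edge n-grams of a single term.
final class EdgeNGrams {

    // Returns the prefixes of term whose length lies between minGram and
    // maxGram (inclusive), shortest first. maxGram is capped at the term length.
    static List<String> edgeNGrams(String term, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("require 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        int upper = Math.min(maxGram, term.length());
        for (int len = minGram; len <= upper; len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("ferrari", 3, 7)); // prints [fer, ferr, ferra, ferrar, ferrari]
    }
}
```

With the n-grams indexed as terms, a lookup for the characters typed so far becomes an exact term match instead of a prefix scan, which is the trade of index size for query speed described above.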
Regards.
Fernando.