Posted to solr-user@lucene.apache.org by Sentsov Eugeny <eu...@gmail.com> on 2011/09/20 12:21:01 UTC

autocomplete with popularity

hello,
Is there an autocomplete that counts requests and sorts suggestions according
to this count? I.e., if users request "redlands" 50 times and "reckless" 20
times, then suggestions for "re" should be
"redlands"
"reckless"

Re: autocomplete with popularity

Posted by Markus Jelsma <ma...@openindex.io>.

> The original request was for suggestions ranked purely by request count.
> You have designed something more complicated that probably works better.
> 
> When I built query completion at Netflix, I used the movie rental rates to
> rank suggestions. That was simple and very effective. We didn't need a
> more complicated system because we started with a good metric.

Good point! I got carried away by a user asking about sorting on request 
count. A metric like the one you describe is a lot easier indeed :)

Cheers

> 
> wunder
> 
> On Sep 20, 2011, at 4:34 PM, Markus Jelsma wrote:
> >> Of course you can fight spam. And the spammers can fight back. I prefer
> >> algorithms that don't require an arms race with spammers.
> >> 
> >> There are other problems with using query frequency. What about all the
> >> legitimate users that type "google" or "facebook" into the query box
> >> instead of into the location bar? What about the frequent queries that
> >> don't match anything on your site?
> > 
> > How would that be a problem if you collect the information? The query
> > logs provide numFound and QTime and a lot more information, and we
> > collect cookie IDs and (hashed) IP addresses for the same request.
> > 
> > We also collect the type of query that is issued, so we can identify a
> > _legitimate_ (this is something we can reasonably detect) user using the
> > same search terms when sorting, paging, faceting, etc. If it is not a
> > legitimate user we can act accordingly.
> > 
> > This would count as +1 for the search term. The final count can then be
> > passed through a logarithm to flatten it out. If things still get out of
> > control, we would most likely be dealing with a DoS attack instead.
> > 
> >> If an algorithm needs that many patches, it is fundamentally a weak
> >> approach.
> > 
> > I do not agree. There are many conditions to consider.
> > 
> >> wunder
> >> 
> >> On Sep 20, 2011, at 4:11 PM, Markus Jelsma wrote:
> >>> A query log parser can be written to detect spam. At first you can use
> >>> cookies (e.g. sessions) and IP-addresses to detect term spam. You can
> >>> also limit a popularity spike to a reasonable mean size over a longer
> >>> period. And you can limit rates using logarithms.
> >>> 
> >>> There are many ways to deal with spam and maintain decent statistics.
> >>> 
> >>> In practice, it's not a big problem on most sites.
> >>> 
> >>>> Ranking suggestions based on query count would be trivially easy to
> >>>> spam. Have a bot make my preferred queries over and over again, and
> >>>> "boom" they are the most-preferred.

Re: autocomplete with popularity

Posted by Walter Underwood <wu...@wunderwood.org>.

The original request was for suggestions ranked purely by request count. You have designed something more complicated that probably works better.

When I built query completion at Netflix, I used the movie rental rates to rank suggestions. That was simple and very effective. We didn't need a more complicated system because we started with a good metric.

wunder

On Sep 20, 2011, at 4:34 PM, Markus Jelsma wrote:

> 
>> Of course you can fight spam. And the spammers can fight back. I prefer
>> algorithms that don't require an arms race with spammers.
>> 
>> There are other problems with using query frequency. What about all the
>> legitimate users that type "google" or "facebook" into the query box
>> instead of into the location bar? What about the frequent queries that
>> don't match anything on your site?
> 
> How would that be a problem if you collect the information? The query logs
> provide numFound and QTime and a lot more information, and we collect cookie
> IDs and (hashed) IP addresses for the same request.
> 
> We also collect the type of query that is issued, so we can identify a
> _legitimate_ (this is something we can reasonably detect) user using the same
> search terms when sorting, paging, faceting, etc. If it is not a legitimate
> user we can act accordingly.
> 
> This would count as +1 for the search term. The final count can then be
> passed through a logarithm to flatten it out. If things still get out of
> control, we would most likely be dealing with a DoS attack instead.
> 
>> 
>> If an algorithm needs that many patches, it is fundamentally a weak
>> approach.
> 
> I do not agree. There are many conditions to consider.
> 
>> 
>> wunder
>> 
>> On Sep 20, 2011, at 4:11 PM, Markus Jelsma wrote:
>>> A query log parser can be written to detect spam. At first you can use
>>> cookies (e.g. sessions) and IP-addresses to detect term spam. You can
>>> also limit a popularity spike to a reasonable mean size over a longer
>>> period. And you can limit rates using logarithms.
>>> 
>>> There are many ways to deal with spam and maintain decent statistics.
>>> 
>>> In practice, it's not a big problem on most sites.
>>> 
>>>> Ranking suggestions based on query count would be trivially easy to
>>>> spam. Have a bot make my preferred queries over and over again, and
>>>> "boom" they are the most-preferred.

Re: autocomplete with popularity

Posted by Markus Jelsma <ma...@openindex.io>.

> Of course you can fight spam. And the spammers can fight back. I prefer
> algorithms that don't require an arms race with spammers.
> 
> There are other problems with using query frequency. What about all the
> legitimate users that type "google" or "facebook" into the query box
> instead of into the location bar? What about the frequent queries that
> don't match anything on your site?

How would that be a problem if you collect the information? The query logs
provide numFound and QTime and a lot more information, and we collect cookie
IDs and (hashed) IP addresses for the same request.

We also collect the type of query that is issued, so we can identify a
_legitimate_ (this is something we can reasonably detect) user using the same
search terms when sorting, paging, faceting, etc. If it is not a legitimate
user we can act accordingly.

This would count as +1 for the search term. The final count can then be
passed through a logarithm to flatten it out. If things still get out of
control, we would most likely be dealing with a DoS attack instead.
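
A minimal sketch of that counting rule, assuming each log entry has already been parsed into a dict. The record layout and the `cookie`, `q`, and `numFound` keys are assumptions made for illustration, not a real Solr log format. Queries that match nothing are skipped, and a user who repeats the same term while paging or sorting contributes only one count:

```python
from collections import Counter

def count_legitimate(records):
    """Count each search term once per user, skipping zero-hit queries."""
    seen = set()
    counts = Counter()
    for rec in records:
        term = rec["q"].strip().lower()
        if rec["numFound"] == 0:
            continue  # the query matched nothing on the site
        key = (rec["cookie"], term)
        if key in seen:
            continue  # same user paging, sorting or faceting on the same term
        seen.add(key)
        counts[term] += 1
    return counts

# Hypothetical parsed log records.
records = [
    {"cookie": "a", "q": "redlands", "numFound": 12},
    {"cookie": "a", "q": "redlands", "numFound": 12},  # page 2 of the same search
    {"cookie": "b", "q": "Redlands", "numFound": 12},
    {"cookie": "c", "q": "facebook", "numFound": 0},
]
counts = count_legitimate(records)
print(counts["redlands"])  # 2
```

Deciding what makes a user "legitimate" is left out here; the sketch only shows where such a check would sit in the loop.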

> 
> If an algorithm needs that many patches, it is fundamentally a weak
> approach.

I do not agree. There are many conditions to consider.

> 
> wunder
> 
> On Sep 20, 2011, at 4:11 PM, Markus Jelsma wrote:
> > A query log parser can be written to detect spam. At first you can use
> > cookies (e.g. sessions) and IP-addresses to detect term spam. You can
> > also limit a popularity spike to a reasonable mean size over a longer
> > period. And you can limit rates using logarithms.
> > 
> > There are many ways to deal with spam and maintain decent statistics.
> > 
> > In practice, it's not a big problem on most sites.
> > 
> >> Ranking suggestions based on query count would be trivially easy to
> >> spam. Have a bot make my preferred queries over and over again, and
> >> "boom" they are the most-preferred.

Re: autocomplete with popularity

Posted by Walter Underwood <wu...@wunderwood.org>.

Of course you can fight spam. And the spammers can fight back. I prefer algorithms that don't require an arms race with spammers.

There are other problems with using query frequency. What about all the legitimate users that type "google" or "facebook" into the query box instead of into the location bar? What about the frequent queries that don't match anything on your site?

If an algorithm needs that many patches, it is fundamentally a weak approach.

wunder

On Sep 20, 2011, at 4:11 PM, Markus Jelsma wrote:

> A query log parser can be written to detect spam. At first you can use cookies 
> (e.g. sessions) and IP-addresses to detect term spam. You can also limit a 
> popularity spike to a reasonable mean size over a longer period. And you can 
> limit rates using logarithms.
> 
> There are many ways to deal with spam and maintain decent statistics. 
> 
> In practice, it's not a big problem on most sites.
> 
>> Ranking suggestions based on query count would be trivially easy to spam.
>> Have a bot make my preferred queries over and over again, and "boom" they
>> are the most-preferred.

Re: autocomplete with popularity

Posted by Markus Jelsma <ma...@openindex.io>.

A query log parser can be written to detect spam. At first you can use cookies 
(e.g. sessions) and IP-addresses to detect term spam. You can also limit a 
popularity spike to a reasonable mean size over a longer period. And you can 
limit rates using logarithms.

There are many ways to deal with spam and maintain decent statistics. 

In practice, it's not a big problem on most sites.
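
As a rough illustration of two of these ideas (one vote per client, and logarithmic flattening), here is a small sketch. The `(client_id, term)` event format is an assumption, where `client_id` stands for a cookie or a hashed IP address. Limiting spikes against a long-term mean is not shown:

```python
import math
from collections import defaultdict

def dampened_scores(events):
    """events: iterable of (client_id, term) pairs taken from the query log."""
    clients_per_term = defaultdict(set)
    for client_id, term in events:
        # A client counts at most once per term, however often it repeats it.
        clients_per_term[term.strip().lower()].add(client_id)
    # log1p flattens large counts so that one runaway term cannot dominate.
    return {t: math.log1p(len(c)) for t, c in clients_per_term.items()}

# One bot repeating a query 1000 times versus many distinct users.
events = [("bot", "cheap pills")] * 1000
events += [(f"user{i}", "redlands") for i in range(50)]
events += [(f"user{i}", "reckless") for i in range(20)]

scores = dampened_scores(events)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['redlands', 'reckless', 'cheap pills']
```

A bot that rotates cookies and addresses would still get through, so this only raises the cost of spamming.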

> Ranking suggestions based on query count would be trivially easy to spam.
> Have a bot make my preferred queries over and over again, and "boom" they
> are the most-preferred.

Re: autocomplete with popularity

Posted by Walter Underwood <wu...@wunderwood.org>.

Ranking suggestions based on query count would be trivially easy to spam. Have a bot make my preferred queries over and over again, and "boom" they are the most-preferred.

wunder

On Sep 20, 2011, at 3:41 PM, Markus Jelsma wrote:

> At least, I assumed this is what the user asked for when I read "which counts
> requests and sorts suggestions according to this count"
> 
>> No. The spellchecker and suggester only operate on the index (tf*idf) and
>> do not account for user-generated input, which is what the user asks for.
>> 
>> You need to periodically parse the query logs and index the query strings
>> and their number of occurrences in the logs as a float value (or use
>> ExternalFileField) to obtain a popularity rate to sort on.
>> 
>> This new index can then be queried as auto suggest; n-grams are commonly
>> used for this. This means both "redlands" and "reckless" are returned for
>> the query "re". Sort it and you've got the desired result.
>> 
>> I would not recommend storing user input (the queries) and the actual
>> documents in the same index, for many reasons.
>> 
>>> From http://wiki.apache.org/solr/Suggester :
>>> 
>>> spellcheck.onlyMorePopular=true - if this parameter is set to true then
>>> the suggestions will be sorted by weight ("popularity") - the count
>>> parameter will effectively limit this to a top-N list of best
>>> suggestions. If this is set to false then suggestions are sorted
>>> alphabetically.
>>> 
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/autocomplete-with-popularity-tp3352755p3352919.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.

Re: autocomplete with popularity

Posted by Markus Jelsma <ma...@openindex.io>.

At least, I assumed this is what the user asked for when I read "which counts
requests and sorts suggestions according to this count"

> No. The spellchecker and suggester only operate on the index (tf*idf) and
> do not account for user-generated input, which is what the user asks for.
> 
> You need to periodically parse the query logs and index the query strings
> and their number of occurrences in the logs as a float value (or use
> ExternalFileField) to obtain a popularity rate to sort on.
> 
> This new index can then be queried as auto suggest; n-grams are commonly
> used for this. This means both "redlands" and "reckless" are returned for
> the query "re". Sort it and you've got the desired result.
> 
> I would not recommend storing user input (the queries) and the actual
> documents in the same index, for many reasons.
> 
> > From http://wiki.apache.org/solr/Suggester :
> > 
> > spellcheck.onlyMorePopular=true - if this parameter is set to true then
> > the suggestions will be sorted by weight ("popularity") - the count
> > parameter will effectively limit this to a top-N list of best
> > suggestions. If this is set to false then suggestions are sorted
> > alphabetically.
> > 
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/autocomplete-with-popularity-tp3352755p3352919.html
> > Sent from the Solr - User mailing list archive at Nabble.com.

Re: autocomplete with popularity

Posted by Markus Jelsma <ma...@openindex.io>.

No. The spellchecker and suggester only operate on the index (tf*idf) and do
not account for user-generated input, which is what the user asks for.

You need to periodically parse the query logs and index the query strings and
their number of occurrences in the logs as a float value (or use
ExternalFileField) to obtain a popularity rate to sort on.

This new index can then be queried as auto suggest; n-grams are commonly used 
for this. This means both "redlands" and "reckless" are returned for the query 
"re". Sort it and you've got the desired result.

I would not recommend storing user input (the queries) and the actual 
documents in the same index, for many reasons.
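
The idea can be sketched outside Solr in a few lines of Python. This is only an in-memory stand-in for the separate suggestion index described above: a plain prefix test replaces the n-gram field, and the function names and sample log are made up for illustration.

```python
from collections import Counter

def build_popularity(logged_queries):
    """Count how often each normalized query string appears in the log."""
    return Counter(q.strip().lower() for q in logged_queries if q.strip())

def suggest(prefix, popularity, limit=10):
    """Return logged queries starting with `prefix`, most requested first."""
    prefix = prefix.strip().lower()
    matches = [(q, n) for q, n in popularity.items() if q.startswith(prefix)]
    # Sort by descending count, then alphabetically so ties are deterministic.
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return [q for q, _ in matches[:limit]]

# Hypothetical query log matching the numbers in the original question.
log = ["redlands"] * 50 + ["reckless"] * 20 + ["solr"] * 5
popularity = build_popularity(log)
print(suggest("re", popularity))  # ['redlands', 'reckless']
```

In a real setup the counts would be rebuilt periodically from the query logs and stored as the sort field of the suggestion index.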

> From http://wiki.apache.org/solr/Suggester :
> 
> spellcheck.onlyMorePopular=true - if this parameter is set to true then the
> suggestions will be sorted by weight ("popularity") - the count parameter
> will effectively limit this to a top-N list of best suggestions. If this is
> set to false then suggestions are sorted alphabetically.
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/autocomplete-with-popularity-tp3352755p3352919.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: autocomplete with popularity

Posted by "O. Klein" <kl...@octoweb.nl>.

From http://wiki.apache.org/solr/Suggester :

spellcheck.onlyMorePopular=true - if this parameter is set to true then the
suggestions will be sorted by weight ("popularity") - the count parameter
will effectively limit this to a top-N list of best suggestions. If this is
set to false then suggestions are sorted alphabetically. 
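
For illustration, a request using these parameters could be assembled as below. The host, port, and `/suggest` handler path are assumptions that depend on the local solrconfig.xml; only `spellcheck.onlyMorePopular` and `spellcheck.count` come from the wiki text quoted above.

```python
from urllib.parse import urlencode

# Assumed host and handler path; adjust to the local Solr configuration.
base = "http://localhost:8983/solr/suggest"
params = {
    "q": "re",                             # the prefix typed so far
    "spellcheck": "true",
    "spellcheck.onlyMorePopular": "true",  # sort suggestions by weight
    "spellcheck.count": 5,                 # keep only the top 5 suggestions
}
url = base + "?" + urlencode(params)
print(url)
```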

--
View this message in context: http://lucene.472066.n3.nabble.com/autocomplete-with-popularity-tp3352755p3352919.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: autocomplete with popularity

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Eugeny,

I think you want something more useful and less problematic, as Wunder already pointed out.

Wouldn't you want your suggestions to be ordered by how close a match they are? And do you really want them to be purely prefix-based, like in your example?

What if people are searching for Michael Jackson a lot, but a person comes and starts typing "Jackso..."? Would you not want to suggest Michael Jackson? This is not to say you can't mix in popularity or some other factors that you know you can rely on.
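
One way to picture this is the sketch below. It matches the typed text against the start of any word in a phrase, ranks phrases that begin with the typed text first, and uses popularity to order the rest. The function name, the sample counts, and the ranking rule are all assumptions chosen for illustration, not how any particular product works.

```python
def suggest_any_word(typed, popularity, limit=5):
    """Match `typed` against the start of the phrase or of any word in it."""
    typed = typed.strip().lower()
    scored = []
    for phrase, count in popularity.items():
        lowered = phrase.lower()
        starts_phrase = lowered.startswith(typed)
        if starts_phrase or any(w.startswith(typed) for w in lowered.split()):
            # Closer matches (phrase prefix) first, then higher popularity.
            scored.append((not starts_phrase, -count, phrase))
    scored.sort()
    return [phrase for _, _, phrase in scored[:limit]]

# Hypothetical popularity counts.
popularity = {"Michael Jackson": 900, "Jackson Browne": 40, "Jack Johnson": 300}
print(suggest_any_word("jackso", popularity))
# ['Jackson Browne', 'Michael Jackson']
```

Whether match quality or popularity should dominate is a design choice; swapping the first two tuple fields would put "Michael Jackson" on top.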

Try the AutoComplete on http://search-lucene.com/ to see whether that feels like the right search experience. For example, start typing the word "expert". Because matched (sub)strings are bold, you will easily see where in the suggested phrases this matches.

See also: http://sematext.com/products/autocomplete/index.html - I think one of the example configurations that this thing comes with actually does show how to mix in something like popularity.


Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


----- Original Message -----
> From: Sentsov Eugeny <eu...@gmail.com>
> To: solr-user@lucene.apache.org
> Cc: 
> Sent: Tuesday, September 20, 2011 1:05 PM
> Subject: autocomplete with popularity
> 
> hello,
> Is there an autocomplete that counts requests and sorts suggestions according
> to this count? I.e., if users request "redlands" 50 times and "reckless" 20
> times, then suggestions for "re" should be
> "redlands"
> "reckless"
>
