Posted to solr-user@lucene.apache.org by amid <am...@donanza.com> on 2015/06/10 19:26:47 UTC

The best way to exclude "seen" results from search queries

Hi,

We have a solr index with ~1M documents.
We want to let our users filter results out of their queries, meaning
those results will not be shown again for any query by that user (we
currently have 10K users).

Think of a scenario like a "recommendation engine" in which you don't
want to give the same recommendation to a user more than once.

What is the best way to implement this feature (Performance & Memory)?

Thanks,
Ami



--
View this message in context: http://lucene.472066.n3.nabble.com/The-best-way-to-exclude-seen-results-from-search-queries-tp4211022.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: The best way to exclude "seen" results from search queries

Posted by Upayavira <uv...@odoko.co.uk>.

On Thu, Jun 11, 2015, at 07:20 PM, amid wrote:
> Thanks Charles,
>
> We thought of using a multi-valued field but got the feeling it will not
> stay small as our data grows.
> Another issue with a multi-valued field is that you can't create a
> complex join query, while using a different collection whose documents
> have more than one field (e.g. recommendation_date) can help us easily
> delete recommendations or limit the amount of time a recommendation
> will not be shown again.
>
> Thanks for your answer, seems like replication & load balancing will be
> good enough for now :)

Regarding multivalued, I agree with your assessment. Yes, limit the
number of returned recommendations by date; that will help avoid high
cardinality and thus poor performance.

Effectively what the join does is say, "go find me the IDs of all docs
which were recommended to user $USER. Now, in my original index, please
find all docs for this list of IDs". The more IDs there are, the worse
the performance.
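
To make the two steps concrete, here is a toy in-memory model in Python. It
is only a sketch of the semantics described above, not how Solr executes a
join internally; the field names (user_id, rec_id, date) and the sample data
are made up for illustration.

```python
from datetime import date

def seen_ids_for(user_id, seen_docs, since=None):
    """Step 1 of the join: collect the IDs recommended to one user.

    Passing `since` keeps only recent recommendations, which keeps the
    ID list (and therefore the cost of step 2) small.
    """
    return {
        d["rec_id"]
        for d in seen_docs
        if d["user_id"] == user_id and (since is None or d["date"] >= since)
    }

def exclude_seen(result_ids, user_id, seen_docs, since=None):
    """Step 2: drop every result whose ID is in that user's seen list."""
    seen = seen_ids_for(user_id, seen_docs, since)
    return [doc_id for doc_id in result_ids if doc_id not in seen]

# Hypothetical "recommendations already delivered" collection.
seen_docs = [
    {"user_id": "u1", "rec_id": "a", "date": date(2015, 1, 5)},
    {"user_id": "u1", "rec_id": "b", "date": date(2015, 6, 1)},
    {"user_id": "u2", "rec_id": "c", "date": date(2015, 6, 2)},
]
results = ["a", "b", "c", "d"]

all_time = exclude_seen(results, "u1", seen_docs)
recent_only = exclude_seen(results, "u1", seen_docs, since=date(2015, 5, 1))
```

With the date cutoff, the older recommendation "a" becomes eligible again,
and the set built in step 1 shrinks accordingly.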

Upayavira

RE: The best way to exclude "seen" results from search queries

Posted by amid <am...@donanza.com>.

Thanks Charles,

We thought of using a multi-valued field but got the feeling it will not
stay small as our data grows.
Another issue with a multi-valued field is that you can't create a complex
join query, while using a different collection whose documents have more
than one field (e.g. recommendation_date) can help us easily delete
recommendations or limit the amount of time a recommendation will not be
shown again.

Thanks for your answer, seems like replication & load balancing will be good
enough for now :)

Thanks a lot, Ami


Re: The best way to exclude "seen" results from search queries

Posted by Upayavira <uv...@odoko.co.uk>.

It is the number of recommendations for a single user that matters. The
more there are, the worse the performance. Trying it and seeing is the
best way to find out, though.

I personally would have one doc per recommendation. It will reduce the
amount of churn in your index as updating a multivalued field will
involve deleting the entire document that preceded it, which will then
need merging, etc. One doc per recommendation effectively makes your
index write-only, which is much cleaner.
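
As a rough sketch of the one-doc-per-recommendation idea, the Python below
builds one small document for each (user, recommendation) pair, so delivering
a new recommendation only ever adds a document. The field names (user_id,
rec_id, recommendation_date) and the "user_recommendation" ID format are
assumptions for illustration, not a required schema.

```python
import json
from datetime import datetime, timezone

def seen_doc(user_id, rec_id, seen_at=None):
    """Build one document recording that rec_id was shown to user_id."""
    # Default to "now"; pass a timezone-aware datetime to override.
    seen_at = (seen_at or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        # Hypothetical composite key: the same pair always maps to one doc.
        "id": "%s_%s" % (user_id, rec_id),
        "user_id": str(user_id),
        "rec_id": str(rec_id),
        # ISO-8601 UTC timestamp, the usual format for Solr date fields.
        "recommendation_date": seen_at.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }

# A JSON array of documents that could be sent to the update handler.
doc = seen_doc(42, "doc-17", datetime(2015, 6, 11, tzinfo=timezone.utc))
payload = json.dumps([doc])
```

Because no existing document is rewritten, old entries can later be expired
by date without touching a per-user document.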

Regarding sharding, you can shard your original index, but a replica of
your user recommendations collection must exist on every shard/replica
of that original index. It cannot be sharded.

HTH

Upayavira

On Thu, Jun 11, 2015, at 06:06 PM, Reitzel, Charles wrote:
> So long as the fields are indexed, I think performance should be ok.
> 
> Personally, I would also look at using a single document per user with a
> multi-valued field for recommendation ID.   Assuming only a small
> fraction of all recommendation IDs are ever presented to any single user,
> this schema would be physically much smaller and require only a single
> document per user.
> 
> I don't know the answer to your sharding question.   The join query is
> available out of the box, so it should be quick work to set up a
> two-shard sample and test the distributed sub-query.
> 
> That said, with the scales you are talking about, I question if sharding
> is necessary.   You can still use replication for load balancing without
> sharding.

RE: The best way to exclude "seen" results from search queries

Posted by "Reitzel, Charles" <Ch...@tiaa-cref.org>.

So long as the fields are indexed, I think performance should be ok.

Personally, I would also look at using a single document per user with a multi-valued field for recommendation ID. Assuming only a small fraction of all recommendation IDs are ever presented to any single user, this schema would be physically much smaller and require only a single document per user.

I don't know the answer to your sharding question. The join query is available out of the box, so it should be quick work to set up a two-shard sample and test the distributed sub-query.

That said, with the scales you are talking about, I question if sharding is necessary. You can still use replication for load balancing without sharding.

-----Original Message-----
From: amid [mailto:amid@donanza.com] 
Sent: Thursday, June 11, 2015 12:36 PM
To: solr-user@lucene.apache.org
Subject: RE: The best way to exclude "seen" results from search queries

Thanks a lot, Charles,

This seems to be what I'm looking for.
Do you know if a join over this number of documents & users will still have good query performance? Also, are there any limitations on the Solr architecture once we use the "join" method (e.g. sharding)?

Many thanks,
Ami

RE: The best way to exclude "seen" results from search queries

Posted by amid <am...@donanza.com>.

Thanks a lot, Charles,

This seems to be what I'm looking for.
Do you know if a join over this number of documents & users will still have
good query performance? Also, are there any limitations on the Solr
architecture once we use the "join" method (e.g. sharding)?

Many thanks,
Ami


RE: The best way to exclude "seen" results from search queries

Posted by "Reitzel, Charles" <Ch...@tiaa-cref.org>.

I don't see any way around storing which recommendations have been delivered to each user.  Sounds like a separate collection with the unique ID created from the combination of the user ID and the recommendation ID (with the IDs also available as separate, searchable and returnable fields).

You could then use a so-called "join" query to exclude any recommendations in the other collection.
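
For illustration, here is a hedged Python sketch of the request parameters such an exclusion might use. The {!join} parser and its from, to and fromIndex local params are standard Solr; the collection name (seen_recs) and field names (rec_id, user_id, id) are hypothetical, and the use of a purely negative fq with v=$seenq parameter dereferencing is an assumption worth verifying against your Solr version.

```python
from urllib.parse import urlencode

def exclude_seen_params(user_query, user_id,
                        seen_collection="seen_recs",
                        from_field="rec_id", to_field="id"):
    """Build Solr params that filter out docs already delivered to user_id."""
    # Negated join: match seen_collection docs for this user, map their
    # from_field values onto to_field in the main index, then exclude them.
    join = "-{!join from=%s to=%s fromIndex=%s v=$seenq}" % (
        from_field, to_field, seen_collection)
    return [
        ("q", user_query),
        ("fq", join),
        # The inner query is passed separately and referenced via v=$seenq.
        # user_id is assumed to need no query-syntax escaping here.
        ("seenq", "user_id:%s" % user_id),
    ]

params = exclude_seen_params("title:solr", 42)
query_string = urlencode(params)  # ready to append to a /select URL
```

Keeping the inner query in its own parameter avoids quoting problems if it later grows extra clauses, such as a date range.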

-----Original Message-----
From: amid [mailto:amid@donanza.com] 
Sent: Wednesday, June 10, 2015 1:27 PM
To: solr-user@lucene.apache.org
Subject: The best way to exclude "seen" results from search queries

Hi,

We have a solr index with ~1M documents.
We want to let our users filter results out of their queries, meaning those results will not be shown again for any query by that user (we currently have 10K users).

Think of a scenario like a "recommendation engine" in which you don't want to give the same recommendation to a user more than once.

What is the best way to implement this feature (Performance & Memory)?

Thanks,
Ami

Re: The best way to exclude "seen" results from search queries

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Start by negating the filter and bypassing the caches with the terms
query parser:
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermsQueryParser
e.g.
fq=-{!terms f=p_id cache=false}1,3,5,already,seen
Note:
Elastic can even store such filters via
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-filter.html#_terms_lookup_mechanism
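
A small Python sketch of how a client might assemble the fq shown above from
a list of already-seen IDs. The p_id field comes from the example; where the
ID list is stored and the helper name are assumptions.

```python
from urllib.parse import urlencode

def seen_terms_filter(seen_ids, field="p_id"):
    """Build a negated, uncached {!terms} filter from a list of IDs."""
    if not seen_ids:
        return None  # nothing to exclude, so no filter is needed
    # The terms parser takes a comma-separated list by default, so the
    # IDs themselves must not contain commas.
    joined = ",".join(str(i) for i in seen_ids)
    return "-{!terms f=%s cache=false}%s" % (field, joined)

params = [("q", "*:*")]
fq = seen_terms_filter([1, 3, 5])
if fq:
    params.append(("fq", fq))
query_string = urlencode(params)  # ready to append to a /select URL
```

The whole ID list travels with every request, so this fits best when each
user's seen list stays reasonably short.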


On Wed, Jun 10, 2015 at 8:26 PM, amid <am...@donanza.com> wrote:

> Hi,
>
> We have a solr index with ~1M documents.
> We want to let our users filter results out of their queries, meaning
> those results will not be shown again for any query by that user (we
> currently have 10K users).
>
> Think of a scenario like a "recommendation engine" in which you don't
> want to give the same recommendation to a user more than once.
>
> What is the best way to implement this feature (Performance & Memory)?
>
> Thanks,
> Ami



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>