Posted to solr-user@lucene.apache.org by dbenjamin <be...@gmail.com> on 2012/03/31 15:57:45 UTC

Content privacy, search & index

Hi,

I'm relatively new to Solr. I have used it several times, but always in a
very simple way: plain full-text search with faceting and filtering.

Today I have to go a bit further, and before I do, I'd like to get your point
of view ;-)

I need to index users and user content that are subject to privacy levels,
for instance:

* Anyone
* Only me
* Only my friends
* Only people I choose

...really classic.

So when a user searches for content on the website, the results must not
include content that he is not allowed to see.

My first thought was: "There might be a way to do that with complex Solr
queries."

So I started reading the documentation, and I have to say that I understood
about half of what I read :-)

And then a new idea came to mind. I was thinking about this process:

1- The user submits the search form with his keywords
2- I prepare a classic fulltext search query
3- I compute, in some way, the friend list of the current user
4- I add a filter to the Solr query, built from that friend list
5- I send the query
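
For illustration, steps 3 and 4 could be sketched roughly as below. This is only a minimal sketch under assumptions: the class name PrivacyFilter and the field names privacy_level and owner_id are hypothetical and would have to match the real schema, the friend IDs are assumed to be numeric (so no escaping is done), and the "Only people I choose" level is left out. The resulting string would be sent as an fq parameter next to the normal full-text q.

```java
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

/** Builds a Solr filter query (fq) string from a user's cached friend list. */
final class PrivacyFilter {

    // Hypothetical field names; adapt them to the real schema.
    static final String PRIVACY_FIELD = "privacy_level";
    static final String OWNER_FIELD = "owner_id";

    static String buildFilterQuery(long currentUserId, Collection<Long> friendIds) {
        StringBuilder fq = new StringBuilder();
        // Public content is visible to everybody.
        fq.append(PRIVACY_FIELD).append(":anyone");
        // The user's own content is always visible to him.
        fq.append(" OR ").append(OWNER_FIELD).append(':').append(currentUserId);
        // Friends-only content is visible when the owner is one of the friends.
        if (!friendIds.isEmpty()) {
            String ids = friendIds.stream()
                    .map(String::valueOf)
                    .collect(Collectors.joining(" OR "));
            fq.append(" OR (").append(PRIVACY_FIELD).append(":friends AND ")
              .append(OWNER_FIELD).append(":(").append(ids).append("))");
        }
        return fq.toString();
    }

    public static void main(String[] args) {
        // The friend list would normally come from the application cache.
        System.out.println(buildFilterQuery(42L, List.of(7L, 9L)));
    }
}
```

With user 42 and friends 7 and 9, this produces privacy_level:anyone OR owner_id:42 OR (privacy_level:friends AND owner_id:(7 OR 9)). The clause list grows with the number of friends, so a very long friend list may run into Solr's limit on boolean clauses.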

This seems reasonable, since I can add a cache to avoid computing the friend
list each time, but I don't know why, it doesn't feel right ;-)

The other way would be to index users and their friends and somehow let
Solr do all the work.

What do you think? Is the second solution even possible?


Thanks !
Br,

Benjamin.

--
View this message in context: http://lucene.472066.n3.nabble.com/Content-privacy-search-index-tp3873462p3873462.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Content privacy, search & index

Posted by Paul Libbrecht <pa...@hoplahup.net>.
Hello Benjamin,

On 1 Apr 2012, at 11:48, dbenjamin wrote:
> You lost me :-)
> You mean implementing a specific RequestHandler just for my needs ?

I think a custom search component is enough; it would extend QueryComponent.
Its prepare method reads all the params and calls the ResponseBuilder's setQuery with the redefined query.

> Also, when you say "It'd transform a query for "a b"

this is an example query from the client.
If you run the QueryParser on it, you get a BooleanQuery whose clauses are a TermQuery for a (in the "default field") and a TermQuery for b (in the "default field"). This is done for you if you call super.prepare and then fetch the query: it is probably already a BooleanQuery; if not, you wrap it in one.

> into "+(a b) +(authorizedBit)"", that's not so clear to me. Do you mind explaining this
> like I was a six-year-old? ;-) (even if I think that's just a matter of
> syntax...)

you'd do something such as the following:

// Assemble a BooleanQuery bq2 with all the necessary bits
// (e.g. the TermQueries that say owner:<userName>).

BooleanQuery bq = new BooleanQuery();
BooleanQuery bq1 = new BooleanQuery();
// Add the TermQueries for a and b to bq1.
bq.add(bq1, BooleanClause.Occur.MUST); // that's the +
bq.add(bq2, BooleanClause.Occur.MUST); // and another +
// Assemble bq3 to "prefer" particular things, e.g. things from users in my group.
bq.add(bq3, BooleanClause.Occur.SHOULD); // no +: only affects the score, is not required

That's the way I implement query-expansion.
I'm afraid I do not know a place where this is documented.

paul



Re: Content privacy, search & index

Posted by dbenjamin <be...@gmail.com>.
Hi Paul,

You lost me :-)

Do you mean implementing a specific RequestHandler just for my needs?

Also, when you say "It'd transform a query for "a b" into "+(a b)
+(authorizedBit)"", that's not so clear to me. Do you mind explaining this
like I was a six-year-old? ;-) (even if I think that's just a matter of
syntax...)

Indeed, the friend list will obviously be cached.

Thanks.

Br,
Benjamin.


Re: Content privacy, search & index

Posted by Paul Libbrecht <pa...@hoplahup.net>.
Benjamin,

I think implementing a QueryHandler that adds the necessary query is the right way to do that.
It would transform a query for "a b" into "+(a b) +(authorizedBit)". That is written in the syntax of the default QueryParser, but please do not build it by string manipulation; use the real query objects!

Recalculating the friend list... well, that all depends on your authorization system. It should be cached somewhere, in a session or so; ideally you would also cache, close to it, the queries that you add. Then performance is likely to be OK.

paul




RE: Content privacy, search & index

Posted by dbenjamin <be...@gmail.com>.
Hi spring,

Solution 1 is what I had in mind.

So I can't do the whole thing directly in Solr? (Except maybe by
implementing a new RequestHandler, as Paul suggested.)

Concerning the auto-complete of friends in the search box: you wouldn't use
Solr's auto-complete feature then, would you? The friend list would not be
indexed in Solr but retrieved from the application cache. (If you don't have
an answer to that, it's OK: that's really secondary, the privacy level being
my primary concern.)

Concerning the size of the data, we have to consider that it could grow
exponentially.
The hypothesis is: 300K users, each with an average of 100 friends and 200
documents.


Br,
Benjamin.



RE: Content privacy, search & index

Posted by sp...@gmx.eu.
> - Is it the best way to do that?
> - It's obvious that I need to index the registered users in Solr (because a
> user can search for others), but is it clever to index the friend list for
> each user as well? (If we take a look at the search box on Facebook, or any
> other sexy social network, they propose auto-complete for the current
> user's friends, so maybe it makes sense...)

This is a common question:

How to merge the result list from Solr (A) with a result list from elsewhere
(B) (often an RDBMS, as in your case).

3 options:

1) do the merge in A:

* fetch the ids from B and do the merge in A (e.g. a filter query in Solr; be
aware of maxBooleanClauses).

2) do the merge in B:

* fetch the ids from A and do the merge in B (e.g. a subselect; this also has
limits when the number of ids is large).

3) do the merge in the application (C):

* fetch the ids from A and B and intersect them in C

Depending on the size of the result sets, one of the 3 options is the best ;)
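
As a rough illustration of option 3, the intersection in the application could look like the sketch below. The class name ResultMerger and the use of string ids are assumptions made for the example; how the ids are actually fetched from Solr and from the database is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Intersects the ids returned by Solr (A) with the ids authorized by the database (B). */
final class ResultMerger {

    /** Keeps the Solr hits, in their ranking order, whose id is also authorized. */
    static List<String> intersect(List<String> solrIds, Set<String> authorizedIds) {
        List<String> visible = new ArrayList<>();
        for (String id : solrIds) {
            if (authorizedIds.contains(id)) {
                visible.add(id);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<String> fromSolr = List.of("d3", "d1", "d7"); // ranked hits from A
        Set<String> allowed = Set.of("d1", "d3");          // authorized ids from B
        System.out.println(intersect(fromSolr, allowed));  // d7 is dropped
    }
}
```

Iterating over the Solr list keeps the relevance order. Note that with this option the hit count and any facets computed by Solr still include the documents that were dropped afterwards.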



Re: Content privacy, search & index

Posted by dbenjamin <be...@gmail.com>.
Hi Erick and thanks for the quick reply.

Well, my intent is not to index every content element with its own list of
authorized user IDs.

I was thinking more of indexing the content and the users + their friend
lists separately, and then somehow asking Solr to filter the results of one
index depending on the other. In SQL it would be done with a sub-query, for
instance:

SELECT a.* FROM albums a
  LEFT JOIN albums_settings s ON s.album_id = a.id
  WHERE s.privacy_level = 'anyone'
    OR
    (s.privacy_level = 'friends'
       AND a.owner_id IN (
         SELECT friend_id FROM friendships
         WHERE user_id = :currentUserId
       )
    )
;

It would require some tests, but this query should return only what we
want.

By "complex query" I was assuming that Solr would be able to perform that
kind of operation.

But if it can, that raises some questions:

- Is it the best way to do that?
- It's obvious that I need to index the registered users in Solr (because a
user can search for others), but is it clever to index the friend list for
each user as well? (If we take a look at the search box on Facebook, or any
other sexy social network, they propose auto-complete for the current
user's friends, so maybe it makes sense...)


Br,
Benjamin.


Re: Content privacy, search & index

Posted by Erick Erickson <er...@gmail.com>.
The second option is actually possible, and actually easiest in
terms of letting Solr do the most work. Presumably you have
some web-app facing the user, you could pre-calculate the
list of authorized viewers there on some kind of session basis.

Be careful, however, that this list of IDs doesn't grow unbounded.
Anything up to several hundred is probably OK, but thousands gets
to be problematical.

The other thing to consider here is whether the list of authorized users
changes much. The downside is that you have to index the authorized
users with the document. So if I have 10,000 docs and add someone
to my friend list, you have to re-index all 10,000 docs.

Complex Solr queries... don't quite know what that would look like.
My rule of thumb is that when I try to make Solr behave like a DB
through queries...it's probably wrong.

There is another option, the "no cache" filter queries, see:
https://issues.apache.org/jira/browse/SOLR-2429
This was actually designed for ACL checking. You say
you're relatively new to Solr, so making sense of this may
be a bit daunting. The basic idea is that you can
implement a PostFilter where each matching document,
after all other selection criteria have been met, is sent
to your code which can then give a go-no go answer. This
keeps all the facets etc. accurate.
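
To make the go/no-go idea concrete, here is a minimal sketch of the per-document decision only. It is not the Solr PostFilter API: the class AclCheck, the Privacy enum and the parameters are hypothetical names for the example, and a real implementation would have to look up the owner and privacy level of each matching document itself.

```java
import java.util.Set;

/** Per-document "go / no-go" decision for the privacy levels discussed in this thread. */
final class AclCheck {

    enum Privacy { ANYONE, ONLY_ME, FRIENDS, CHOSEN }

    /**
     * @param viewerId      the user running the search
     * @param ownerId       the owner of the matching document
     * @param level         the document's privacy level
     * @param ownerFriends  ids of the owner's friends (assumed to be cached)
     * @param chosenViewers ids explicitly allowed by the owner
     */
    static boolean canSee(long viewerId, long ownerId, Privacy level,
                          Set<Long> ownerFriends, Set<Long> chosenViewers) {
        if (viewerId == ownerId) {
            return true; // owners always see their own content
        }
        switch (level) {
            case ANYONE:
                return true;
            case FRIENDS:
                return ownerFriends.contains(viewerId);
            case CHOSEN:
                return chosenViewers.contains(viewerId);
            default:
                return false; // ONLY_ME
        }
    }
}
```

Because the check would run once per matching document, the lookups it relies on need to be cheap, for example in-memory sets as assumed here.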

Best
Erick
