Posted to solr-user@lucene.apache.org by Philip Durbin <ph...@harvard.edu> on 2014/11/18 21:47:34 UTC

Solr JOIN: keeping permission data out of primary documents

Solr JOINs are a way to enforce simple document security, as explained
by Yonik Seeley at
http://lucene.472066.n3.nabble.com/document-level-security-filter-solution-for-Solr-tp4126992p4126994.html

I'm trying to tweak this pattern so that I don't have to keep the
security information in each of my primary Solr documents.

I just posted the gist at
https://gist.github.com/pdurbin/4d27fea7b431ef3bf4f9 as an example of
my working Solr JOIN based on data in `before.json`. Permissions per
user are embedded in the primary documents like this:

    {
        "id": "dataset_3",
        "perms_ss": [
            "alice",
            "bob"
        ]
    },
    {
        "id": "dataset_4",
        "perms_ss": [
            "alice",
            "bob",
            "public"
        ]
    },

User documents have been created to do the JOIN on:

    {
        "id": "alice",
        "groups_s": "alice"
    },

The JOIN looks like this:

{!join+from=groups_s+to=perms_ss}id:public+OR+{!join+from=groups_s+to=perms_ss}id:alice
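
For readers unfamiliar with the join syntax, here is a minimal Python sketch of what `{!join from=groups_s to=perms_ss}id:alice` selects. It is a toy model of the join semantics, not Solr code. The helper names (`as_list`, `join`, `visible_to`) are hypothetical, and the "public" and "charlie" user documents are assumed to have the same shape as the "alice" document shown above.

```python
# Toy model of Solr's {!join from=F to=T}Q semantics (illustration only).

def as_list(value):
    """Treat a missing field as empty and a single value as a one-item list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def join(docs, from_field, to_field, matches):
    """Collect from_field values of docs matching `matches`, then return
    every doc whose to_field contains at least one of those values."""
    keys = {v for d in docs if matches(d) for v in as_list(d.get(from_field))}
    return [d for d in docs if keys & set(as_list(d.get(to_field)))]


docs = [
    {"id": "dataset_3", "perms_ss": ["alice", "bob"]},
    {"id": "dataset_4", "perms_ss": ["alice", "bob", "public"]},
    # User documents; "public" and "charlie" are assumed to mirror "alice".
    {"id": "public", "groups_s": "public"},
    {"id": "alice", "groups_s": "alice"},
    {"id": "charlie", "groups_s": "charlie"},
]


def visible_to(user):
    """IDs selected by the two OR'ed joins: one for "public", one for the user."""
    public = join(docs, "groups_s", "perms_ss", lambda d: d["id"] == "public")
    own = join(docs, "groups_s", "perms_ss", lambda d: d["id"] == user)
    return sorted({d["id"] for d in public + own})


print(visible_to("alice"))    # ['dataset_3', 'dataset_4']
print(visible_to("charlie"))  # ['dataset_4']
```

The user documents carry no `perms_ss` field, so they never appear in the result; only datasets do.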

Because indexing the primary documents (datasets) takes a while, I'm
interested in exploring the idea of introducing a third type of
document that contains the permission information. `after.json` is an
example, with documents that look like this:

    {
        "id": "dataset_3"
    },
    {
        "id": "dataset_4"
    },
    {
        "id": "public",
        "groups_s": "public"
    },
    {
        "id": "alice",
        "groups_s": "alice"
    },
    {
        "id": "bob",
        "groups_s": "bob"
    },
    {
        "id": "charlie",
        "groups_s": "charlie"
    },
    {
        "id": "dataset_1_perms",
        "definition_point_s": "dataset_1",
        "role_assignee_ss": [
            "alice"
        ]
    },
    {
        "id": "dataset_2_perms",
        "definition_point_s": "dataset_2",
        "role_assignee_ss": [
            "bob"
        ]
    },

The question is whether it's possible to construct a Solr JOIN such that
the same permissions are enforced and the same documents are returned
per user. The gist contains expected output and test runners for
anyone who can figure out the syntax of the JOIN. The idea is that
silence is golden and no output means the tests passed:

murphy:4d27fea7b431ef3bf4f9 pdurbin$ ./delete
{"responseHeader":{"status":0,"QTime":8}}
murphy:4d27fea7b431ef3bf4f9 pdurbin$ ./load.before
{"responseHeader":{"status":0,"QTime":12}}
murphy:4d27fea7b431ef3bf4f9 pdurbin$ ./test.before.all
murphy:4d27fea7b431ef3bf4f9 pdurbin$

What do people think? Can anyone load up `after.json`, update the
FIXMEs, and get `test.after.all` to work? Thanks in advance!

And thanks again for the original JOIN tip, Yonik!

Phil

-- 
Philip Durbin
Software Developer for http://dataverse.org
http://www.iq.harvard.edu/people/philip-durbin

Re: Solr JOIN: keeping permission data out of primary documents

Posted by Philip Durbin <ph...@harvard.edu>.
On Wed, Nov 19, 2014 at 11:56 AM, Yonik Seeley <yo...@heliosearch.com> wrote:
> On Wed, Nov 19, 2014 at 9:22 AM, Philip Durbin
> <ph...@harvard.edu> wrote:
>> On Wed, Nov 19, 2014 at 5:45 AM, Yonik Seeley <yo...@heliosearch.com> wrote:
>>> On Tue, Nov 18, 2014 at 3:47 PM, Philip Durbin
>>> <ph...@harvard.edu> wrote:
>>>> Solr JOINs are a way to enforce simple document security, as explained
>>>> by Yonik Seeley at
>>>> http://lucene.472066.n3.nabble.com/document-level-security-filter-solution-for-Solr-tp4126992p4126994.html
>>>>
>>>> I'm trying to tweak this pattern so that I don't have to keep the
>>>> security information in each of my primary Solr documents.
>>>>
>>>> I just posted the gist at
>>>> https://gist.github.com/pdurbin/4d27fea7b431ef3bf4f9 as an example of
>>>> my working Solr JOIN based on data in `before.json`. Permissions per
>>>> user are embedded in the primary documents like this:
>>>>
>>>>     {
>>>>         "id": "dataset_3",
>>>>         "perms_ss": [
>>>>             "alice",
>>>>             "bob"
>>>>         ]
>>>>     },
>>>>     {
>>>>         "id": "dataset_4",
>>>>         "perms_ss": [
>>>>             "alice",
>>>>             "bob",
>>>>             "public"
>>>>         ]
>>>>     },
>>>>
>>>> User documents have been created to do the JOIN on:
>>>>
>>>>     {
>>>>         "id": "alice",
>>>>         "groups_s": "alice"
>>>>     },
>>>>
>>>> The JOIN looks like this:
>>>>
>>>> {!join+from=groups_s+to=perms_ss}id:public+OR+{!join+from=groups_s+to=perms_ss}id:alice
>>>
>>> It would probably be faster written as a single join:
>>> fq={!join+from=groups_s+to=perms_ss}id:(public alice)
>>
>> Hmm, I can't get the single JOIN to work on the "before" example
>> (perms embedded in each primary doc) in the gist I posted so I guess
>> I'll live with the slower version with "OR".
>>
>>> Or, if you're using Heliosearch you could cache the filters separately
>>> for better hit rates on commonly used perms via the "filter" keyword:
>>> fq=filter({!join+from=groups_s+to=perms_ss}id:public) OR
>>> filter({!join+from=groups_s+to=perms_ss}id:alice)
>>
>> Getting back to my original question about keeping permission
>> information out of my primary documents, I noticed that
>> http://heliosearch.org describes the Pseudo-Join feature as "selects a
>> set of documents based on their relationship to a **second** set of
>> documents" (emphasis mine) so I assume I can't take the perms out of
>> my primary Solr documents and put them in a **third** set of
>> "permission assignments" documents with definition points and role
>> assignees: https://gist.github.com/pdurbin/4d27fea7b431ef3bf4f9#file-after-json
>> . That is, the three sets of documents would be:
>>
>> 1. primary (datasets, with no permission info)
>> 2. users
>> 3. permission assignments
>
> You should be able to chain joins to follow any number of links.
> I don't quite understand how you mean to use your schema... but something like
>
> fq={!join from=definition_point_s to=id}role_assignee_ss:alice
>
> That's only following a single link and ignoring the groups_s field, so
> I'm probably missing something.

No, no, this is PERFECT! I think...

Again my goal is to get away from putting the permissions in the
primary documents.

In the "before" example, I put the permissions in the primary
documents. Then I JOIN on those documents using a secondary set of
"group" documents: the "public" group, the "alice" group, the "bob"
group, etc.

As of the commit below, using your suggestion, in the "after" example
I've taken the permissions out of the primary documents. Instead the
permissions go into a set of "permission assignments" documents. This
means that when permissions change, rather than re-indexing my primary
documents (which is a somewhat expensive operation with many database
calls), I think I'll be able to reindex only the "permission
assignments" documents. As you noted, the "group" documents aren't
being used, so I deleted them.
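
A minimal Python sketch of the "after" layout may make the benefit concrete. It models the join `{!join from=definition_point_s to=id}role_assignee_ss:(public <user>)` in plain Python rather than calling Solr. The helper names are hypothetical, the `dataset_1` and `dataset_2` primary documents are assumed to exist, and only the two permission documents quoted in this thread are included, so the results are not the gist's expected output.

```python
# Toy model of the "after" layout: permissions live in separate documents.

def as_list(value):
    """Treat a missing field as empty and a single value as a one-item list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def join(docs, from_field, to_field, matches):
    """Collect from_field values of docs matching `matches`, then return
    every doc whose to_field contains at least one of those values."""
    keys = {v for d in docs if matches(d) for v in as_list(d.get(from_field))}
    return [d for d in docs if keys & set(as_list(d.get(to_field)))]


docs = [
    # Primary documents with no permission info (dataset_1/2 are assumed).
    {"id": "dataset_1"},
    {"id": "dataset_2"},
    # Permission assignment documents, as quoted from after.json.
    {"id": "dataset_1_perms", "definition_point_s": "dataset_1",
     "role_assignee_ss": ["alice"]},
    {"id": "dataset_2_perms", "definition_point_s": "dataset_2",
     "role_assignee_ss": ["bob"]},
]


def visible_to(user):
    """Model role_assignee_ss:(public <user>) joined from definition_point_s to id."""
    wanted = {"public", user}
    hits = join(docs, "definition_point_s", "id",
                lambda d: wanted & set(as_list(d.get("role_assignee_ss"))))
    return sorted(d["id"] for d in hits)


print(visible_to("alice"))    # ['dataset_1']
print(visible_to("charlie"))  # []

# A permission change touches only the small permission document.
perms = next(d for d in docs if d["id"] == "dataset_2_perms")
perms["role_assignee_ss"].append("charlie")
print(visible_to("charlie"))  # ['dataset_2']
```

The last three lines are the point of the exercise: granting access changes one permission document and leaves the primary documents alone.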

I'm going to play around with this in our actual code. Thanks, Yonik!

Phil

p.s. You were right about the single JOIN as well, so that's in the
commit too (looking for both the "alice" group and the "public" group
at the same time). In my haste I forgot that when testing this stuff
with curl I need to replace spaces with the plus (+) sign.
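
As a sketch of that encoding step, Python's standard `urllib.parse.quote_plus` can produce the same string used in the test scripts. The `safe` argument below is a choice made here to leave the braces and other punctuation literal, which matches the `curl --globoff` style in the gist; it is not something the thread prescribes.

```python
from urllib.parse import quote_plus

# The filter query as a person would type it, with real spaces.
fq = "{!join from=definition_point_s to=id}role_assignee_ss:(public alice)"

# Spaces become "+"; the characters listed in `safe` are left untouched so
# the result looks like the hand-written curl URLs (used with --globoff).
curl_style = quote_plus(fq, safe="{}!=:()")
print(curl_style)
# {!join+from=definition_point_s+to=id}role_assignee_ss:(public+alice)

# With the default safe="" every reserved character is percent-encoded,
# which is also a valid way to send the same parameter.
fully_encoded = quote_plus(fq)
print(fully_encoded)
```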

p.p.s. I can't seem to figure out how to link to a specific diff in a
gist but what you see below is the third revision. This one:
https://gist.github.com/pdurbin/4d27fea7b431ef3bf4f9/0c0a9120299e3b0c112dc1687b89de83598fcb02

murphy:4d27fea7b431ef3bf4f9 pdurbin$ git show 0c0a912 | cat
commit 0c0a9120299e3b0c112dc1687b89de83598fcb02
Author: Philip Durbin <ph...@gmail.com>
Date:   Wed Nov 19 12:48:00 2014 -0500

    A solution from Yonik Seeley! Permissions are gone from primary docs

    Details at http://lucene.472066.n3.nabble.com/Solr-JOIN-keeping-permission-data-out-of-primary-documents-tp4169739p4169934.html

diff --git a/after.json b/after.json
index dd817e5..c2516d9 100644
--- a/after.json
+++ b/after.json
@@ -12,22 +12,6 @@
         "id": "dataset_4"
     },
     {
-        "id": "public",
-        "groups_s": "public"
-    },
-    {
-        "id": "alice",
-        "groups_s": "alice"
-    },
-    {
-        "id": "bob",
-        "groups_s": "bob"
-    },
-    {
-        "id": "charlie",
-        "groups_s": "charlie"
-    },
-    {
         "id": "dataset_1_perms",
         "definition_point_s": "dataset_1",
         "role_assignee_ss": [
diff --git a/test.after.alice b/test.after.alice
index 4fbc13f..a6ceb16 100755
--- a/test.after.alice
+++ b/test.after.alice
@@ -1,2 +1,2 @@
 #!/bin/bash
-diff <(curl -s --globoff 'http://localhost:8983/solr/collection1/select?rows=100&wt=json&indent=true&q=*%3A*&fq=({!join+FIXME)' | jq '.response.docs[] | {id}') alice.expected
+diff <(curl -s --globoff 'http://localhost:8983/solr/collection1/select?rows=100&wt=json&indent=true&q=*%3A*&fq={!join+from=definition_point_s+to=id}role_assignee_ss:(public+alice)' | jq '.response.docs[] | {id}') alice.expected
diff --git a/test.after.bob b/test.after.bob
index 0e834e0..9ae57e7 100755
--- a/test.after.bob
+++ b/test.after.bob
@@ -1,2 +1,2 @@
 #!/bin/bash
-diff <(curl -s --globoff 'http://localhost:8983/solr/collection1/select?rows=100&wt=json&indent=true&q=*%3A*&fq=({!join+FIXME)' | jq '.response.docs[] | {id}') bob.expected
+diff <(curl -s --globoff 'http://localhost:8983/solr/collection1/select?rows=100&wt=json&indent=true&q=*%3A*&fq={!join+from=definition_point_s+to=id}role_assignee_ss:(public+bob)' | jq '.response.docs[] | {id}') bob.expected
diff --git a/test.after.charlie b/test.after.charlie
index 89176ad..1527c3f 100755
--- a/test.after.charlie
+++ b/test.after.charlie
@@ -1,2 +1,2 @@
 #!/bin/bash
-diff <(curl -s --globoff 'http://localhost:8983/solr/collection1/select?rows=100&wt=json&indent=true&q=*%3A*&fq=({!join+FIXME)' | jq '.response.docs[] | {id}') charlie.expected
+diff <(curl -s --globoff 'http://localhost:8983/solr/collection1/select?rows=100&wt=json&indent=true&q=*%3A*&fq={!join+from=definition_point_s+to=id}role_assignee_ss:(public+charlie)' | jq '.response.docs[] | {id}') charlie.expected
murphy:4d27fea7b431ef3bf4f9 pdurbin$



Re: Solr JOIN: keeping permission data out of primary documents

Posted by Yonik Seeley <yo...@heliosearch.com>.
On Wed, Nov 19, 2014 at 9:22 AM, Philip Durbin
<ph...@harvard.edu> wrote:
> On Wed, Nov 19, 2014 at 5:45 AM, Yonik Seeley <yo...@heliosearch.com> wrote:
>> On Tue, Nov 18, 2014 at 3:47 PM, Philip Durbin
>> <ph...@harvard.edu> wrote:
>>> Solr JOINs are a way to enforce simple document security, as explained
>>> by Yonik Seeley at
>>> http://lucene.472066.n3.nabble.com/document-level-security-filter-solution-for-Solr-tp4126992p4126994.html
>>>
>>> I'm trying to tweak this pattern so that I don't have to keep the
>>> security information in each of my primary Solr documents.
>>>
>>> I just posted the gist at
>>> https://gist.github.com/pdurbin/4d27fea7b431ef3bf4f9 as an example of
>>> my working Solr JOIN based on data in `before.json`. Permissions per
>>> user are embedded in the primary documents like this:
>>>
>>>     {
>>>         "id": "dataset_3",
>>>         "perms_ss": [
>>>             "alice",
>>>             "bob"
>>>         ]
>>>     },
>>>     {
>>>         "id": "dataset_4",
>>>         "perms_ss": [
>>>             "alice",
>>>             "bob",
>>>             "public"
>>>         ]
>>>     },
>>>
>>> User documents have been created to do the JOIN on:
>>>
>>>     {
>>>         "id": "alice",
>>>         "groups_s": "alice"
>>>     },
>>>
>>> The JOIN looks like this:
>>>
>>> {!join+from=groups_s+to=perms_ss}id:public+OR+{!join+from=groups_s+to=perms_ss}id:alice
>>
>> It would probably be faster written as a single join:
>> fq={!join+from=groups_s+to=perms_ss}id:(public alice)
>
> Hmm, I can't get the single JOIN to work on the "before" example
> (perms embedded in each primary doc) in the gist I posted so I guess
> I'll live with the slower version with "OR".
>
>> Or, if you're using Heliosearch you could cache the filters separately
>> for better hit rates on commonly used perms via the "filter" keyword:
>> fq=filter({!join+from=groups_s+to=perms_ss}id:public) OR
>> filter({!join+from=groups_s+to=perms_ss}id:alice)
>
> Getting back to my original question about keeping permission
> information out of my primary documents, I noticed that
> http://heliosearch.org describes the Pseudo-Join feature as "selects a
> set of documents based on their relationship to a **second** set of
> documents" (emphasis mine) so I assume I can't take the perms out of
> my primary Solr documents and put them in a **third** set of
> "permission assignments" documents with definition points and role
> assignees: https://gist.github.com/pdurbin/4d27fea7b431ef3bf4f9#file-after-json
> . That is, the three sets of documents would be:
>
> 1. primary (datasets, with no permission info)
> 2. users
> 3. permission assignments

You should be able to chain joins to follow any number of links.
I don't quite understand how you mean to use your schema... but something like

fq={!join from=definition_point_s to=id}role_assignee_ss:alice

That's only following a single link and ignoring the groups_s field, so
I'm probably missing something.
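
To illustrate what chaining joins to follow two links could look like, here is a minimal Python model, not Solr code. It first hops from a user document to the permission documents (`groups_s` to `role_assignee_ss`), then from those to the datasets (`definition_point_s` to `id`). The helper names are hypothetical and the `dataset_1` and `dataset_2` primary documents are assumed. In Solr this would presumably be written by nesting one `{!join ...}` inside another; that form is an assumption and is not shown or tested anywhere in this thread.

```python
# Toy model of following two join links in sequence (illustration only).

def as_list(value):
    """Treat a missing field as empty and a single value as a one-item list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def join(docs, from_field, to_field, matched):
    """Collect from_field values of the already-matched docs, then return
    every doc whose to_field contains at least one of those values."""
    keys = {v for d in matched for v in as_list(d.get(from_field))}
    return [d for d in docs if keys & set(as_list(d.get(to_field)))]


docs = [
    {"id": "dataset_1"},  # assumed primary documents
    {"id": "dataset_2"},
    {"id": "alice", "groups_s": "alice"},
    {"id": "bob", "groups_s": "bob"},
    {"id": "dataset_1_perms", "definition_point_s": "dataset_1",
     "role_assignee_ss": ["alice"]},
    {"id": "dataset_2_perms", "definition_point_s": "dataset_2",
     "role_assignee_ss": ["bob"]},
]

# Innermost query: id:alice
users = [d for d in docs if d["id"] == "alice"]
# First link: user documents -> permission assignment documents
perm_docs = join(docs, "groups_s", "role_assignee_ss", users)
# Second link: permission assignment documents -> datasets
datasets = join(docs, "definition_point_s", "id", perm_docs)

print([d["id"] for d in datasets])  # ['dataset_1']
```

When the user's name is already known at query time, the first hop adds nothing, which is why the single-link query above is enough for this schema.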

-Yonik

Re: Solr JOIN: keeping permission data out of primary documents

Posted by Philip Durbin <ph...@harvard.edu>.
On Wed, Nov 19, 2014 at 5:45 AM, Yonik Seeley <yo...@heliosearch.com> wrote:
> On Tue, Nov 18, 2014 at 3:47 PM, Philip Durbin
> <ph...@harvard.edu> wrote:
>> Solr JOINs are a way to enforce simple document security, as explained
>> by Yonik Seeley at
>> http://lucene.472066.n3.nabble.com/document-level-security-filter-solution-for-Solr-tp4126992p4126994.html
>>
>> I'm trying to tweak this pattern so that I don't have to keep the
>> security information in each of my primary Solr documents.
>>
>> I just posted the gist at
>> https://gist.github.com/pdurbin/4d27fea7b431ef3bf4f9 as an example of
>> my working Solr JOIN based on data in `before.json`. Permissions per
>> user are embedded in the primary documents like this:
>>
>>     {
>>         "id": "dataset_3",
>>         "perms_ss": [
>>             "alice",
>>             "bob"
>>         ]
>>     },
>>     {
>>         "id": "dataset_4",
>>         "perms_ss": [
>>             "alice",
>>             "bob",
>>             "public"
>>         ]
>>     },
>>
>> User documents have been created to do the JOIN on:
>>
>>     {
>>         "id": "alice",
>>         "groups_s": "alice"
>>     },
>>
>> The JOIN looks like this:
>>
>> {!join+from=groups_s+to=perms_ss}id:public+OR+{!join+from=groups_s+to=perms_ss}id:alice
>
> It would probably be faster written as a single join:
> fq={!join+from=groups_s+to=perms_ss}id:(public alice)

Hmm, I can't get the single JOIN to work on the "before" example
(perms embedded in each primary doc) in the gist I posted so I guess
I'll live with the slower version with "OR".

> Or, if you're using Heliosearch you could cache the filters separately
> for better hit rates on commonly used perms via the "filter" keyword:
> fq=filter({!join+from=groups_s+to=perms_ss}id:public) OR
> filter({!join+from=groups_s+to=perms_ss}id:alice)

Getting back to my original question about keeping permission
information out of my primary documents, I noticed that
http://heliosearch.org describes the Pseudo-Join feature as "selects a
set of documents based on their relationship to a **second** set of
documents" (emphasis mine) so I assume I can't take the perms out of
my primary Solr documents and put them in a **third** set of
"permission assignments" documents with definition points and role
assignees: https://gist.github.com/pdurbin/4d27fea7b431ef3bf4f9#file-after-json
. That is, the three sets of documents would be:

1. primary (datasets, with no permission info)
2. users
3. permission assignments

So, I guess I'll continue to embed permissions into the primary
documents, since it's working. :)

Thanks, Yonik. I appreciate you taking a look at this.

Phil


Re: Solr JOIN: keeping permission data out of primary documents

Posted by Yonik Seeley <yo...@heliosearch.com>.
On Tue, Nov 18, 2014 at 3:47 PM, Philip Durbin
<ph...@harvard.edu> wrote:
> Solr JOINs are a way to enforce simple document security, as explained
> by Yonik Seeley at
> http://lucene.472066.n3.nabble.com/document-level-security-filter-solution-for-Solr-tp4126992p4126994.html
>
> I'm trying to tweak this pattern so that I don't have to keep the
> security information in each of my primary Solr documents.
>
> I just posted the gist at
> https://gist.github.com/pdurbin/4d27fea7b431ef3bf4f9 as an example of
> my working Solr JOIN based on data in `before.json`. Permissions per
> user are embedded in the primary documents like this:
>
>     {
>         "id": "dataset_3",
>         "perms_ss": [
>             "alice",
>             "bob"
>         ]
>     },
>     {
>         "id": "dataset_4",
>         "perms_ss": [
>             "alice",
>             "bob",
>             "public"
>         ]
>     },
>
> User documents have been created to do the JOIN on:
>
>     {
>         "id": "alice",
>         "groups_s": "alice"
>     },
>
> The JOIN looks like this:
>
> {!join+from=groups_s+to=perms_ss}id:public+OR+{!join+from=groups_s+to=perms_ss}id:alice

It would probably be faster written as a single join:
fq={!join+from=groups_s+to=perms_ss}id:(public alice)
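
A small Python model (not Solr code) suggests why the single join returns the same documents as the two OR'ed joins: joining from the union of the matched user documents yields the union of the individual join results. The helper names are hypothetical, and the "public" user document is assumed to mirror the "alice" one. The sketch says nothing about speed.

```python
# Toy check that one join over id:(public alice) equals two OR'ed joins.

def as_list(value):
    """Treat a missing field as empty and a single value as a one-item list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def join_ids(docs, from_field, to_field, matches):
    """IDs of docs whose to_field shares a value with from_field of matching docs."""
    keys = {v for d in docs if matches(d) for v in as_list(d.get(from_field))}
    return {d["id"] for d in docs if keys & set(as_list(d.get(to_field)))}


docs = [
    {"id": "dataset_3", "perms_ss": ["alice", "bob"]},
    {"id": "dataset_4", "perms_ss": ["alice", "bob", "public"]},
    {"id": "public", "groups_s": "public"},  # assumed shape
    {"id": "alice", "groups_s": "alice"},
]

# Two joins combined with OR, as in the original query.
two_joins = (join_ids(docs, "groups_s", "perms_ss", lambda d: d["id"] == "public")
             | join_ids(docs, "groups_s", "perms_ss", lambda d: d["id"] == "alice"))

# One join whose inner query matches either id.
one_join = join_ids(docs, "groups_s", "perms_ss",
                    lambda d: d["id"] in {"public", "alice"})

print(one_join == two_joins)  # True
print(sorted(one_join))       # ['dataset_3', 'dataset_4']
```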

Or, if you're using Heliosearch you could cache the filters separately
for better hit rates on commonly used perms via the "filter" keyword:
fq=filter({!join+from=groups_s+to=perms_ss}id:public) OR
filter({!join+from=groups_s+to=perms_ss}id:alice)

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data